1.背景知识
众所周知,limma可以做基因表达芯片和转录组数据的差异分析,可以方便的得处两组之间有表达量差别的基因。多分组差异分析其实也是两两差异分析的批量做法。
有的实验设计相对复杂一点,比如今天使用的例子:
实验设计包括了两种分组,一个是siC和sip53,一个是NT与TNF。
那么常规的两组之间的差异分析也就不能同时分析这两种实验设计啦!
这时候我们就需要用到双因素差异分析。
1.整理表达矩阵
rm(list = ls())
library(tinyarray)
library(tidyverse)
library(limma)
gse = "GSE53153"
a = geo_download(gse,destdir=tempdir(),colon_remove = T)
find_anno(a$gpl)
library(illuminaHumanv4.db);ids <- toTable(illuminaHumanv4SYMBOL)
eset = trans_array(log2(a$exp+1),ids)
eset[1:4,1:4]
## GSM1283060 GSM1283061 GSM1283062 GSM1283063
## EEF1A1 16.125688 16.097172 16.035905 15.996306
## GAPDH 15.125034 15.114759 15.188604 15.102693
## SLC35E2A 7.980711 8.234099 7.651769 8.134426
## CLXN 7.634448 7.811856 7.692790 7.792465
2.整理实验设计
#双因素分析
targets = data.frame(mutate = rep(c("siC","sip53"),each = 6),
induce = rep(c("untreat", "treat"), each = 3, times = 2))
targets
## mutate induce
## 1 siC untreat
## 2 siC untreat
## 3 siC untreat
## 4 siC treat
## 5 siC treat
## 6 siC treat
## 7 sip53 untreat
## 8 sip53 untreat
## 9 sip53 untreat
## 10 sip53 treat
## 11 sip53 treat
## 12 sip53 treat
把这两种分组合并到一起
TS <- paste(targets$mutate, targets$induce, sep=".")
TS <- factor(TS, levels=c("siC.untreat","siC.treat","sip53.untreat","sip53.treat"))
TS
## [1] siC.untreat siC.untreat siC.untreat siC.treat siC.treat
## [6] siC.treat sip53.untreat sip53.untreat sip53.untreat sip53.treat
## [11] sip53.treat sip53.treat
## Levels: siC.untreat siC.treat sip53.untreat sip53.treat
3.做差异分析
这里同时做了三次差异分析:
哪些基因对siC组的treat有反应,即siC组的treat-untreat
哪些基因对sip53组的treat有反应,即sip53组的treat-untreat
与siC组相比,哪些基因在sip53组中的反应不同,即上面两组差异分析所得的logFC再比较!
前两次差异分析其实相当于取子集+二分组差异分析。第三次是双因素差异分析咯。
design <- model.matrix(~0+TS)
colnames(design) <- levels(TS)
fit <- lmFit(eset, design)
cont.matrix <- makeContrasts(
w = siC.treat-siC.untreat,
m = sip53.treat-sip53.untreat,
diff = (sip53.treat-sip53.untreat)-(siC.treat-siC.untreat),
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
deg1 = topTable(fit2,number = Inf,coef = 1)
deg2 = topTable(fit2,number = Inf,coef = 2)
deg3 = topTable(fit2,number = Inf,coef = 3)
head(deg3)
## logFC AveExpr t P.Value adj.P.Val B
## MMP9 -1.889600 8.014973 -17.520398 8.191150e-12 1.712278e-07 13.259648
## CXCL10 -1.598374 9.294062 -12.550379 1.166271e-09 1.218986e-05 10.453810
## PI3 1.458585 8.562082 10.029716 2.812942e-08 1.960058e-04 8.248613
## EBI3 -1.382219 8.596838 -9.797690 3.885369e-08 2.030494e-04 8.008927
## LTB -1.412194 8.331969 -8.950493 1.327325e-07 5.549282e-04 7.072795
## TNIP1 -1.048539 12.778876 -8.828485 1.594793e-07 5.556259e-04 6.929717
4.差异基因
get_diff_gene = function(deg,logFC_t = 1,p_t = 0.05){
k1 = (deg$P.Value < p_t)&(deg$logFC < -logFC_t)
k2 = (deg$P.Value < p_t)&(deg$logFC > logFC_t)
library(dplyr)
deg = mutate(deg,change = ifelse(k1,"down",ifelse(k2,"up","stable")))
table(deg$change)
return(rownames(deg)[deg$change!="stable"])
}
cg1 = get_diff_gene(deg1);length(cg1)
## [1] 121
cg2 = get_diff_gene(deg2);length(cg2)
## [1] 109
cg3 = get_diff_gene(deg3);length(cg3)
## [1] 14
dat = list(siC = cg1,
sip53 = cg2,
diff = cg3)
draw_venn(dat,"")
ggsave("v.png")
dat = data.frame(t(eset[cg3,]),
targets,
TS,check.names = F)
library(ggplot2)
ps = lapply(cg3, function(g){
ggplot(dat,aes_string(x = "TS",
y = paste0("`",g,"`"),
fill = "induce"))+
geom_boxplot()+
theme_bw()+
theme(legend.position = "top")
})
library(patchwork)
wrap_plots(ps[1:4],ncol = 2)
5.logFC 是如何计算的
TS
## [1] siC.untreat siC.untreat siC.untreat siC.treat siC.treat
## [6] siC.treat sip53.untreat sip53.untreat sip53.untreat sip53.treat
## [11] sip53.treat sip53.treat
## Levels: siC.untreat siC.treat sip53.untreat sip53.treat
x = eset["MMP9",]
(mean(x[10:12])-mean(x[7:9]))-(mean(x[4:6])-mean(x[1:3]))
## [1] -1.8896
deg3$logFC[1]
## [1] -1.8896
所以也就是计算出了两组之间的logFC之差。
代码参考自limma帮助文档第九章。