clusterProfiler 是业界大神Y叔写的一个R包,可以用来做各种富集分析,如GO、KEGG、DO(Disease Ontology analysis)、Reactome pathway analysis以及GSEA富集分析等。而除了富集分析,他还可以非常方便的对富集分析结果进行可视化。
这里使用clusterProfiler进行GO、KEGG以及GSEA富集分析。
1、安装clusterProfiler
安装clusterProfiler:
> source("http://bioconductor.org/biocLite.R")
> biocLite('clusterProfiler')
2、ID转换
对于没有转换的gene ID,clusterProfiler也提供了bitr
方法进行转换ID:
Usage:
bitr(geneID, fromType, toType, OrgDb, drop = TRUE)
Arguments
geneID input gene id
fromType input id type
toType output id type
OrgDb annotation db
drop drop NA or not
# example:
> x <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2",
"ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3","TOB1", "MITF", "NDRG1",
"NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1",
"COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC",
"ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT",
"MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK",
"STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B")
> eg <- bitr(x, fromType="SYMBOL", toType=c("ENTREZID","ENSEMBL"), OrgDb="org.Hs.eg.db"); head(eg)
'select()' returned 1:many mapping between keys and columns
SYMBOL ENTREZID ENSEMBL
1 GPX3 2878 ENSG00000211445
2 GLRX 2745 ENSG00000173221
3 LBP 3929 ENSG00000129988
4 CRYAB 1410 ENSG00000109846
5 DEFB1 1672 ENSG00000164825
6 HCLS1 3059 ENSG00000180353
可以看到,这里转换ID的对应文件来源于"org.Hs.eg.db"这个包。
3、GO、KEGG富集分析
3.1 GO富集分析
在开始富集分析之前先看看GO和KEGG富集分析的方法以及参数:
enrichGO GO Enrichment Analysis of a gene set. Given a vector of genes, this
function will return the enrichment GO categories after FDR control.
Usage:
enrichGO(gene, OrgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05,
pAdjustMethod = "BH", universe, qvalueCutoff = 0.2, minGSSize = 10,
maxGSSize = 500, readable = FALSE, pool = FALSE)
Arguments:
gene a vector of entrez gene id.
OrgDb OrgDb
keyType keytype of input gene
ont One of "MF", "BP", and "CC" subontologies or 'ALL'.
pvalueCutoff Cutoff value of pvalue.
pAdjustMethod one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
universe background genes
qvalueCutoff qvalue cutoff
minGSSize minimal size of genes annotated by Ontology term for testing.
maxGSSize maximal size of genes annotated for testing
readable whether mapping gene ID to gene Name
pool If ont=’ALL’, whether pool 3 GO sub-ontologies
导入数据,这是一个整合数据,在这里我们要用到的只是entrez ID列和最后一列(logFC):
> library(clusterProfiler)
> degenes <- read.csv('D:/TCGA/microarray_analysis/intersectgenes_logFC_broad.txt',header = T,stringsAsFactors = F,sep = '\t')
> head(degenes)
Symbol GSM450153 GSM450154 GSM450155 ID P.Value Q.Value adj.P.Val RefSeq.ID Entrez.ID Fold.Change
1 AIM1 -0.7155 -0.8391 -2.3808 ENSG00000112297 4.518321e-02 0.287315306 0.287315306 NA 202 2.555354
2 AK5 -1.9269 -0.6967 -0.5628 ENSG00000154027 7.907520e-05 0.005415775 0.005415775 NA 26289 1.350051
3 ANXA3 4.2110 0.8687 -0.1016 ENSG00000138772 3.122002e-02 0.229588586 0.229588586 NA 306 -2.328736
4 ARHGAP15 -0.0725 -1.5821 -2.0469 ENSG00000075884 6.321948e-05 0.004553170 0.004553170 NA 55843 5.064183
5 ASGR2 1.5563 1.4054 1.2066 ENSG00000161944 1.474010e-02 0.146976591 0.146976591 NA 433 2.061535
6 ATM -1.4344 -0.4961 -1.9324 ENSG00000149311 1.403235e-02 0.142613364 0.142613364 NA 472 2.313447
> genelist <- degenes$Entrez.ID
# 检查是否有重复
> genelist[duplicated(genelist)]
integer(0)
由于clusterProfiler富集分析推荐的输入文件是Entrez ID,因此这里提取的是Entrez ID,接下来就可以进行富集分析了:
> go <- enrichGO(genelist, OrgDb = org.Hs.eg.db, ont='ALL',pAdjustMethod = 'BH',pvalueCutoff = 0.05,
qvalueCutoff = 0.2,keyType = 'ENTREZID')
> head(go)
ONTOLOGY ID Description GeneRatio BgRatio pvalue
GO:1903039 BP GO:1903039 positive regulation of leukocyte cell-cell adhesion 20/121 226/17381 9.074966e-17
GO:0050870 BP GO:0050870 positive regulation of T cell activation 19/121 215/17381 5.824805e-16
GO:0001819 BP GO:0001819 positive regulation of cytokine production 24/121 409/17381 7.118707e-16
GO:0007159 BP GO:0007159 leukocyte cell-cell adhesion 23/121 374/17381 1.118047e-15
GO:0022409 BP GO:0022409 positive regulation of cell-cell adhesion 20/121 259/17381 1.289708e-15
GO:1903037 BP GO:1903037 regulation of leukocyte cell-cell adhesion 22/121 343/17381 2.095005e-15
p.adjust qvalue
GO:1903039 2.474743e-13 1.898101e-13
GO:0050870 6.470905e-13 4.963113e-13
GO:0001819 6.470905e-13 4.963113e-13
GO:0007159 7.034066e-13 5.395051e-13
GO:0022409 7.034066e-13 5.395051e-13
GO:1903037 9.521797e-13 7.303114e-13
geneID
GO:1903039 84433/6352/1236/915/916/959/972/9308/3113/3117/3119/3122/3458/3565/22914/3932/8013/6504/7494/6375
GO:0050870 84433/6352/1236/915/916/959/972/9308/3113/3117/3119/3122/3458/3565/22914/3932/6504/7494/6375
GO:0001819 330/84433/834/1236/914/916/959/972/9308/56253/2205/2597/3113/8692/64135/3458/3565/22914/27035/8013/6504/7100/7494/6375
GO:0007159 84433/6352/1236/914/915/916/959/972/9308/2213/3113/3117/3119/3122/3458/3565/22914/3932/8013/6280/6504/7494/6375
GO:0022409 84433/6352/1236/915/916/959/972/9308/3113/3117/3119/3122/3458/3565/22914/3932/8013/6504/7494/6375
GO:1903037 84433/6352/1236/914/915/916/959/972/9308/2213/3113/3117/3119/3122/3458/3565/22914/3932/8013/6504/7494/6375
Count
GO:1903039 20
GO:0050870 19
GO:0001819 24
GO:0007159 23
GO:0022409 20
GO:1903037 22
> dim(go)
[1] 524 10
> dim(go[go$ONTOLOGY=='BP',])
[1] 482 10
> dim(go[go$ONTOLOGY=='CC',])
[1] 26 10
> dim(go[go$ONTOLOGY=='MF',])
[1] 16 10
# 看来这些差异基因主要还是富集到BP中了
# 进行简单的可视化
> barplot(go,showCategory=20,drop=T)
> dotplot(go,showCategory=50)
# 还可以绘制GO的网络关系图,但是值得注意的是这里的数据只能是富集一个GO通路(BP、CC或MF)的数据
> go.BP <- enrichGO(genelist, OrgDb = org.Hs.eg.db, ont='BP',pAdjustMethod = 'BH',pvalueCutoff = 0.05, qvalueCutoff = 0.2,keyType = 'ENTREZID')
> plotGOgraph(go.BP)
3.2 KEGG通路富集
KEGG通路富集函数用法与GO富集分析方法类似:
enrichKEGG KEGG Enrichment Analysis of a gene set. Given a vector of genes, this function will return
the enrichment KEGG categories with FDR control.
Usage:
enrichKEGG(gene, organism = "hsa", keyType = "kegg", pvalueCutoff = 0.05,
pAdjustMethod = "BH", universe, minGSSize = 10, maxGSSize = 500,
qvalueCutoff = 0.2, use_internal_data = FALSE)
Arguments:
gene a vector of entrez gene id.
organism supported organism listed in ’http://www.genome.jp/kegg/catalog/org_list.html’
keyType one of "kegg", ’ncbi-geneid’, ’ncib-proteinid’ and ’uniprot’
pvalueCutoff Cutoff value of pvalue.
pAdjustMethod one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
universe background genes
minGSSize minimal size of genes annotated by Ontology term for testing.
maxGSSize maximal size of genes annotated for testing
qvalueCutoff qvalue cutoff
use_internal_data logical, use KEGG.db or latest online KEGG data
我们继续使用上面的数据进行KEGG富集分析:
> kegg <- enrichKEGG(genelist, organism = 'hsa', keyType = 'kegg', pvalueCutoff = 0.05,pAdjustMethod = 'BH',
minGSSize = 10,maxGSSize = 500,qvalueCutoff = 0.2,use_internal_data = FALSE)
> head(kegg)
ID Description GeneRatio BgRatio pvalue p.adjust qvalue
hsa04658 hsa04658 Th1 and Th2 cell differentiation 11/76 92/7404 1.750537e-09 2.765849e-07 2.026938e-07
hsa05310 hsa05310 Asthma 7/76 31/7404 1.956862e-08 1.433518e-06 1.050546e-06
hsa04672 hsa04672 Intestinal immune network for IgA production 8/76 49/7404 2.721869e-08 1.433518e-06 1.050546e-06
hsa05168 hsa05168 Herpes simplex infection 13/76 185/7404 3.744992e-08 1.479272e-06 1.084077e-06
hsa05140 hsa05140 Leishmaniasis 9/76 74/7404 5.040165e-08 1.592692e-06 1.167196e-06
hsa05330 hsa05330 Allograft rejection 7/76 38/7404 8.868232e-08 2.335301e-06 1.711413e-06
geneID Count
hsa04658 915/916/2353/3113/3117/3119/3122/3458/3565/3932/6775 11
hsa05310 959/2205/3113/3117/3119/3122/3565 7
hsa04672 959/7852/3113/3117/3119/3122/3565/608 8
hsa05168 6352/972/54205/1936/2353/3113/3117/3119/3122/64135/3434/3458/5187 13
hsa05140 2209/2212/2353/3113/3117/3119/3122/3458/3565 9
hsa05330 959/3113/3117/3119/3122/3458/3565 7
> dim(kegg)
[1] 30 9
# 简单可视化
> dotplot(kegg, showCategory=30)
4、GSEA富集分析
clusterProfiler provides
enricher
function for hypergeometric test andGSEA
function for gene set enrichment analysis that are designed to accept user defined annotation.
这里使用clusterProfiler里面的GSEA
函数进行GSEA富集分析,并与使用超几何分布富集(enricher
函数)的结果进行简单比较,enricher
函数与GSEA
函数用法基本相同,因此这里只给出GSEA
的用法及参数。
GSEA a universal gene set enrichment analysis tools
Usage:
GSEA(geneList, exponent = 1, nPerm = 1000, minGSSize = 10,
maxGSSize = 500, pvalueCutoff = 0.05, pAdjustMethod = "BH", TERM2GENE,
TERM2NAME = NA, verbose = TRUE, seed = FALSE, by = "fgsea")
Arguments:
geneList order ranked geneList
exponent weight of each step
nPerm number of permutations
minGSSize minimal size of each geneSet for analyzing
maxGSSize maximal size of genes annotated for testing
pvalueCutoff pvalue cutoff
pAdjustMethod p value adjustment method
TERM2GENE user input annotation of TERM TO GENE mapping, a data.frame of 2 column with term and gene
TERM2NAME user input of TERM TO NAME mapping, a data.frame of 2 column with term and name
verbose logical
seed logical
by one of ’fgsea’ or ’DOSE’
在进行富集分析之前需要对数据做一个预处理——排序。
> library(dplyr)
> geneList <- select(degenes, Entrez.ID, Fold.Change); head(geneList)
Entrez.ID Fold.Change
1 202 2.555354
2 26289 1.350051
3 306 -2.328736
4 55843 5.064183
5 433 2.061535
6 472 2.313447
> geneList.sort <- arrange(geneList, desc(Fold.Change)); head(geneList.sort)
Entrez.ID Fold.Change
1 3512 36.47332
2 3117 35.58685
3 3113 24.10151
4 916 14.89763
5 5996 14.67417
6 3119 11.12233
> gene <- geneList.sort$Entrez.ID
这里使用的是broad GSEA提供的gene sets 来提供TERM2GENE:
> gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", package="clusterProfiler")
> c5 <- read.gmt(gmtfile)
> head(c5)
ont gene
1 NUCLEOPLASM 3190
2 NUCLEOPLASM 2547
3 NUCLEOPLASM 26173
4 NUCLEOPLASM 9439
5 NUCLEOPLASM 57508
6 NUCLEOPLASM 6837
万事俱备,只欠东风。现在可以开始分析了,先进行超几何分布的富集分析:
### 先使用基于超几何分布的富集分析
> enrich <- enricher(gene, TERM2GENE=c5); head(enrich)
ID Description GeneRatio BgRatio pvalue p.adjust
IMMUNOLOGICAL_SYNAPSE IMMUNOLOGICAL_SYNAPSE IMMUNOLOGICAL_SYNAPSE 2/55 11/5270 0.00553797 0.2547466
LYSOSOME LYSOSOME LYSOSOME 3/55 61/5270 0.02528071 0.3336995
LYTIC_VACUOLE LYTIC_VACUOLE LYTIC_VACUOLE 3/55 61/5270 0.02528071 0.3336995
VACUOLE VACUOLE VACUOLE 3/55 69/5270 0.03472762 0.3336995
LIPID_RAFT LIPID_RAFT LIPID_RAFT 2/55 29/5270 0.03627169 0.3336995
PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX 1/55 10/5270 0.09967809 0.7338789
qvalue geneID Count
IMMUNOLOGICAL_SYNAPSE 0.2390071 50852/84433 2
LYSOSOME 0.3130819 3122/9098/8692 3
LYTIC_VACUOLE 0.3130819 3122/9098/8692 3
VACUOLE 0.3130819 3122/9098/8692 3
LIPID_RAFT 0.3130819 3932/84433 2
PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX 0.6885363 54205 1
再做GSEA富集分析,在此之前需要对输入gene list做一下处理,包括三步:
## assume 1st column is ID
## 2nd column is FC
> head(geneList)
Entrez.ID Fold.Change
1 202 2.555354
2 26289 1.350051
3 306 -2.328736
4 55843 5.064183
5 433 2.061535
6 472 2.313447
## feature 1: numeric vector
> glist <- geneList[,2];head(glist)
[1] 2.555354 1.350051 -2.328736 5.064183 2.061535 2.313447
## feature 2: named vector
> names(glist) <- as.character(geneList[,1]);head(glist)
202 26289 306 55843 433 472
2.555354 1.350051 -2.328736 5.064183 2.061535 2.313447
## feature 3: decreasing order
> glist <- sort(glist,decreasing = T); head(glist)
3512 3117 3113 916 5996 3119
36.47332 35.58685 24.10151 14.89763 14.67417 11.12233
输入文件准备好了尽可以进行GSEA富集分析了:
> gsea <- GSEA(glist, TERM2GENE=c5, verbose=FALSE, pvalueCutoff = 0.8); head(gsea)
ID Description setSize enrichmentScore NES pvalue p.adjust qvalues rank leading_edge
MEMBRANE MEMBRANE MEMBRANE 30 0.4525068 1.2643778 0.1896024 0.6880521 0.6880521 29 tags=37%, list=23%, signal=37%
PLASMA_MEMBRANE PLASMA_MEMBRANE PLASMA_MEMBRANE 28 0.4358283 1.2085736 0.2435766 0.6880521 0.6880521 29 tags=36%, list=23%, signal=35%
CYTOPLASMIC_PART CYTOPLASMIC_PART CYTOPLASMIC_PART 12 -0.2809599 -1.1642120 0.2481203 0.6880521 0.6880521 34 tags=67%, list=27%, signal=54%
MEMBRANE_PART MEMBRANE_PART MEMBRANE_PART 27 0.3603900 0.9960600 0.4928131 0.6880521 0.6880521 68 tags=70%, list=53%, signal=42%
PLASMA_MEMBRANE_PART PLASMA_MEMBRANE_PART PLASMA_MEMBRANE_PART 24 0.3593591 0.9832720 0.5092593 0.6880521 0.6880521 68 tags=71%, list=53%, signal=41%
INTEGRAL_TO_MEMBRANE INTEGRAL_TO_MEMBRANE INTEGRAL_TO_MEMBRANE 22 0.3618337 0.9738809 0.5191710 0.6880521 0.6880521 68 tags=73%, list=53%, signal=41%
core_enrichment
MEMBRANE 916/5996/3122/7852/972/969/6402/2205/56253/10225/3738
PLASMA_MEMBRANE 916/5996/3122/7852/969/6402/2205/56253/10225/3738
CYTOPLASMIC_PART 8692/54205/1936/2879/9531/293/23787
MEMBRANE_PART 916/7852/972/969/6402/2205/10225/3738/3932/30061/9308/608/51266/50852/51348/29121/1235/84433/22914
PLASMA_MEMBRANE_PART 916/7852/969/6402/2205/10225/3738/3932/30061/9308/51266/50852/51348/29121/1235/84433/22914
INTEGRAL_TO_MEMBRANE 7852/972/969/6402/2205/10225/3738/30061/9308/608/51266/50852/51348/29121/1235/22914
不要问我为什么要pvalueCutoff设置0.8,因为一直调大到0.8才富集到结果。。。可见数据应该是有问题的,但这里作为一个实践就不管那么多了。
而且,clusterProfiler还支持GSEA的GO、KEGG富集。
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)(Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
> gsea.go <- gseGO(glist,OrgDb = org.Hs.eg.db, pvalueCutoff = 0.5); head(gsea.go)
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
ID Description setSize enrichmentScore NES pvalue p.adjust qvalues rank leading_edge
GO:0031294 GO:0031294 lymphocyte costimulation 11 0.7832645 1.872828 0.001186240 0.1049915 0.09363642 10 tags=45%, list=8%, signal=46%
GO:0031295 GO:0031295 T cell costimulation 11 0.7832645 1.872828 0.001186240 0.1049915 0.09363642 10 tags=45%, list=8%, signal=46%
GO:0002376 GO:0002376 immune system process 74 0.6032615 1.738943 0.002004008 0.1049915 0.09363642 60 tags=64%, list=47%, signal=80%
GO:0042110 GO:0042110 T cell activation 23 0.6722496 1.825647 0.002143623 0.1049915 0.09363642 24 tags=48%, list=19%, signal=47%
GO:0050851 GO:0050851 antigen receptor-mediated signaling pathway 11 0.7582504 1.813018 0.002372479 0.1049915 0.09363642 10 tags=45%, list=8%, signal=46%
GO:0050852 GO:0050852 T cell receptor signaling pathway 10 0.7963700 1.872212 0.002380952 0.1049915 0.09363642 10 tags=50%, list=8%, signal=50%
core_enrichment
GO:0031294 3117/3113/916/3119/3122
GO:0031295 3117/3113/916/3119/3122
GO:0002376 3512/3117/3113/916/5996/3119/914/10964/3122/7852/972/4068/6402/6352/84636/915/2205/51176/4069/2213/56253/10225/55619/4332/2212/6375/7100/3434/8638/3932/2209/51316/30061/9308/64135/8320/10578/3458/10875/608/83666/51266/50852/3437/51348/2353/9111
GO:0042110 3117/3113/916/3119/914/3122/972/6352/915/51176/2213
GO:0050851 3117/3113/916/3119/3122
GO:0050852 3117/3113/916/3119/3122
顺便再利用上面处理好的glist进行一下KEGG富集到的某一条通路的可视化:
> library(pathview)
> pathview(gene.data = glist, pathway.id = 'hsa04658',species="hsa", limit=list(gene=max(abs(glist)), cpd=1))
上面的命令会在当前目录生成3个文件:一个原始KEGG通路图片,一个标注了上下调基因的,最后一个文本文件则是一些KEGG通路信息。
参考:
clusProfiler命令参考手册
GSEA的分析汇总
Statistical analysis and visualization of functional profiles for genes and gene clusters
GSEA分析是个什么鬼?(上)
GSEA是个什么鬼?(下)