使用Gfold进行无重复RNAseq数据基因差异分析
1. 准备资料
输入的资料为read_count,要对两个样本进行差异分析则必需准备两个文件,每个文件格式如下:
文件1:Sample2.read_cnt
hsa-miR-146a-3p 20 48 20 20
novel_mir217 45 20 20 20
hsa-miR-548o-3p 20 40 20 20
hsa-miR-378f 20 17703 20 20
novel_mir857 20 38 20 20
novel_mir1083 20 37 20 20
novel_mir790 20 36 20 20
novel_mir20 20 31 20 20
novel_mir1517 20 25 20 20
文件2:Sample1.read_cnt
hsa-miR-146a-3p 20 0 20 20
novel_mir217 20 0 20 20
hsa-miR-548o-3p 20 0 20 20
hsa-miR-378f 20 16 20 20
novel_mir857 20 0 20 20
novel_mir1083 20 0 20 20
novel_mir790 20 0 20 20
novel_mir20 20 0 20 20
novel_mir1517 20 0 20 20
在不同版本中输入文件的要求不同,一般第一列表头为GeneName,必需要有5列(如果只想计算gfold值,就只需要保证 Read Count所属的那列正确,其他列可以随意编造,如需要计算RPKM值则Gene exon length所属的那一列不能随意编造)程序才会正确运行,并且列分隔符必须为tab
键('\t')
在2015-05-24更新的V1.1.4版本中,列依次为GeneSymbol、GeneName、Read Count、Gene exon length、RPKM
R语言提取得到文件1和文件2代码:
注意两个文件的第一列必须完全相同
# raw data input
file = "H358_Vector.gene.FPKM.xls"
raw_data_Vec <- read.delim(file, stringsAsFactors=FALSE)
file <- "H358_LSH.gene.FPKM.xls"
raw_data_LSH <- read.delim(file, stringsAsFactors=FALSE)
## merge data into one table
data_merged <- merge(raw_data_Vec[, c(1, 4)], raw_data_LSH[, c(1, 4)], by.x=1, by.y=1, all=TRUE)
data_merged[is.na(data_merged)] <- 0
## prepare the input data for gfold
data1 <- data.frame(data_merged[, c(1, 1, 2, 1, 2)])
data2 <- data.frame(data_merged[, c(1, 1, 3, 1, 3)])
colnames(data1) <- c("GeneSymbol", "GeneName1", "Read Count1", "Gene exon length1", "RPKM1")
colnames(data2) <- c("GeneSymbol", "GeneName2", "Read Count2", "Gene exon length2", "RPKM2")
write.table(data1, file="Sample1Vec.read_cnt", row.names=F, col.names=F, quote=F, sep="\t")
write.table(data2, file="Sample2LSH.read_cnt", row.names=F, col.names=F, quote=F, sep="\t")
## gfold diff -s1 Sample1Vec.read_cnt -s2 Sample2LSH.read_cnt -o Sample1VSSample2.diff
2. 软件安装(只能在linux上使用)
2.1 安装 gsl-1.15
# /mnt/e/win_linux/gsl 替换为自己的安装目录
cd /mnt/e/win_linux/gsl
vi ~/.bashrc
export CXXFLAGS="-g -O3 -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib"
export LD_LIBRARY_PATH="/mnt/e/win_linux/gsl/lib:"$LD_LIBRARY_PATH
source ~/.bashrc
wget ftp://ftp.gnu.org/gnu/gsl/gsl-2.2.tar.gz
tar zxf gsl-1.15.tar.gz
cd gsl-1.15
./configure --prefix=/mnt/e/win_linux/gsl
make
make install
2.2 安装gfold
cd ..
## 在这里https://bitbucket.org/feeldead/gfold/downloads/?tab=tags选择一个版本下载,
## 注意不要选择最上面tag为tip的(安装不了)
wget https://bitbucket.org/feeldead/gfold/get/V1.1.4.zip
unzip V1.1.4.zip
cd feeldead-gfold-1921fd6dc668/
make
## 报错
## g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas
## In file included from GeneInfo.hpp:29:0,
## from main.cc:24:
## Utility.hpp:69:36: fatal error: gsl/gsl_statistics_int.h: No such file or directory
## compilation terminated.
## Makefile:11: recipe for target 'program' failed
## make: *** [program] Error 1
## 运行以下命令
export CXXFLAGS="-g -O3 -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib"
export LD_LIBRARY_PATH="/mnt/e/win_linux/gsl/lib:"$LD_LIBRARY_PATH
g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib
## 虽然显示warning 但是已经安装好了,有可执行文件gfold了
./gfold -h
## ===============================================================================
## gfold : Generalized fold change for ranking differentially expressed
## genes from RNA-seq data.
##
## Author : Jianxing Feng (jianxing.tongji@gmail.com)
## Date : Sun May 24 07:42:36 CST 2015
## Version : V1.1.4
## ===============================================================================
##
## USAGE: Please go to the source code directory and use command 'man doc/gfold.man'
## or open doc/gfold.html to find documentation.
##
## Quick Examples:
## gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt
## gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt
## gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff
rm -rf *.zip
mv feeldead-gfold-1921fd6dc668/ gfold.V1.1.4
echo "alias gfold='/mnt/e/win_linux/gfold.V1.1.4/gfold'" >> ~/.profile
source ~/.profile
gfold -h
3. 运行gfold筛选差异基因
# s1为样本1,s2为样本2,o为output,suf为后缀名(省略则默认表示为空)
gfold diff -s1 Sample1.read_cnt -s2 Sample2.read_cnt -o Sample1VSSample2.diff
## 与该命令等价的命令:
## gfold diff -s1 Sample1 -s2 Sample2 -suf .read_cnt -o Sample1VSSample2.diff
## -VL1 Loading read count for each gene from file Vector.read_cnt ...
## -VL1 1935 genes have been scanned
## -VL1 Loading read count for each gene from file LSH.read_cnt ...
## -VL1 1935 genes have been scanned
## -VL1 Calculating normalization constant ...
## -VL1 Normalization constant is:
## Vector.read_cnt 2.3238e+07 1.04573
## LSH.read_cnt 2.80085e+07 1
## -VL1 Calculate RPKM ...
## -VL1 Calculate accurate GFOLD value ...
## 等待一段时间
## Job diff is DONE!
4. 结果解释
# This file is generated by gfold V1.1.4 on Fri Jan 19 21:19:05 2018
# Normalization constants :
# Sample1N.read_cnt 6802895 1.35542
# Sample2N.read_cnt 12483845 1
# The GFOLD value could be considered as a reliable log2 fold change.
# It is positive/negative if the gene is up/down regulated.
# A gene with zero GFOLD value should never be considered as
# differentially expressed. For a comprehensive description of
# GFOLD, please refer to the manual.
#GeneSymbol GeneName GFOLD(0.01) E-FDR log2fdc 1stRPKM 2ndRPKM
novel_mir1585 NA 6.45001 1 9.19028 0 3444.45
hsa-miR-4664-5p NA 5.70336 1 8.44996 0 2058.66
novel_mir252 NA 4.13781 1 6.91447 0 704.911
novel_mir1060 NA 3.64411 1 6.43874 0 504.652
4.1 GFOLD
A gene with zero GFOLD value should never be considered as differentially expressed.
The GFOLD value could be considered as a reliable log2 fold change.
It is positive/negative if the gene is up/down regulated.
注:如果GFOLD大于0说明Sample2的表达高于sample1,基因上调(表示为Sample1 VS Sample2)
4.2 E-FDR
Empirical FDR based on replicates. It is always 1 when no replicates are available
因此无重复样本的经验FDR均为1
4.3 log2fdc
log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.
-acc参数在无重复样本时使用,表示计算Gfold的校正方法,如果-acc值为T(默认值为T),表示根据测序深度进行校正(熟读较慢),如果-acc值为F,表示使用马尔科夫蒙特卡洛方法,-acc参数仅在diff功能下指定。
log2fdc表示log2(fold change),该值默认是经校正的count值改变倍数取log2
4.3 RPKM
由于我输入的Gene exon length是假的,因此最后两列没有意义
参考:
https://kevinzjy.github.io/2017/05/20/Bioinfo-Differential-Expression/
http://www.biotrainee.com/thread-752-1-1.html
http://www.chenlianfu.com/?p=1085
https://bitbucket.org/feeldead/gfold
https://web.archive.org/web/20161125133249/http://compbio.tongji.edu.cn:80/~fengjx/GFOLD/gfold.html