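Learning goals: the RNA-seq data SRR3589956.sra to SRR3589962.sra were downloaded earlier. This time we decompress them with sratoolkit.2.6.3, inspect the FASTQ format, check data quality with fastqc, visualize the data in IGV, and learn how to run these steps in batch.
References: http://www.biotrainee.com/thread-1831-1-1.html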
http://fbb84b26.wiz03.com/share/s/3XK4IC0cm4CL22pU-r1HPcQQ2irG2836uQYm2iZAyh1Zwf3_
1. Using sratoolkit
Run fastq-dump -h to view the help.
fastq-dump [options] <path> [<path>...] # basic usage
Common options:
INPUT
-A|--accession <accession> Replaces accession derived from <path> in
filename(s) and deflines (only for single
table dump)
--table <table-name> Table name within cSRA object, default is
"SEQUENCE"
OUTPUT
-O|--outdir <path> Output directory, default is working
directory '.' )
-Z|--stdout Output to stdout, all split data become
joined into single stream
--gzip Compress output using gzip # fastqc can read gzip-compressed files directly
--bzip2 Compress output using bzip2 # higher compression ratio than gzip, but slower
Multiple File Options Setting these options will produce more
than 1 file, each of which will be suffixed
according to splitting criteria.
--split-files Dump each read into separate file. Files
will receive suffix corresponding to read
number
--split-3 Legacy 3-file splitting for mate-pairs:
First biological reads satisfying dumping
conditions are placed in files *_1.fastq and
*_2.fastq. If only one biological read is
present it is placed in *.fastq. Biological
reads 3 and above are ignored.
Batch decompression:
for i in `seq 56 62`
do
/opt/NfsDir/BioDir/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 -O /opt/NfsDir/UserDir/qin/qin/Data/RNAseq/ -A SRR35899${i}.sra
done
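Before launching the real loop, it can help to preview the commands it will generate. The following is a minimal dry-run sketch that only prints each command instead of running it. The function name `preview_dump_commands` and the shortened `OUTDIR` value are illustrative assumptions, not part of the original setup.

```shell
#!/bin/sh
# Dry run: print the fastq-dump command for each accession without executing it.
# OUTDIR is a placeholder (assumption); point it at your own output directory.
OUTDIR="./RNAseq"

preview_dump_commands() {
    for i in $(seq 56 62); do
        # SRR35899 followed by 56..62 gives SRR3589956 .. SRR3589962
        echo "fastq-dump --gzip --split-3 -O ${OUTDIR} -A SRR35899${i}.sra"
    done
}

preview_dump_commands
```

Seven lines should be printed, one per accession; if the names look right, the same loop can be run with the real command.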
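Several command-line tools can read gzip-compressed files directly, without unpacking them first, such as zgrep, zcat, zless and zdiff. For example: zcat SRR3589956_1.fastq.gz | head -n 4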
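The same streaming idea can be used to count the reads in a compressed file. This is a small sketch that assumes the usual 4-lines-per-record FASTQ layout; the function name `count_reads` is hypothetical, and `gzip -dc` is used as a portable equivalent of zcat.

```shell
#!/bin/sh
# Count reads in a gzip-compressed FASTQ file without unpacking it to disk.
# Assumes the standard 4-lines-per-record layout.
count_reads() {
    # gzip -dc streams the decompressed content to stdout (same as zcat)
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example call, guarded so the snippet is harmless if the file is absent
if [ -f SRR3589956_1.fastq.gz ]; then
    count_reads SRR3589956_1.fastq.gz
fi
```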
2. Batch checking of sequencing quality with fastqc
Reference: http://www.biotrainee.com/thread-324-1-1.html
Format: each sequence in a FASTQ file usually takes 4 lines:
@DJB775P1:248:D0MDGACXX:7:1202:12362:49613 1:Y:18:ATCACG # Line 1: header line starting with @. The fields are: instrument name / run id / flowcell id / flowcell lane / tile number within the flowcell lane / 'x'-coordinate of the cluster within the tile / 'y'-coordinate of the cluster within the tile / the member of a pair, 1 or 2 / Y if the read is filtered, N otherwise / 0 when none of the control bits are on, otherwise it is an even number / index sequence
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA # Line 2: the sequence
+ # Line 3: separator line starting with +
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA # Line 4: base qualities, encoded as Phred+33
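To make the Phred+33 encoding concrete: each quality character stands for the score equal to its ASCII code minus 33. The sketch below decodes a quality string with awk; the function name `phred33_scores` is hypothetical and the lookup table covers printable ASCII only.

```shell
#!/bin/sh
# Convert a Phred+33 quality string into numeric scores (ASCII code minus 33).
phred33_scores() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            # Build a character to ASCII code lookup for printable characters
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        {
            out = ""
            for (j = 1; j <= length($0); j++) {
                q = ord[substr($0, j, 1)] - 33
                sep = (j > 1) ? " " : ""
                out = out sep q
            }
            print out
        }'
}

# First five quality characters of the example read above
phred33_scores 'JJJJJ'   # prints: 41 41 41 41 41
```

In the example read, J therefore means Q41, D means Q35, and the single ? means Q30.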
fastqc usage:
fastqc SRR3589956_1.fastq.gz
fastqc seqfile1 seqfile2 .. seqfileN
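Common options:
-o: output directory
--extract: whether to unzip the output automatically; the default is --noextract
-t: number of threads; depends on your machine, and each thread needs 250 MB of memory
-c: contaminants file; sequencing data may be contaminated, for example by material from other species
-a: adapters file
-q: quiet mode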
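As a worked example of the options above, the sketch below estimates the memory needed for a chosen thread count (threads times 250 MB) and prints a fastqc command that combines -o, -t and -q. It is a dry run only; `fastqc_mem_mb`, `THREADS` and the `OUTDIR` value are illustrative assumptions.

```shell
#!/bin/sh
# Estimate FastQC memory for a given thread count (250 MB per thread, as noted above).
fastqc_mem_mb() {
    echo $(( $1 * 250 ))
}

THREADS=4
OUTDIR="./fastqc_out"   # placeholder output directory (assumption)

echo "Estimated memory for ${THREADS} threads: $(fastqc_mem_mb "$THREADS") MB"
# Dry run: print the command rather than executing it.
# fastqc expects the output directory to exist already, hence the mkdir.
echo "mkdir -p ${OUTDIR} && fastqc -o ${OUTDIR} -t ${THREADS} -q SRR3589956_1.fastq.gz"
```

With 4 threads the estimate is 1000 MB, so the thread count should be chosen with the available RAM in mind.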
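Each input produces two output files (an HTML report and a zip archive). Looking at the QC result for SRR3589956: why is a chunk missing in the middle?
Viewing batch QC results with MultiQC
# First generate the QC results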
ls *gz | while read id; do /opt/NfsDir/BioDir/fastqc/FastQC/fastqc -t 4 $id; done
# multiqc
multiqc *fastqc.zip --pdf
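For a quick text-only overview next to the MultiQC report, the per-sample summary.txt files can be scanned for failed modules. This is a sketch that assumes the reports have been extracted (for example with --extract), so that each `*_fastqc` directory contains a tab-separated summary.txt (status, module, file name); the function name `list_failed_modules` is hypothetical.

```shell
#!/bin/sh
# List the modules that FastQC flagged as FAIL, reading extracted summary.txt files.
# Each summary.txt line is: STATUS <tab> module name <tab> input file name.
list_failed_modules() {
    # $1: directory that contains the extracted *_fastqc folders
    for f in "$1"/*_fastqc/summary.txt; do
        [ -f "$f" ] || continue
        awk -F '\t' '$1 == "FAIL" { print $3 ": " $2 }' "$f"
    done
}

list_failed_modules .
```

Nothing is printed when no extracted report is found or when no module failed.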