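Learning objectives: the RNA-seq data SRR3589956.sra through SRR3589962.sra were downloaded earlier. Here we unpack them with sratoolkit 2.6.3, look at the FASTQ format, check data quality with fastqc, visualize the data in IGV, and learn to run these steps in batch.
References: http://www.biotrainee.com/thread-1831-1-1.html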
http://fbb84b26.wiz03.com/share/s/3XK4IC0cm4CL22pU-r1HPcQQ2irG2836uQYm2iZAyh1Zwf3_

1. Using sratoolkit

Run fastq-dump -h to see the help.

fastq-dump [options] <path> [<path>...] # basic usage

Common options:

INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in 
                                   filename(s) and deflines (only for single 
                                   table dump) 
  --table <table-name>             Table name within cSRA object, default is 
                                   "SEQUENCE" 

OUTPUT
  -O|--outdir <path>               Output directory, default is working 
                                   directory '.' ) 
  -Z|--stdout                      Output to stdout, all split data become 
                                   joined into single stream 
  --gzip                           Compress output using gzip  #fastqc can read gzip-compressed files directly
  --bzip2                          Compress output using bzip2  #higher compression ratio than gzip, but slower

Multiple File Options              Setting these options will produce more
                                     than 1 file, each of which will be suffixed
                                     according to splitting criteria.
  --split-files                    Dump each read into separate file.Files 
                                   will receive suffix corresponding to read 
                                   number 
  --split-3                        Legacy 3-file splitting for mate-pairs: 
                                   First biological reads satisfying dumping 
                                   conditions are placed in files *_1.fastq and 
                                   *_2.fastq If only one biological read is 
                                   present it is placed in *.fastq Biological 
                                   reads and above are ignored.

Batch unpacking:

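# Loop over the accession suffixes 56..62 (SRR3589956 through SRR3589962)
for i in $(seq 56 62)
do
    # Pass each .sra file as the <path> argument, as in the basic usage above
    /opt/NfsDir/BioDir/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 -O /opt/NfsDir/UserDir/qin/qin/Data/RNAseq/ SRR35899${i}.sra
done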
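
Before launching a long batch job it can help to print the commands first and check them by eye. The following is only a dry-run sketch: the output directory ./fastq_out and the function name list_dump_commands are made up for illustration, and it assumes fastq-dump is on the PATH and the .sra files sit in the current directory.

```shell
#!/usr/bin/env bash
# Dry run: print one fastq-dump command per accession instead of running it.
outdir="./fastq_out"   # hypothetical output directory, not from the original post

list_dump_commands() {
  local i
  for i in $(seq 56 62); do
    # SRR35899 + 56..62 gives SRR3589956 through SRR3589962
    echo "fastq-dump --gzip --split-3 -O ${outdir} SRR35899${i}.sra"
  done
}

list_dump_commands
```

Once the printed lines look right, piping them into bash (list_dump_commands | bash) would execute them one after another.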

Several shell tools can read gzip-compressed files directly, without unpacking them first: zgrep, zcat, zless, zdiff and so on. For example: zcat SRR3589956_1.fastq.gz | head -n 4
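
The same idea gives a quick read count, since each FASTQ record takes 4 lines. This is a minimal sketch: count_reads is a hypothetical helper name, the two demo reads are invented purely so the example runs on its own, and gzip -cd is used in place of zcat because it behaves the same way on Linux and macOS.

```shell
#!/usr/bin/env bash
# Count reads in a gzip-compressed FASTQ file without unpacking it on disk.
count_reads() {
  # Stream the decompressed text and divide the line count by 4.
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Self-contained demo on a tiny made-up file with two reads.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
printf '%s\n' '@read1' 'ACGT' '+' 'JJJJ' '@read2' 'TGCA' '+' 'IIII' \
  | gzip -c > "$demo_dir/demo.fastq.gz"

count_reads "$demo_dir/demo.fastq.gz"   # prints 2
```

On the real data this would be called as count_reads SRR3589956_1.fastq.gz.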

2. Checking sequencing quality in batch with fastqc

Reference: http://www.biotrainee.com/thread-324-1-1.html

Format: each sequence in a FASTQ file usually takes 4 lines:

@DJB775P1:248:D0MDGACXX:7:1202:12362:49613 1:Y:18:ATCACG #line 1: header line starting with @; the fields are, in order: instrument name/run id/flowcell id/flowcell lane/tile number within the flowcell lane/'x'-coordinate of the cluster within the tile/'y'-coordinate of the cluster within the tile/the member of a pair, 1 or 2/Y if the read is filtered, N otherwise/0 when none of the control bits are on, otherwise it is an even number/index sequence
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA #line 2: the sequence
+ #line 3: separator
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA #line 4: base qualities, Phred+33 encoded
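
In Phred+33 encoding, each quality character stands for the score Q = ASCII code - 33, and the estimated error probability of that base is 10^(-Q/10). Here is a small sketch of the conversion in bash; phred33_scores is a hypothetical helper name, and the input is just a handful of characters taken from the quality line above.

```shell
#!/usr/bin/env bash
# Convert a Phred+33 quality string into numeric scores (ASCII code minus 33).
phred33_scores() {
  local qual=$1 i code out=""
  for ((i = 0; i < ${#qual}; i++)); do
    # printf with a leading single quote yields the character's ASCII code.
    printf -v code '%d' "'${qual:i:1}"
    out+="$((code - 33)) "
  done
  echo "${out% }"   # drop the trailing space
}

phred33_scores 'JJIH?'   # prints: 41 41 40 39 30
```

So 'J' means Q41 (roughly 1 error in 12,600 bases) and '?' means Q30 (1 error in 1,000).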

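fastqc usage:

fastqc SRR3589956_1.fastq.gz
fastqc seqfile1 seqfile2 .. seqfileN
Common options:
-o: output directory
--extract: whether to unzip the output automatically; the default is --noextract
-t: number of threads, depending on your machine; each thread needs 250 MB of memory
-c: contaminants file; a sample can pick up contamination during sequencing, for example sequence from another species
-a: adapters file
-q: quiet mode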
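
Putting several of these options together, one call might look like the sketch below. The directory name fastqc_out and the thread count are assumptions chosen for illustration. If fastqc is not installed, or there are no *.fastq.gz files in the current directory, the script only prints the command it would have run.

```shell
#!/usr/bin/env bash
# Sketch: combine the fastqc options listed above into a single call.
outdir="fastqc_out"   # hypothetical output directory name
threads=2             # each thread needs about 250 MB of memory

# fastqc does not create the output directory itself, so make it first.
mkdir -p "$outdir"

# Keep the command in an array so file names with spaces stay intact.
fastqc_cmd=(fastqc -q -t "$threads" -o "$outdir" --noextract)

shopt -s nullglob
inputs=(*.fastq.gz)

if command -v fastqc >/dev/null 2>&1 && ((${#inputs[@]} > 0)); then
  "${fastqc_cmd[@]}" "${inputs[@]}"
else
  # Nothing to run here: just show what would be executed.
  echo "would run: ${fastqc_cmd[*]} ${inputs[*]}"
fi
```

Passing all files in one call lets fastqc spread them over the threads given with -t.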

Each input file produces two result files: an .html report and a .zip archive.

Looking at the QC report for SRR3589956: why is a chunk missing in the middle?

Viewing the batch QC results with multiQC

# First generate the FastQC reports
for id in *gz; do /opt/NfsDir/BioDir/fastqc/FastQC/fastqc -t 4 "$id"; done
# Then aggregate them with multiqc
multiqc *fastqc.zip --pdf
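
Before running multiqc it is worth checking that every input actually got a report. FastQC names the report after the input with the .gz and .fastq suffixes removed and _fastqc.zip appended. The sketch below relies on that naming and assumes the reports were written to the current directory; expected_zip and missing_reports are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Map a FASTQ file name to the FastQC zip report expected for it.
expected_zip() {
  local name=${1##*/}   # drop any leading directory
  name=${name%.gz}      # SRR3589956_1.fastq.gz -> SRR3589956_1.fastq
  name=${name%.fastq}   # -> SRR3589956_1
  echo "${name}_fastqc.zip"
}

# Print every input whose report is missing from the current directory.
missing_reports() {
  local f
  for f in "$@"; do
    [ -e "$(expected_zip "$f")" ] || echo "$f"
  done
}

shopt -s nullglob
missing_reports *.fastq.gz
```

If nothing is printed, all reports are in place and multiqc can be run on them.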

Transcriptome (3): Understanding FASTQ sequencing data
