GATK的本职工作是Variant calling,但是就像我之前所说的,它作为基因组分析工具箱,还是有很多其他工具,今天学习的是诊断和质量控制工具的其中两个:CountReads,FastaStats。
FastaStats
功能: 计算参考基因组的基本统计值
分类: 诊断和质量控制工具
概要: 主要就是统计碱基总数,和常规的碱基数(ATCG)
吐槽:功能还真是简单。。
输入: FASTA参考文件
输出: 结果要么输出到屏幕,要么是输出到(-o)到文件中。
使用案例:我看了一下拟南芥的参考基因组
java -jar ~/biosoft/GenomeAnalysisTK.jar -T FastaStats \
-R TAIR10.fa
输出:
Total bases 119667750
Regular bases 119481543
CountReads
功能: 计算reads数
分类: 诊断和质量控制工具
概要: 最好和--read-filter
合用,这样可以了解下符合特定标准的reads数
输入: 一个或多个BAM文件
输出: 结果会输出到屏幕(标准输出)上,毕竟是用来确定阈值的,也不需要一定要输出到文件中
不加参数--read-filter
java -jar ~/biosoft/GenomeAnalysisTK.jar -T CountReads -R $work/database/TAIR10/TAIR10.fa -I BC_bg_reads.sorted.bam
输出:
CountReads - CountReads counted 55080781 reads in the traversal
--read-filter/-rf
后面可以接很多的选项,官方文档列出了如下内容:
Name | Summary |
---|---|
BadCigarFilter | Filter out reads with wonky CIGAR strings |
BadMateFilter | Filter out reads whose mate maps to a different contig |
CountingFilteringIterator.CountingReadFilter | |
DuplicateReadFilter | Filter out duplicate reads |
FailsVendorQualityCheckFilter | Filter out reads that fail the vendor quality check |
HCMappingQualityFilter | Filter out reads with low mapping qualities for HaplotypeCaller |
LibraryReadFilter | Only use reads from the specified library |
MalformedReadFilter | Filter out malformed reads |
MappingQualityFilter | Filter out reads with low mapping qualities |
MappingQualityUnavailableFilter | Filter out reads with no mapping quality information |
MappingQualityZeroFilter | Filter out reads with mapping quality zero |
MateSameStrandFilter | Filter out reads with bad pairing (and related) properties |
MaxInsertSizeFilter | Filter out reads that exceed a given insert size |
MissingReadGroupFilter | Filter out reads without read group information |
NoOriginalQualityScoresFilter | Filter out reads that do not have an original quality quality score (OQ) tag |
NotPrimaryAlignmentFilter | Filter out read records that are secondary alignments |
OverclippedReadFilter | Filter out reads that are over-soft-clipped |
Platform454Filter | Filter out reads produced by 454 technology |
PlatformFilter | Filter out reads that were generated by a specific sequencing platform |
PlatformUnitFilter | Filter out reads with blacklisted platform unit tags |
ReadGroupBlackListFilter | Filter out reads matching a read group tag value |
ReadLengthFilter | Filter out reads based on length |
ReadNameFilter | Only use reads with this read name |
ReadStrandFilter | Filter out reads based on strand orientation |
ReassignMappingQualityFilter | Set the mapping quality of all reads to a given value |
ReassignOneMappingQualityFilter | Set the mapping quality of reads with a given value to another given value |
ReassignOriginalMQAfterIndelRealignmentFilter | Revert the MQ of reads that were modified by IndelRealigner |
SampleFilter | Only use reads belonging to a specific sample |
SingleReadGroupFilter | Only use reads from the specified read group |
UnmappedReadFilter | Filter out unmapped reads |
使用的时候记住,对于XXXXFilter,你需要点每个过滤选项进去看下具体用法,比如说我需要过滤比对质量低于20的,我发现MappingQualityFilter的用法解释是
java -jar GenomeAnalysisTk.jar \
-T HaplotypeCaller \
-R reference.fasta \
-I input.bam \
-o output.vcf \
-rf MappingQuality \
-mmq 15
那么我前面的代码就可以改为
java -jar ~/biosoft/GenomeAnalysisTK.jar -T CountReads -R $work/database/TAIR10/TAIR10.fa -I BC_bg_reads.sorted.bam \
-rf MappingQuality -mmq 15
结果:
8190205 reads were filtered out during the traversal out of approximately 55080781 total reads (14.87%)
果然加上比对质量后,就有一些要被过滤掉呢。
基本上GATK每一个工具都有一些默认过滤选项,你可以在每个工具的额外信息部分进行了解。
今天就介绍这两个工具,顺便提了GATK的过滤功能。白了个白,每天学个GATK的任务完成。