Identifying genomic variants, including single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), from next generation sequencing data is an important part of scientific discovery.
To facilitate this research, a bioinformatics pipeline has been developed to enable researchers to accurately and rapidly identify, and annotate, sequence variants. The pipeline employs the Genome Analysis Toolkit 4 (GATK4, gatk-4.2.0.0) to perform variant calling and is based on the best practices for variant discovery analysis outlined by the Broad Institute.
This pipeline is intended for calling variants in samples that are clonal – i.e. a single individual. The frequencies of variants in these samples are expected to be 1 (for haploids or homozygous diploids) or 0.5 (for heterozygous diploids).
Workflow Overview
Idealized workflow for calling germline short variants
Step 1. Alignment and sorting
--- |
--- |
Tool |
BWA--MEM, SAMtools |
Input |
fastq files, reference genome |
Output |
$sample.sorted.bam |
Notes |
Readgroup info is provided with the -R flag. This information is key for downstream GATK functionality. GATK will not work without a read group tag. BamSortIing step can be done using the SAMtools |
Command |
bwa.sh |
INDEX=/media/sf_J_DRIVE/Project_WD/0.Reference/index/bwa/hg38
cat /home/allen/Project/wes2/2-clean_data/bwa_config |while read id
do
arr=($id)
fq1=${arr[1]}
fq2=${arr[2]}
sample=${arr[0]}
bwa mem -t 2 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2 | samtools sort -@ 2 -o /home/allen/Project/wes2/3-align/$sample.sorted.bam -
done
Step 2. Collect Alignment & Insert Size Metrics
--- |
--- |
Tool |
SAMtools |
Input |
$sample.sorted.bam, reference genome |
Output |
stats folder, $sample.sorted.html |
Notes |
collect bam metrics. |
Command |
stats.sh |
cat align_config | while read id
do
bam=/home/allen/Project/wes2/3-align/${id}.bam
samtools stats -@4 --reference /media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta ${bam} > /home/allen/Project/wes2/3-align/stats/${id}.stat
plot-bamstats -p /home/allen/Project/wes2/3-align/stats/${id} /home/allen/Project/wes2/3-align/stats/${id}.stat
done
Step 3. Mark Duplicates
--- |
--- |
Tool |
GATK4-MarkDuplicates, SAMtools |
Input |
$sample.sorted.bam |
Output |
[${id}.marked.bam], [$${id}.marked.bai] |
Notes |
After the MarkDuplicates step, bam metrics file will be given by -M, .bai file can be done using the samtools index command. |
Command |
gatk4.sh |
gatk=gatk
cat markDu_config | while read id
do
BAM=/home/allen/Project/wes2/3-align/${id}.bam
if [ ! -f /home/allen/Project/wes2/4-GATK/ok.${id}.marked.status ]
then
echo "start MarkDuplicates for ${id}" 'date'
gatk --java-options "-Xmx4g -Djava.io.temdir=/home" MarkDuplicates \
-I ${BAM} \
-O /home/allen/Project/wes2/4-GATK/${id}.marked.bam \
-M /home/allen/Project/wes2/4-GATK/${id}.metrics \
1>/home/allen/Project/wes2/4-GATK/${id}.mark.log 2>&1
if [ $? -eq 0 ]
then
touch /home/allen/Project/wes2/4-GATK/ok.${id}.marked.status
fi
echo "end MarkDuplicates for ${id}" 'date'
samtools index -@ 4 -b /home/allen/Project/wes2/4-GATK/${id}.marked.bam /home/allen/Project/wes2/4-GATK/${id}.marked.bai
fi
done
Step 4. Base Quality Score Recalibration (BQSR)
--- |
--- |
Tool |
GATK4-BaseRecalibrator and ApplyBQSR |
Input |
${id}.marked.bam, reference genome, known snp, known indel |
Output |
${id}.marked.bqsr.bam |
Notes |
Command |
BQSR.sh |
gatk=gatk
snp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/dbsnp_146.hg38.vcf.gz
indel=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta
cat bqsr_config | while read id
do
if [ ! -f /home/allen/Project/wes2/4-GATK/BQSR/${id}.marked.bqsr.bam ]
then
echo "start BQSR for ${id}" 'date'
$gatk --java-options "-Xmx4g -Djava.io.temdir=/home" BaseRecalibrator \
-R $ref \
-I /home/allen/Project/wes2/4-GATK/${id}.marked.bam \
--known-sites ${snp} \
--known-sites ${indel} \
-O /home/allen/Project/wes2/4-GATK/BQSR/${id}.recal.table \
1>/home/allen/Project/wes2/4-GATK/BQSR/${id}.recal.log 2>&1
$gatk --java-options "-Xmx4g -Djava.io.temdir=/home" ApplyBQSR \
-R $ref \
-I /home/allen/Project/wes2/4-GATK/${id}.marked.bam \
-bqsr /home/allen/Project/wes2/4-GATK/BQSR/${id}.recal.table \
-O /home/allen/Project/wes2/4-GATK/BQSR/${id}.marked.bqsr.bam
1>/home/allen/Project/wes2/4-GATK/BQSR/${id}.ApplyBQSR.log 2>&1
echo "end BQSR for ${id}" 'date'
fi
done
Step 5. Call Variants
--- |
--- |
Tool |
GATK4 |
Input |
${id}.marked.bqsr.bam, reference genome, known snp, known indel, bed |
Output |
merge.vcf |
Notes |
Variant calling with GenomicsDBImport method. |
Command |
germline.sh |
gatk=gatk
snp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/dbsnp_146.hg38.vcf.gz
indel=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta
bed=/media/sf_J_DRIVE/Project_WD/Osteopetrosis/hglft-hg38.bed
cat germline_config | while read id
do
echo "start HC for ${id}" 'date'
$gatk --java-options "-Xmx4g -Djava.io.temdir=/home" HaplotypeCaller -ERC GVCF \
-R ${ref} \
-I /home/allen/Project/wes2/4-GATK/BQSR/${id}.marked.bqsr.bam \
--dbsnp ${snp} \
-L ${bed} \
-O /home/allen/Project/wes2/4-GATK/gvcf/${id}.raw.vcf \
1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.HC.log 2>&1
echo "finish HC for ${id}" 'date'
done
cd /home/allen/Project/wes2/4-GATK/gvcf
for chr in chr{1..22}
do
time $gatk --java-options "-Xmx4g -Djava.io.temdir=/home" GenomicsDBImport \
-R ${ref} \
$(ls /home/allen/Project/wes2/4-GATK/gvcf/*raw.vcf | awk '{print "-V "$0" "}') \
-L ${chr} \
--genomicsdb-workspace-path gvcf_${chr}.db
1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.GenomicsDBImport.log 2>&1
time $gatk --java-options "-Xmx4g -Djava.io.temdir=/home" GenotypeGVCFs \
-R ${ref} \
-V gendb://gvcf_${chr}.db \
-O gvcfs_${chr}.vcf
1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.GenotypeGVCFs.log 2>&1
done
$gatk --java-options "-Xmx4g -Djava.io.temdir=/home" GatherVcfs \
$(for i in {1..22}; do echo "-I gvcfs_chr${i}.vcf"; done) \
-O merge.vcf
1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.GatherVcfs.log 2>&1
Step 6. Variant Filtration
--- |
--- |
Tool |
GATK4 |
Input |
merge.vcf, reference genome, known snp, known indel, hapmap, omni |
Output |
merge.HC.indels.snps.VQSR.vcf |
Notes |
GATK HaplotypeCaller之后,首选的质控方案是GATK VQSR,它通过机器学习的方法利用多个不同的数据特征训练一个模型(高斯混合模型)对变异数据进行质控,然而不幸的是使用VQSR需要具备以下两个条件:第一,需要一个精心准备的已知变异集,它将作为训练质控模型的真集。第二,要求新检测的结果中有足够多的变异,不然VQSR在进行模型训练的时候会因为可用的变异位点数目不足而无法进行。由于条件1的限制,会导致很多非人的物种在完成变异检测之后没法使用GATK VQSR的方法进行质控。而由于条件2,也常常导致一些小panel甚至外显子测序,由于最后的变异位点不够,也无法使用VQSR。这个时候,我们就不得不选择硬过滤的方式来质控了。那什么叫做硬过滤呢?所谓硬过滤其实就是通过人为设定一个或者若干个指标阈值(也可以叫数据特征值),然后把所有不满足阈值的变异位点采用一刀切掉的方法。 |
Command |
VQSR(snp_indel_VQSR.sh), Hard-filtering |
VQSR
##先对Indels进行过滤,再对snp进行过滤
gatk=gatk
hapmap=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/hapmap_3.3.hg38.vcf.gz
omni=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/1000G_omni2.5.hg38.vcf.gz
snp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
dbsnp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/dbsnp_146.hg38.vcf.gz
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta
indel=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
##Indel Mode
time $gatk VariantRecalibrator \
-R $ref \
-V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
--resource:mills,known=true,training=true,truth=true,prior=12.0 $indel \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbsnp \
-an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
-mode INDEL \
--max-gaussians 4 \
--rscript-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.plots.R \
--tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.tranches \
-O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.recal \
1>/home/allen/Project/wes2/4-GATK/population/merge.indel.vqsr.log 2>&1
time $gatk ApplyVQSR \
-R $ref \
-V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
--truth-sensitivity-filter-level 99.0 \
--tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.tranches \
--recal-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.recal \
-mode INDEL \
-O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.VQSR.vcf \
1>/home/allen/Project/wes2/4-GATK/population/merge.indels.ApplyVQSR.log 2>&1
echo "** Indels VQSR done **"
##SNP Mode
time $gatk VariantRecalibrator \
-R $ref \
-V /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.VQSR.vcf \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
--resource:omini,known=false,training=true,truth=false,prior=12.0 $omni \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 $snp \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbsnp \
-an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
--rscript-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.plots.R \
--tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.tranches \
-O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.recal \
1>/home/allen/Project/wes2/4-GATK/population/merge.indels.snps.vqsr.log 2>&1
time $gatk ApplyVQSR \
-R $ref \
-V /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.VQSR.vcf \
--truth-sensitivity-filter-level 99.0 \
--tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.tranches \
--recal-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.recal \
-mode SNP \
-O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.VQSR.vcf \
1>/home/allen/Project/wes2/4-GATK/population/merge.indels.snps.ApplyVQSR.log 2>&1
echo "** SNPs VQSR done **"
Hard-filtering
那么如何执行硬过滤?首先,需要我们确定该用哪些指标来评价变异的好坏。这个非常重要,选择对了事半功倍,选得不合理,过滤的结果有时还不如不过滤的。如果把这个问题放在从前,我们需要做比较多的尝试才能确定一些合适的指标,但现在就方便很多了,可以直接使用GATK VQSR所用的指标——毕竟这些指标都是经过精挑细选的。我想这应该不难理解,既然VQSR就是用这些指标来训练质控模型的,那么它们就可以在一定程度上描述每个变异的质量,我们用这些指标设置对应的阈值来进行硬过滤也将是合理的。VQSR使用的数据指标有6个(这些指标都在VCF文件的INFO域中,如果不是GATK得到的变异,可能会有所不同,但知道它们的含义之后也是可以自己计算的),分别是:
- QualByDepth(QD)
- FisherStrand (FS)
- StrandOddsRatio (SOR)
- RMSMappingQuality (MQ)
- MappingQualityRankSumTest (MQRankSum)
- ReadPosRankSumTest (ReadPosRankSum)
指标有了,那么阈值应该设置为多少?下面我想先给出一个硬过滤的例子,然后再逐个来对其进行分析,以便大家能够更好地理解变异质控的思路。值得注意的是不同的数据,有不同的情况,它的阈值有时是不同的。不过不用担心,当你掌握了如何做的思路之后完全有能力根据具体的情况举一反三。
首先是硬过滤的例子,这个过程我都用最新的GATK来完成。GATK 4.0中有一个专门的VariantFiltration模块(继承自GATK 3.x),它可以很方便地帮我们完成这个事情。不过,过滤的时候,需要分SNP和Indel这两个不同的变异类型来进行,它们有些阈值是不同的,需要区别对待。
gatk=gatk
##Subset to SNPs-only callset with SelectVariants
$gatk SelectVariants \
-V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
-select-type SNP \
-O /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps.vcf.gz \
1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/select.snps.log 2>&1
##Hard-filter SNPs on multiple expressions using VariantFiltration
$gatk VariantFiltration \
-V /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps_hard_filtered.vcf.gz \
1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/snps.hardfilter.log 2>&1
##Subset to indels-only callset with SelectVariants
$gatk SelectVariants \
-V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
-select-type INDEL \
-O /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.vcf.gz \
1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/select.indels.log 2>&1
##Similarly, hard-filter indels on multiple expressions using VariantFiltration
$gatk VariantFiltration \
-V /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels_hard_filtered.vcf.gz \
1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.hardfilter.log 2>&1
##融合过滤后的indel和snp
$gatk MergeVcfs \
-I /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps_hard_filtered.vcf.gz \
-I /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels_hard_filtered.vcf.gz \
-O /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels_snp_hard_filtered.vcf.gz \
1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.snps.merge.log 2>&1
Step 7. Variant Annotation
--- |
--- |
Tool |
Annovar, Funcotator |
Input |
merge.HC.indels.snps.VQSR.vcf |
Output |
merge.hg38.annovar.txt.hg38_multianno.txt, merge_funcotator.tmp.maf |
Notes |
注意注释软件的数据库版本 |
Command |
annovar.sh, funcotator.sh |
Annovar
echo "start ANNOVAR for " `date`
/home/allen/annovar/table_annovar.pl /home/allen/Project/wes2/4-GATK/population/indel_snp_VQSR/merge.HC.indels.snps.VQSR.vcf /home/allen/annovar/humandb/ \
-buildver hg38 \
-out /home/allen/Project/wes2/4-GATK/annotation/annovar/merge.hg38.annovar.txt \
-remove \
-protocol clinvar_20210501 \
-operation f \
-nastring . \
-vcfinput \
1>/home/allen/Project/wes2/4-GATK/annotation/annovar/annovar.log 2>&1
echo "end ANNOVAR for " `date`
Funcotator
GATK=gatk
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta
interval=/home/allen/Project/wes2/4-GATK/annotation/funcotator/hg38.exon.interval_list
source=/media/sf_J_DRIVE/Project_WD/0.Anotation/funcotator/funcotator_dataSources.v1.7.20200521s
echo "start Funcotator for " `date`
$GATK Funcotator -R $ref \
-V /home/allen/Project/wes2/4-GATK/population/indel_snp_VQSR/merge.HC.indels.snps.VQSR.vcf \
-O /home/allen/Project/wes2/4-GATK/annotation/funcotator/merge_funcotator.tmp.maf \
--data-sources-path ${source} \
--output-file-format MAF \
--ref-version hg38 \
1>/home/allen/Project/wes2/4-GATK/annotation/funcotator/merge.funcotator.log 2>&1
echo "end Funcotator for " `date`
参考资料
- Variant Calling Pipeline using GATK4
- GATK流程找变异
- 生信软件分析环境搭建
- GATK 4.0 WGS germline call variant
- GATK4 最佳实践-生殖细胞突变的检测与识别
- WES/WGS-QC-SNV(GATK) | 自动