GATK4检测胚系突变(SNV&Indel)

Identifying genomic variants, including single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), from next generation sequencing data is an important part of scientific discovery.

To facilitate this research, a bioinformatics pipeline has been developed to enable researchers to accurately and rapidly identify, and annotate, sequence variants. The pipeline employs the Genome Analysis Toolkit 4 (GATK4, gatk-4.2.0.0) to perform variant calling and is based on the best practices for variant discovery analysis outlined by the Broad Institute.

This pipeline is intended for calling variants in samples that are clonal – i.e. a single individual. The frequencies of variants in these samples are expected to be 1 (for haploids or homozygous diploids) or 0.5 (for heterozygous diploids).

Workflow Overview

Idealized workflow for calling germline short variants

Step 1. Alignment and sorting

---	---
Tool	BWA--MEM, SAMtools
Input	fastq files, reference genome
Output	$sample.sorted.bam
Notes	Readgroup info is provided with the -R flag. This information is key for downstream GATK functionality. GATK will not work without a read group tag. BamSortIing step can be done using the SAMtools
Command	bwa.sh

INDEX=/media/sf_J_DRIVE/Project_WD/0.Reference/index/bwa/hg38

cat /home/allen/Project/wes2/2-clean_data/bwa_config |while read id
do
arr=($id)
fq1=${arr[1]}
fq2=${arr[2]}
sample=${arr[0]}
bwa mem -t 2 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2 | samtools sort -@ 2 -o /home/allen/Project/wes2/3-align/$sample.sorted.bam - 
done

Step 2. Collect Alignment & Insert Size Metrics

---	---
Tool	SAMtools
Input	$sample.sorted.bam, reference genome
Output	stats folder, $sample.sorted.html
Notes	collect bam metrics.
Command	stats.sh

cat align_config | while read id
do
  bam=/home/allen/Project/wes2/3-align/${id}.bam
  samtools stats -@4 --reference /media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta ${bam} > /home/allen/Project/wes2/3-align/stats/${id}.stat
  plot-bamstats -p /home/allen/Project/wes2/3-align/stats/${id}  /home/allen/Project/wes2/3-align/stats/${id}.stat
done

Step 3. Mark Duplicates

---	---
Tool	GATK4-MarkDuplicates, SAMtools
Input	$sample.sorted.bam
Output	[${id}.marked.bam], [$${id}.marked.bai]
Notes	After the MarkDuplicates step, bam metrics file will be given by -M, .bai file can be done using the samtools index command.
Command	gatk4.sh


gatk=gatk

cat  markDu_config | while read id
do
  BAM=/home/allen/Project/wes2/3-align/${id}.bam
  if [ ! -f  /home/allen/Project/wes2/4-GATK/ok.${id}.marked.status ]
  then
    echo "start MarkDuplicates for ${id}" 'date'
    gatk --java-options "-Xmx4g -Djava.io.temdir=/home" MarkDuplicates \
    -I ${BAM} \
    -O /home/allen/Project/wes2/4-GATK/${id}.marked.bam \
    -M /home/allen/Project/wes2/4-GATK/${id}.metrics \
    1>/home/allen/Project/wes2/4-GATK/${id}.mark.log 2>&1

    if [ $? -eq 0 ]
    then
      touch /home/allen/Project/wes2/4-GATK/ok.${id}.marked.status
    fi
    echo "end MarkDuplicates for ${id}" 'date'
    samtools index -@ 4 -b /home/allen/Project/wes2/4-GATK/${id}.marked.bam /home/allen/Project/wes2/4-GATK/${id}.marked.bai
  fi
done

Step 4. Base Quality Score Recalibration (BQSR)

---	---
Tool	GATK4-BaseRecalibrator and ApplyBQSR
Input	${id}.marked.bam, reference genome, known snp, known indel
Output	${id}.marked.bqsr.bam
Notes
Command	BQSR.sh

gatk=gatk
snp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/dbsnp_146.hg38.vcf.gz
indel=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta


cat bqsr_config | while read id
do 
  if [ ! -f /home/allen/Project/wes2/4-GATK/BQSR/${id}.marked.bqsr.bam ]
  then
    echo "start BQSR for ${id}" 'date'
    $gatk --java-options "-Xmx4g -Djava.io.temdir=/home" BaseRecalibrator \
    -R $ref \
    -I /home/allen/Project/wes2/4-GATK/${id}.marked.bam \
    --known-sites ${snp} \
    --known-sites ${indel} \
    -O /home/allen/Project/wes2/4-GATK/BQSR/${id}.recal.table \
    1>/home/allen/Project/wes2/4-GATK/BQSR/${id}.recal.log 2>&1
    
    $gatk --java-options "-Xmx4g -Djava.io.temdir=/home" ApplyBQSR \
    -R $ref \
    -I /home/allen/Project/wes2/4-GATK/${id}.marked.bam \
    -bqsr /home/allen/Project/wes2/4-GATK/BQSR/${id}.recal.table \
    -O /home/allen/Project/wes2/4-GATK/BQSR/${id}.marked.bqsr.bam
    1>/home/allen/Project/wes2/4-GATK/BQSR/${id}.ApplyBQSR.log 2>&1
    
    echo "end BQSR for ${id}" 'date'
  fi
done

Step 5. Call Variants

---	---
Tool	GATK4
Input	${id}.marked.bqsr.bam, reference genome, known snp, known indel, bed
Output	merge.vcf
Notes	Variant calling with GenomicsDBImport method.
Command	germline.sh

gatk=gatk
snp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/dbsnp_146.hg38.vcf.gz
indel=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta
bed=/media/sf_J_DRIVE/Project_WD/Osteopetrosis/hglft-hg38.bed

cat germline_config | while read id
do
  echo "start HC for ${id}" 'date'
  $gatk --java-options "-Xmx4g -Djava.io.temdir=/home" HaplotypeCaller -ERC GVCF \
  -R ${ref} \
  -I /home/allen/Project/wes2/4-GATK/BQSR/${id}.marked.bqsr.bam \
  --dbsnp ${snp} \
  -L ${bed} \
  -O /home/allen/Project/wes2/4-GATK/gvcf/${id}.raw.vcf \
  1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.HC.log 2>&1 
  echo "finish HC for ${id}" 'date'

done

cd /home/allen/Project/wes2/4-GATK/gvcf
for chr in chr{1..22}
do
time $gatk --java-options "-Xmx4g -Djava.io.temdir=/home" GenomicsDBImport \
-R ${ref} \
$(ls /home/allen/Project/wes2/4-GATK/gvcf/*raw.vcf | awk '{print "-V "$0" "}') \
-L ${chr} \
--genomicsdb-workspace-path gvcf_${chr}.db
1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.GenomicsDBImport.log 2>&1

time $gatk --java-options "-Xmx4g -Djava.io.temdir=/home" GenotypeGVCFs \
-R ${ref} \
-V gendb://gvcf_${chr}.db \
-O gvcfs_${chr}.vcf
1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.GenotypeGVCFs.log 2>&1
done

$gatk --java-options "-Xmx4g -Djava.io.temdir=/home" GatherVcfs \
$(for i in {1..22}; do echo "-I gvcfs_chr${i}.vcf"; done) \
-O merge.vcf
1>/home/allen/Project/wes2/4-GATK/gvcf/${id}.GatherVcfs.log 2>&1

Step 6. Variant Filtration

---	---
Tool	GATK4
Input	merge.vcf, reference genome, known snp, known indel, hapmap, omni
Output	merge.HC.indels.snps.VQSR.vcf
Notes	GATK HaplotypeCaller之后，首选的质控方案是GATK VQSR，它通过机器学习的方法利用多个不同的数据特征训练一个模型（高斯混合模型）对变异数据进行质控，然而不幸的是使用VQSR需要具备以下两个条件：第一，需要一个精心准备的已知变异集，它将作为训练质控模型的真集。第二，要求新检测的结果中有足够多的变异，不然VQSR在进行模型训练的时候会因为可用的变异位点数目不足而无法进行。由于条件1的限制，会导致很多非人的物种在完成变异检测之后没法使用GATK VQSR的方法进行质控。而由于条件2，也常常导致一些小panel甚至外显子测序，由于最后的变异位点不够，也无法使用VQSR。这个时候，我们就不得不选择硬过滤的方式来质控了。那什么叫做硬过滤呢？所谓硬过滤其实就是通过人为设定一个或者若干个指标阈值（也可以叫数据特征值），然后把所有不满足阈值的变异位点采用一刀切掉的方法。
Command	VQSR(snp_indel_VQSR.sh), Hard-filtering

VQSR

##先对Indels进行过滤，再对snp进行过滤

gatk=gatk
hapmap=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/hapmap_3.3.hg38.vcf.gz
omni=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/1000G_omni2.5.hg38.vcf.gz
snp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
dbsnp=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/dbsnp_146.hg38.vcf.gz
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta
indel=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

##Indel Mode

time $gatk VariantRecalibrator \
   -R $ref \
   -V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
   --resource:mills,known=true,training=true,truth=true,prior=12.0 $indel \
   --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbsnp \
   -an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
   -mode INDEL \
   --max-gaussians 4 \
   --rscript-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.plots.R \
   --tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.tranches \
   -O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.recal \
   1>/home/allen/Project/wes2/4-GATK/population/merge.indel.vqsr.log 2>&1
   
time $gatk ApplyVQSR \
   -R $ref \
   -V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
   --truth-sensitivity-filter-level 99.0 \
   --tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.tranches \
   --recal-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.recal \
   -mode INDEL \
   -O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.VQSR.vcf \
   1>/home/allen/Project/wes2/4-GATK/population/merge.indels.ApplyVQSR.log 2>&1
   echo "** Indels VQSR done **"


##SNP Mode

time $gatk VariantRecalibrator \
   -R $ref \
   -V /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.VQSR.vcf \
   --resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
   --resource:omini,known=false,training=true,truth=false,prior=12.0 $omni \
   --resource:1000G,known=false,training=true,truth=false,prior=10.0 $snp \
   --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbsnp \
   -an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
   -mode SNP \
   -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
   --rscript-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.plots.R \
   --tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.tranches \
   -O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.recal  \
   1>/home/allen/Project/wes2/4-GATK/population/merge.indels.snps.vqsr.log 2>&1 

time $gatk ApplyVQSR \
   -R $ref \
   -V /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.VQSR.vcf \
   --truth-sensitivity-filter-level 99.0 \
   --tranches-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.tranches \
   --recal-file /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.recal \
   -mode SNP \
   -O /home/allen/Project/wes2/4-GATK/population/merge.HC.indels.snps.VQSR.vcf \
   1>/home/allen/Project/wes2/4-GATK/population/merge.indels.snps.ApplyVQSR.log 2>&1
   echo "** SNPs VQSR done **"

Hard-filtering

那么如何执行硬过滤？首先，需要我们确定该用哪些指标来评价变异的好坏。这个非常重要，选择对了事半功倍，选得不合理，过滤的结果有时还不如不过滤的。如果把这个问题放在从前，我们需要做比较多的尝试才能确定一些合适的指标，但现在就方便很多了，可以直接使用GATK VQSR所用的指标——毕竟这些指标都是经过精挑细选的。我想这应该不难理解，既然VQSR就是用这些指标来训练质控模型的，那么它们就可以在一定程度上描述每个变异的质量，我们用这些指标设置对应的阈值来进行硬过滤也将是合理的。VQSR使用的数据指标有6个（这些指标都在VCF文件的INFO域中，如果不是GATK得到的变异，可能会有所不同，但知道它们的含义之后也是可以自己计算的），分别是：

QualByDepth（QD）
FisherStrand (FS)
StrandOddsRatio (SOR)
RMSMappingQuality (MQ)
MappingQualityRankSumTest (MQRankSum)
ReadPosRankSumTest (ReadPosRankSum)

指标有了，那么阈值应该设置为多少？下面我想先给出一个硬过滤的例子，然后再逐个来对其进行分析，以便大家能够更好地理解变异质控的思路。值得注意的是不同的数据，有不同的情况，它的阈值有时是不同的。不过不用担心，当你掌握了如何做的思路之后完全有能力根据具体的情况举一反三。

首先是硬过滤的例子，这个过程我都用最新的GATK来完成。GATK 4.0中有一个专门的VariantFiltration模块（继承自GATK 3.x），它可以很方便地帮我们完成这个事情。不过，过滤的时候，需要分SNP和Indel这两个不同的变异类型来进行，它们有些阈值是不同的，需要区别对待。

gatk=gatk

##Subset to SNPs-only callset with SelectVariants

$gatk SelectVariants \
    -V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
    -select-type SNP \
    -O /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps.vcf.gz \
    1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/select.snps.log 2>&1
    
##Hard-filter SNPs on multiple expressions using VariantFiltration

$gatk VariantFiltration \
    -V /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 30.0" --filter-name "QUAL30" \
    -filter "SOR > 3.0" --filter-name "SOR3" \
    -filter "FS > 60.0" --filter-name "FS60" \
    -filter "MQ < 40.0" --filter-name "MQ40" \
    -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps_hard_filtered.vcf.gz \
    1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/snps.hardfilter.log 2>&1


##Subset to indels-only callset with SelectVariants
  
$gatk SelectVariants \
    -V /home/allen/Project/wes2/4-GATK/gvcf/merge.vcf \
    -select-type INDEL \
    -O /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.vcf.gz \
    1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/select.indels.log 2>&1
    
    
##Similarly, hard-filter indels on multiple expressions using VariantFiltration
    
$gatk VariantFiltration \
    -V /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 30.0" --filter-name "QUAL30" \
    -filter "FS > 200.0" --filter-name "FS200" \
    -filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    -O /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels_hard_filtered.vcf.gz \
    1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.hardfilter.log 2>&1
  
##融合过滤后的indel和snp  
    
$gatk MergeVcfs \
    -I /home/allen/Project/wes2/4-GATK/population/Hard_filter/snps_hard_filtered.vcf.gz \
    -I /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels_hard_filtered.vcf.gz \
    -O /home/allen/Project/wes2/4-GATK/population/Hard_filter/indels_snp_hard_filtered.vcf.gz \
    1>/home/allen/Project/wes2/4-GATK/population/Hard_filter/indels.snps.merge.log 2>&1

Step 7. Variant Annotation

---	---
Tool	Annovar, Funcotator
Input	merge.HC.indels.snps.VQSR.vcf
Output	merge.hg38.annovar.txt.hg38_multianno.txt, merge_funcotator.tmp.maf
Notes	注意注释软件的数据库版本
Command	annovar.sh, funcotator.sh

Annovar

echo "start ANNOVAR for  " `date`
/home/allen/annovar/table_annovar.pl /home/allen/Project/wes2/4-GATK/population/indel_snp_VQSR/merge.HC.indels.snps.VQSR.vcf  /home/allen/annovar/humandb/ \
-buildver hg38 \
-out /home/allen/Project/wes2/4-GATK/annotation/annovar/merge.hg38.annovar.txt \
-remove \
-protocol clinvar_20210501 \
-operation f \
-nastring . \
-vcfinput \
1>/home/allen/Project/wes2/4-GATK/annotation/annovar/annovar.log 2>&1
echo "end ANNOVAR for  " `date`

Funcotator

GATK=gatk
ref=/media/sf_J_DRIVE/Project_WD/0.Reference/GATK/hg38/Homo_sapiens_assembly38.fasta
interval=/home/allen/Project/wes2/4-GATK/annotation/funcotator/hg38.exon.interval_list
source=/media/sf_J_DRIVE/Project_WD/0.Anotation/funcotator/funcotator_dataSources.v1.7.20200521s

echo "start Funcotator for  " `date`
$GATK  Funcotator -R $ref \
-V /home/allen/Project/wes2/4-GATK/population/indel_snp_VQSR/merge.HC.indels.snps.VQSR.vcf \
-O /home/allen/Project/wes2/4-GATK/annotation/funcotator/merge_funcotator.tmp.maf \
--data-sources-path ${source} \
--output-file-format MAF \
--ref-version hg38 \
1>/home/allen/Project/wes2/4-GATK/annotation/funcotator/merge.funcotator.log  2>&1
echo "end Funcotator for " `date`

GATK4检测胚系突变(SNV&Indel)

GATK4检测胚系突变(SNV&Indel)

Workflow Overview

Step 1. Alignment and sorting

Step 2. Collect Alignment & Insert Size Metrics

Step 3. Mark Duplicates

Step 4. Base Quality Score Recalibration (BQSR)

Step 5. Call Variants

Step 6. Variant Filtration

VQSR

Hard-filtering

Step 7. Variant Annotation

Annovar

Funcotator

参考资料

推荐阅读更多精彩内容