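Problem addressed
Given sequencing data from a cancer patient's tumor and normal tissue, process and analyze the data with a series of tools to identify the patient's somatic variants (mainly SNPs and indels), annotate and interpret those variants, and use the variant information to recommend drugs that are likely to work well for that patient.
Preparation
First, build a somatic SNP + indel calling pipeline
Trimmomatic filters the reads, BWA aligns them, Samtools sorts and indexes the BAM files, and GATK 4.0.11.0 marks duplicates and performs base quality score recalibration. Variant calling later uses GATK Mutect2, which, like HaplotypeCaller, performs local realignment itself, so the indel local realignment step is omitted from the preprocessing stage.
Note: the reference genome (hg38) and some auxiliary data come from the GATK Resource Bundle.
The preprocessing shell script (preprocess.sh), written by a senior labmate with reference to the 碱基矿工 blog, is as follows:
(Markdown syntax highlighting seems to have some minor problems when displaying shell scripts.)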
#!/bin/bash
# set some paths
tool_path="$HOME/diploma_project/Tools"
data_path="$HOME/diploma_project/Data"
trimmomatic="${tool_path}/Trimmomatic-0.38/trimmomatic-0.38.jar"
trimmomatic_path=${trimmomatic%/*}
bwa="${tool_path}/bwa-0.7.17/bwa"
samtools="${tool_path}/samtools/bin/samtools"
gatk="${tool_path}/gatk-4.0.11.0/gatk"
#reference download from "gatk bundle" website
reference="${data_path}/reference/hg38"
GATK_bundle="${data_path}/GATK_bundle/hg38_from_ck"
# $gatk IndexFeatureFile --feature-file $GATK_bundle/hapmap_3.3.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/1000G_omni2.5.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/1000G_phase1.snps.high_confidence.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/dbsnp_146.hg38.vcf
# pay attention to "read group header"(GATK website)
if [ $# -lt 6 ]; then #num_of_parameter < 6
echo "usage: $0 fq1 fq2 Read_Group_ID(Lane ID) library sample_ID outdir [number_threads]"
exit 1
fi
fq1=$1 #normal / cancer
fq2=$2 #normal / cancer
RGID=$3 ## Read Group ID,generally can be replaced by Lane ID
library=$4 ## Sequencing library ID
sample=$5 ## Sample ID
outdir=$6 ## Output directory path
RGPL="ILLUMINA"
RGPU=$RGID
#set number of threads
if [ -n "$7" ]; then #if [ $# -lt 7 ]
nt=$7
else
nt=4
fi
## create folder named by the sample
outdir=${outdir}/${sample}
## Acquire the filename of the fastq file, assuming that fq1 and fq2 have the same prefix name
## Get rid of the path prefix
fq_file_name=`basename $fq1`
## Get rid of the suffix,only keep the filename. Consider two possibilities(match and delete the char after %%)
fq_file_name=${fq_file_name%%.R1.fq.gz}
fq_file_name=${fq_file_name%%.R1.fastq.gz}
# output directory
if [ ! -d $outdir/cleanfq ]; then
mkdir -p $outdir/cleanfq
fi
if [ ! -d $outdir/bwa ]; then
mkdir -p $outdir/bwa
fi
if [ ! -d $outdir/gatk ]; then
mkdir -p $outdir/gatk
fi
echo -e "\n"
echo "RUN info"
echo "fastq1 : $(basename $fq1)"
echo "fastq2 : $(basename $fq2)"
echo "sample ID : $sample"
echo "output dir : $outdir"
echo "threads : $nt"
echo -e "\n*** Started at $(date +'%T %F') ***\n"
## Perform QC to the raw reads using Trimmomatic, where an important parameter, keepBothReads is set True in the ILLUMINACLIP step.
if [ ! -e $outdir/cleanfq/${fq_file_name}.unpaired.2.fq.gz ]; then
java -jar ${trimmomatic} PE \
-threads $nt \
$fq1 $fq2 \
$outdir/cleanfq/${fq_file_name}.paired.1.fq.gz \
$outdir/cleanfq/${fq_file_name}.unpaired.1.fq.gz \
$outdir/cleanfq/${fq_file_name}.paired.2.fq.gz \
$outdir/cleanfq/${fq_file_name}.unpaired.2.fq.gz \
ILLUMINACLIP:$trimmomatic_path/adapters/TruSeq3-PE-2.fa:2:30:10:8:True \
SLIDINGWINDOW:5:15 LEADING:5 TRAILING:5 MINLEN:50 && echo -e "\n*** fq QC done at $(date +'%T %F') ***\n"
fi
## Use bwa mem to align reads
## -M : Mark shorter split hits as secondary (for Picard compatibility) -Y : soft clipping
## Use samtools to convert sam file to bam file (Samtools is designed to work on a stream. It regards an input file '-' as the standard input)
$bwa mem -t $nt -M -Y \
-R "@RG\tID:$RGID\tPL:$RGPL\tPU:$RGPU\tLB:$library\tSM:$sample" \
$reference/Homo_sapiens_assembly38.fasta \
$outdir/cleanfq/${fq_file_name}.paired.1.fq.gz \
$outdir/cleanfq/${fq_file_name}.paired.2.fq.gz | \
$samtools view -Sb - > $outdir/bwa/${sample}.bam && \
echo -e "\n*** BWA MEM done at $(date +'%T %F') ***\n" && \
# $samtools sort -@ 4 -m 4G -O bam -o $outdir/bwa/${sample}.sorted.bam \
$samtools sort -@ 2 -m 2G -O bam -o $outdir/bwa/${sample}.sorted.bam \
$outdir/bwa/${sample}.bam && echo -e "\n*** sorted raw bamfile done at $(date +'%T %F') ***\n"
## Identifies duplicate reads -M File to write duplication metrics to
$gatk MarkDuplicates \
-I $outdir/bwa/${sample}.sorted.bam \
-M $outdir/bwa/${sample}.markdup_metrics.txt \
-O $outdir/bwa/${sample}.sorted.markdup.bam && echo -e "\n*** ${sample}.sorted.bam MarkDuplicates done at $(date +'%T %F') ***\n" || exit 1
## index a coordinate-sorted BAM file for fast random access. Output .bam.bai file
$samtools index $outdir/bwa/${sample}.sorted.markdup.bam && \
echo -e "\n*** ${sample}.sorted.markdup.bam index done at $(date +'%T %F') ***\n" || exit 1
# Maybe need local realignment?
# Base quality score recalibration(BQSR)
# Does your vcf file have an index? GATK4 does not support on the fly indexing of VCFs anymore.
# Detect systematic errors in base quality scores
$gatk BaseRecalibrator \
-R $reference/Homo_sapiens_assembly38.fasta \
-I $outdir/bwa/${sample}.sorted.markdup.bam \
--known-sites $GATK_bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--known-sites $GATK_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites $GATK_bundle/dbsnp_146.hg38.vcf.gz \
-O $outdir/bwa/${sample}.sorted.markdup.recal_data.table && \
echo -e "\n*** ${sample}.sorted.markdup.recal_data.table creation done at $(date +'%T %F') ***\n" || exit 1
# Recalibrate the base qualities of the input reads based on the recalibration table produced by the BaseRecalibrator tool, and outputs a recalibrated BAM or CRAM file.
$gatk ApplyBQSR \
--bqsr-recal-file $outdir/bwa/${sample}.sorted.markdup.recal_data.table \
-R $reference/Homo_sapiens_assembly38.fasta \
-I $outdir/bwa/${sample}.sorted.markdup.bam \
-O $outdir/bwa/${sample}.sorted.markdup.BQSR.bam && \
echo -e "\n*** ApplyBQSR done at $(date +'%T %F') ***\n" || exit 1
$samtools index $outdir/bwa/${sample}.sorted.markdup.BQSR.bam && \
echo -e "\n*** ${sample}.sorted.markdup.BQSR.bam index done at $(date +'%T %F') ***\n"
# remove some useless file, only keep ${sample}.sorted.markdup.BQSR.bam
rm -f $outdir/bwa/${sample}.bam $outdir/bwa/${sample}.sorted.bam $outdir/bwa/${sample}.sorted.markdup.bam
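The file-name handling in preprocess.sh relies on bash suffix removal (`${var%pattern}`). A minimal sketch of that step in isolation follows; `strip_fastq_suffix` is a hypothetical helper that is not part of the original script, and the example path is made up.

```bash
#!/bin/bash
# Hypothetical helper (not in preprocess.sh): derive the shared file-name
# prefix from the R1 FASTQ path, the way the script computes fq_file_name.
strip_fastq_suffix() {
    name=$(basename "$1")        # drop the directory part
    name=${name%.R1.fq.gz}       # remove this suffix if it matches
    name=${name%.R1.fastq.gz}    # otherwise try the alternative suffix
    printf '%s\n' "$name"
}

strip_fastq_suffix /data/raw/sampleA.R1.fastq.gz   # prints: sampleA
```

Because the patterns contain no wildcards, `%` and `%%` give the same result here. A file name that matches neither suffix is returned unchanged, so the Trimmomatic output names would then still carry the original extension.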
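After preprocessing, Mutect2 is run in parallel (one process per chromosome) to call variants, and GetPileupSummaries, CalculateContamination and FilterMutectCalls are then used to filter them, giving the final result.
The somatic variant calling shell script (call_somatic.sh), written by the same senior labmate with reference to the GATK website and a blog post, is as follows: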
#!/bin/bash
# set some paths
tool_path="$HOME/diploma_project/Tools"
data_path="$HOME/diploma_project/Data"
gatk="${tool_path}/gatk-4.0.11.0/gatk"
# build_version="hg38"
reference="${data_path}/reference/hg38"
GATK_bundle="${data_path}/GATK_bundle/hg38_from_ck"
prog=$0 #script name including path
bn=$(basename $0)
bin_dir=${prog%\/$bn} #same as bin_dir=${prog%/*}
bam_SM="$bin_dir/bam_SM.py" # bam_SM.py is a script that reads the sample name from a BAM file
# echo "bam_SM: $bam_SM"
if [ $# -lt 4 ]; then
echo "usage: $0 sample_name normal_bam tumor_bam outdir"
exit 1
fi
sample=$1 #e.g. humanxianshi
normal_bam=$2
tumor_bam=$3
outdir=$4
outdir=${outdir%/} #get rid of the last "/" if exist
if [ ! -d $outdir ]; then
mkdir $outdir
fi
if [ ! -d $outdir/chromosomes ]; then
mkdir $outdir/chromosomes
fi
if [ ! -d $outdir/other ]; then
mkdir $outdir/other
fi
echo -e "\n***Output to $outdir***\n"
echo -e "***Started at $(date +"%T %F")***\n"
chroms="chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY"
for i in $chroms; do
$gatk --java-options "-Xmx1G" Mutect2 \
-R $reference/Homo_sapiens_assembly38.fasta \
-I $tumor_bam \
-tumor $($bam_SM $tumor_bam) \
-L $i \
-I $normal_bam \
-normal $($bam_SM $normal_bam) \
-pon $GATK_bundle/somatic-hg38-1000g_pon.hg38.vcf.gz \
--germline-resource $GATK_bundle/af-only-gnomad.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.00003125 \
-O $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz && \
echo -e "\n*** Mutect2 $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz done at $(date +"%T %F") ***\n" || exit 1 & #put the process background
done && wait #wait until all the process finish
merge_vcfs_cmd=""
for i in $chroms; do
merge_vcfs_cmd=${merge_vcfs_cmd}"-I $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz "
done && $gatk MergeVcfs ${merge_vcfs_cmd} -O $outdir/${sample}_somatic_unfiltered.vcf.gz && \
echo -e "\n*** MergeVcfs ${outdir}/${sample}_somatic_unfiltered.vcf.gz done at $(date +"%T %F") ***\n" || exit 1
## Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.This estimation informs downstream filtering by FilterMutectCalls.
$gatk --java-options "-Xmx4G" GetPileupSummaries \
-I $tumor_bam \
-L $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-V $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-O $outdir/other/${sample}_tumor_getpileupsummaries.table && \
echo -e "\n*** GetPileupSummaries done at $(date +"%T %F") ***\n" || exit 1
$gatk --java-options "-Xmx4G" CalculateContamination \
-I $outdir/other/${sample}_tumor_getpileupsummaries.table \
-O $outdir/other/${sample}_tumor_calculatecontamination.table && \
echo -e "\n*** CalculateContamination done at $(date +"%T %F") ***\n" || exit 1
## Filter for confident somatic calls using FilterMutectCalls
## This produces a VCF callset and index. Calls that are likely true positives get the PASS label in the FILTER field,
## and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF.
$gatk --java-options "-Xmx4G" FilterMutectCalls \
-V $outdir/${sample}_somatic_unfiltered.vcf.gz \
--contamination-table $outdir/other/${sample}_tumor_calculatecontamination.table \
-O $outdir/${sample}_somatic_oncefiltered.vcf.gz && \
echo -e "\n*** FilterMutectCalls done at $(date +"%T %F") ***\n" || exit 1
## Select a subset of variants from a VCF file
$gatk SelectVariants \
-V $outdir/${sample}_somatic_oncefiltered.vcf.gz \
-O $outdir/${sample}_somatic_oncefiltered.PASS.vcf.gz \
--exclude-filtered && \
echo -e "\n*** SelectVariants: $outdir/${sample}_somatic_oncefiltered.PASS.vcf.gz done at $(date +"%T %F") ***\n"
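One caveat about the per-chromosome loop in call_somatic.sh: each Mutect2 command list is sent to the background with `&`, so the `exit 1` inside it only ends that background subshell, and a bare `wait` returns 0 regardless. A failed chromosome would therefore go unnoticed until a later step. A minimal sketch of one way to detect such failures is to wait on each PID separately. Here `call_one_chrom` is a hypothetical stand-in for the real Mutect2 command and fails on purpose for chr3.

```bash
#!/bin/bash
# Hypothetical stand-in for one per-chromosome Mutect2 call.
# It fails for chr3 so that the error path can be observed.
call_one_chrom() {
    [ "$1" != "chr3" ]
}

# Launch one background job per chromosome, then wait on every PID
# individually so that each job's exit status is seen.
# Prints the number of failed jobs.
run_per_chrom() {
    pids=""
    failed=0
    for chrom in "$@"; do
        call_one_chrom "$chrom" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}

run_per_chrom chr1 chr2 chr3 chr4   # prints: 1
```

In the real script, a non-zero count would be the signal to stop before MergeVcfs rather than merge an incomplete set of per-chromosome VCFs.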
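An interlude
After getting both scripts to run on test data from IMPACT, I started reading the GATK documentation on CNV calling tools, planning to build a CNV calling pipeline. My advisor, however, said that CNVs matter relatively little for drug recommendation and can be skipped; the focus should shift to evaluating how accurate this pipeline is.
My advisor asked me to model the evaluation on DeepVariant's benchmarking method. After reading the paper I found that DeepVariant calls variants from the sequencing data of healthy individuals and that its gold-standard variant sets are all germline variants, whereas I am studying somatic variants, so the method does not apply. I later found that one can spike variants into a normal sample to construct a simulated tumor sample and thereby obtain a gold standard. My advisor, however, wanted real data rather than simulated data, even though many current benchmarks of somatic variant calling algorithms appear to rely on simulated data.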
Benchmarking somatic mutation calling pipeline
Data source: TCGA GDC Data Portal
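Main goal: the TCGA website provides preprocessed normal-tumor paired BAM files together with the corresponding VCF files of somatic variants produced by a given tool (such as Mutect2). Using the vcfeval module of the widely used RTG Tools, I can compare my VCF with the official one and make a first assessment of whether my pipeline is built correctly.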
12.10
One sample was picked from each of three diseases for evaluation:
- Lung Squamous Cell Carcinoma (LUSC): TCGA-58-8391
- Head and Neck Squamous Cell Carcinoma (HNSC): TCGA-IQ-A61H
- Sarcoma (SARC): TCGA-DX-A7ET
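Following the scripts above, I called variants with Mutect2, filtered them, and compared the result with the TCGA VCF. The lung squamous cell carcinoma (LUSC) sample had the highest F-measure, about 0.90; the head and neck squamous cell carcinoma (HNSC) sample scored 0.86; the sarcoma (SARC) sample scored lowest, about 0.72.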
12.11
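Downloaded and tested another sarcoma sample, TCGA-Z4-A9VC (denoted SARC2). This time the F-measure was only about 0.61.
Why is the accuracy so low for the sarcoma samples?
The TCGA VCF has 974 PASS variants for the LUSC sample and 367 for the HNSC sample, but only 135 for sarcoma sample 1 and only 77 for sarcoma sample 2.
Because SARC1 and SARC2 contain so few variants, their F-measures may not be reliable; a sarcoma sample with more variants should be downloaded and tested again.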
12.12
Downloaded a third sarcoma sample, TCGA-3B-A9HT (denoted SARC3), with 1050 variants, and reran the pipeline. The F-measure was 0.80.
The results are summarized below:
Sample | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|---|
LUSC | 898 | 129 | 65 | 0.8744 | 0.9333 | 0.9029 |
HNSC | 346 | 90 | 22 | 0.7936 | 0.9402 | 0.8607 |
SARC1 | 115 | 71 | 19 | 0.6183 | 0.8603 | 0.7195 |
SARC2 | 58 | 57 | 19 | 0.5043 | 0.7564 | 0.6052 |
SARC3 | 901 | 306 | 145 | 0.7465 | 0.8620 | 0.8001 |
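The Precision and F-measure columns of the table above can be recomputed from the other columns, which is a quick sanity check on a vcfeval summary. A minimal sketch using awk follows; the helper names `precision` and `f_measure` are made up for illustration. Sensitivity is not recomputed from counts here because, as far as I understand vcfeval's summary, it counts true positives separately on the baseline side and on the call side, and the table lists only the call-side count (for LUSC, 898 / (898 + 65) would give 0.9325 rather than the reported 0.9333).

```bash
#!/bin/bash
# Hypothetical helpers for checking one row of the summary table.

# precision = TP_call / (TP_call + FP)
precision() {
    awk -v tp="$1" -v fp="$2" 'BEGIN { printf "%.4f\n", tp / (tp + fp) }'
}

# F-measure = harmonic mean of precision and sensitivity
f_measure() {
    awk -v p="$1" -v s="$2" 'BEGIN { printf "%.4f\n", 2 * p * s / (p + s) }'
}

precision 898 129         # LUSC row, prints: 0.8744
f_measure 0.8744 0.9333   # LUSC row, prints: 0.9029
```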
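The data and method are the same, so why is the accuracy still not higher?
One of the most important reasons:
According to TCGA's official data analysis pipeline, the PoN (Panel of Normals) used by TCGA's Mutect2 was built from the blood samples of thousands of normal individuals in TCGA: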
“The MuTect2 pipeline employs a "Panel of Normals" to identify additional germline mutations. This panel is generated using TCGA blood normal genomes from thousands of individuals that were curated and confidently assessed to be cancer-free. This method allows for a higher level of confidence to be assigned to somatic variants that were called by the MuTect2 pipeline.”
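The PoN used by Mutect2 in my pipeline, by contrast, is the one provided on the GATK website, which was built from normal samples of the 1000 Genomes Project.
GATK Doc #11136 states: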
"Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches."
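In other words, the samples used to build the PoN should resemble the samples under study in every technical respect. I tested on TCGA tumor-normal samples but used a PoN built from 1000 Genomes normal samples, so my results will differ somewhat from the TCGA VCF.
In addition, the official TCGA pipeline sets `--contamination_fraction_to_filter` to 0.02 and uses `cosmic.vcf` and `dbsnp.vcf`: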
#GATK 3
java -jar GenomeAnalysisTK.jar \
-T MuTect2 \
-R <reference> \
-L <region> \
-I:tumor <tumor.bam> \
-I:normal <normal.bam> \
--normal_panel <pon.vcf> \
--cosmic <cosmic.vcf> \
--dbsnp <dbsnp.vcf> \
--contamination_fraction_to_filter 0.02 \
-o <mutect_variants.vcf> \
--output_mode EMIT_VARIANTS_ONLY \
--disable_auto_index_creation_and_locking_when_reading_rods
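However, after changing `--contamination_fraction_to_filter` from its default (0) to 0.02 and testing on the LUSC and HNSC samples, the results were exactly the same as before. This parameter therefore appears to have little effect on the results and can be ignored for now.
The GATK Mutect2 documentation describes the `--cosmic` and `--dbsnp` parameters: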
“MuTect2 has the ability to use COSMIC data in conjunction with dbSNP to adjust the threshold for evidence of a variant in the normal. If a variant is present in dbSNP, but not in COSMIC, then more evidence is required from the normal sample to prove the variant is not present in germline.”
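dbSNP data can be found in the GATK Resource Bundle, and the COSMIC data can be downloaded from the COSMIC website (both the coding and the non-coding variant files are needed). After some processing (merging, converting coordinates, sorting, and rebuilding the index), the final COSMIC data file is obtained.
However, GATK4 no longer supports these two parameters and uses `--germline-resource` instead. The data file this parameter accepts is also provided in the GATK Resource Bundle. To use the dbSNP and COSMIC data, one could switch the Mutect2 part of the pipeline back to GATK3, but according to GATK staff on the GATK forum this seems unnecessary: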
"I think the team found in testing that using a germline resource, PoN, and matched normal is enough to filter out possible germline variants. The variants in the tumor sample that pass the thresholds are most likely somatic mutations, so there is no need for a "whitelist"."
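Tuning Mutect2 parameters according to the official GATK4 tutorial
Adjusting the value of `--af-of-alleles-not-in-resource`
Following an online tutorial, my senior labmate set `--af-of-alleles-not-in-resource` to 0.00003125. The GATK tutorial recommends that, for exome sequencing data with `--germline-resource` set to `af-only-gnomad_grch38.vcf.gz` from the GATK Resource Bundle, `--af-of-alleles-not-in-resource` be set to 0.0000025.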
Results before and after the change are compared below (for each sample, the bold row is the result after the change):
Sample | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|---|
LUSC | 898 | 129 | 65 | 0.8744 | 0.9333 | 0.9029 |
**LUSC** | **912** | **137** | **51** | **0.8694** | **0.9477** | **0.9069** |
HNSC | 346 | 90 | 22 | 0.7936 | 0.9402 | 0.8607 |
**HNSC** | **350** | **92** | **18** | **0.7919** | **0.9511** | **0.8642** |
SARC3 | 901 | 306 | 145 | 0.7465 | 0.8620 | 0.8001 |
**SARC3** | **906** | **318** | **140** | **0.7402** | **0.8668** | **0.7985** |
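As the table shows, changing this parameter raises sensitivity but lowers precision, i.e. it introduces more false positives. The F-measure improves slightly for LUSC and HNSC but drops slightly for SARC3; overall it changes little. With the F-measure essentially unchanged, and given the goal of finding as many variants as possible, the gain in sensitivity should outweigh the cost of the additional false positives, so `--af-of-alleles-not-in-resource` can be set to 0.0000025.
Further adding the parameter --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter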
“This filter removes from analysis paired reads whose mate maps to a different contig.”
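The GATK tutorial adds this parameter to disable the filter, so that reads whose mates map to a different chromosome are also considered and more reads are available for analysis. Testing on the LUSC, HNSC and SARC3 samples showed no change at all in the F-measure or the other metrics.
Tuning the filtering-step parameters according to the official GATK4 tutorial
1. Run GetPileupSummaries on both the normal and the tumor sample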
"CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument."
The modified code is:
## Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.
## This estimation informs downstream filtering by FilterMutectCalls.
$gatk --java-options "-Xmx4G" GetPileupSummaries \
-I $normal_bam \
-L $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-V $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-O $outdir/other/${sample}_normal_getpileupsummaries.table || exit 1
$gatk --java-options "-Xmx4G" GetPileupSummaries \
-I $tumor_bam \
-L $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-V $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-O $outdir/other/${sample}_tumor_getpileupsummaries.table && \
echo -e "\n*** GetPileupSummaries done at $(date +"%T %F") ***\n" || exit 1
$gatk --java-options "-Xmx4G" CalculateContamination \
-I $outdir/other/${sample}_tumor_getpileupsummaries.table \
-matched $outdir/other/${sample}_normal_getpileupsummaries.table \
-O $outdir/other/${sample}_tumor_calculatecontamination.table && \
echo -e "\n*** CalculateContamination done at $(date +"%T %F") ***\n" || exit 1
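However, testing on the LUSC, HNSC and SARC3 samples showed no change in the results.
The latest code keeps this modification anyway.
2. Replace the common germline variant sites VCF used in the GetPileupSummaries step
GetPileupSummaries takes a common germline variant sites VCF and tabulates how the sample's reads are distributed at those sites (it summarizes counts of reads that support reference, alternate and other alleles for given sites). Until now this file was `small_exac_common_3.hg38.vcf.gz` from the GATK Resource Bundle; following an online tutorial, I now try to generate such a VCF myself.
The GATK GetPileupSummaries tool documentation says: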
The tool requires a common germline variant sites VCF, e.g. derived from the gnomAD resource, with population allele frequencies (AF) in the INFO field. This resource must contain only biallelic SNPs and can be an eight-column sites-only VCF. The tool ignores the filter status of the variant calls in this germline resource.
The procedure in the online tutorial is consistent with this passage.
First, use GATK SelectVariants to extract the biallelic SNPs from `af-only-gnomad.hg38.vcf.gz` in the GATK Resource Bundle:
/home/yuanqm/diploma_project/Tools/gatk-4.0.11.0/gatk SelectVariants \
-R /home/yuanqm/diploma_project/Data/reference/hg38/Homo_sapiens_assembly38.fasta \
-V /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.vcf.gz \
--select-type-to-include SNP \
--restrict-alleles-to BIALLELIC \
-O /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.vcf.gz
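If `af-only-gnomad.hg38.SNP_biallelic.vcf.gz` is used directly with the old script at this point, two problems arise:
- Out of memory. The compressed file is 3.5 GB and exceeds 10 GB once decompressed, so loading it consumes a great deal of memory, while --java-options "-Xmx4G" caps the JVM heap at 4 GB. Raising or removing that limit still fails, because GetPileupSummaries reads this VCF twice (once for -L and once for -V), which exceeds the memory available on this machine (16 GB). The job should therefore be moved to a server.
- Coordinate error: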
A USER ERROR has occurred: Badly formed genome unclippedLoc:
Contig chr1_KI270766v1_alt given as location,
but this contig isn't present in the Fasta sequence dictionary.
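Solution: make one pass over `af-only-gnomad.hg38.SNP_biallelic.vcf.gz`, check the contig of every variant, and delete any variant whose contig does not appear in `Homo_sapiens_assembly38.dict`.
Also, inspecting `small_exac_common_3.hg38.vcf.gz`, the file previously used for GetPileupSummaries, shows that it contains variants only on chr1-22, chrX and chrY.
So simply keeping the variants on chr1-22, chrX and chrY from `af-only-gnomad.hg38.SNP_biallelic.vcf.gz` and discarding those on other contigs simplifies the processing, and the resulting file should still improve on the earlier `small_exac_common_3.hg38.vcf.gz`.
The processing steps for `af-only-gnomad.hg38.SNP_biallelic.vcf.gz` are: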
cd /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck
## grep cannot search compressed (binary) files directly, so decompress af-only-gnomad.hg38.SNP_biallelic.vcf.gz first
grep '^#' af-only-gnomad.hg38.SNP_biallelic.vcf > af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr[1-9][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr1[0-9][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr2[0-3][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr[XYM][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
## Then compress and index the resulting file with bcftools
/home/yuanqm/diploma_project/Tools/bcftools/bin/bcftools \
view /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.selected.vcf \
-Oz -o /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz
/home/yuanqm/diploma_project/Tools/bcftools/bin/bcftools index -t \
/home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz
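The grep steps above can also be done in a single pass, without writing the decompressed VCF to disk. A minimal sketch follows, assuming the goal is to keep header lines plus records on chr1-22, chrX and chrY (the grep version above additionally keeps chrM; add `M` to the pattern if that is wanted). `filter_primary_contigs` is a hypothetical helper that reads a VCF on standard input.

```bash
#!/bin/bash
# Hypothetical helper: keep VCF header lines and records whose CHROM
# column is exactly chr1..chr22, chrX or chrY; drop alt/decoy contigs.
filter_primary_contigs() {
    awk -F '\t' '/^#/ || $1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/'
}

# Tiny made-up input to show the effect: the _alt contig is dropped.
printf '##fileformat=VCFv4.2\nchr1\t100\nchr1_KI270766v1_alt\t5\nchr22\t7\nchrX\t9\n' |
    filter_primary_contigs
```

On the real file this could be used as `gunzip -c af-only-gnomad.hg38.SNP_biallelic.vcf.gz | filter_primary_contigs > af-only-gnomad.hg38.SNP_biallelic.selected.vcf`, after which the bcftools compression and indexing steps apply unchanged. Matching the whole CHROM field, rather than a line prefix, is what excludes names such as chr1_KI270766v1_alt.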
12.25
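Reran the program on the server with `af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz`; after running for nearly 20 hours it still failed with an out-of-memory error.
Because the samples were exome-sequenced, the exonic variants in `af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz` can be extracted as the new input, further reducing the memory load.
However, the GATK tutorial points out: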
So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.
Filtering based on cross-sample contamination therefore does not affect the final result very much, so optimizing this part can wait.
12.26
Extract the exonic variants from `af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz` as the new input:
$gatk SelectVariants \
-R path/Homo_sapiens_assembly38.fasta \
-V path/af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz \
-L path/liftover_37To38_exome.targets.bed \
-O path/af-only-gnomad.hg38.SNP_biallelic.selected.exome.vcf.gz
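The file passed to -L marks the exon regions, i.e. the regions to select.
This file was obtained as follows:
First download `1KGP.exome.targets.bed` from the 1000 Genomes website. Its coordinates are hg19, so convert them with the liftOver tool to obtain `liftover_37To38_exome.targets.bed`. Note that, to reduce failed conversions, the developers suggest ticking the "Allow multiple output regions" option and, if needed, lowering "Minimum ratio of bases that must remap" (I did not change this value).
`af-only-gnomad.hg38.SNP_biallelic.selected.exome.vcf.gz` is about 60 MB, so memory is no longer a concern. Passing this file to both -L and -V of GetPileupSummaries gives the following results: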
Run | Sample | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|---|---|
Previous | LUSC | 912 | 137 | 51 | 0.8694 | 0.9477 | 0.9069 |
Previous | HNSC | 350 | 92 | 18 | 0.7919 | 0.9511 | 0.8642 |
Previous | SARC | 906 | 318 | 140 | 0.7402 | 0.8668 | 0.7985 |
Current | LUSC | 912 | 136 | 51 | 0.8702 | 0.9477 | 0.9073 |
Current | HNSC | 350 | 91 | 18 | 0.7937 | 0.9511 | 0.8653 |
Current | SARC | 906 | 320 | 140 | 0.7390 | 0.8668 | 0.7978 |
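The final results indeed differ little from before, consistent with the GATK tutorial's point that contamination-based filtering matters relatively little.
The final script goes back to `small_exac_common_3.hg38.vcf.gz` from the GATK Resource Bundle.
Summary:
Pooling the three samples gives the final GATK4 result (using the VCF that TCGA produced with GATK3 as the baseline):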
True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|
2168 | 547 | 209 | 0.7985 | 0.9121 | 0.8515 |
Since the VCF that TCGA produced with GATK3 is not actually a gold standard, the following presentation is more appropriate:
Overlap | V1 | V2 | Overlap Rate |
---|---|---|---|
2168 | 547 | 209 | 74.15% |
Overlap: Variants that are observed both in the VCF that GATK4 derived and in the VCF that TCGA derived using GATK3
V1: Variants that are observed in the VCF that GATK4 derived but not in the VCF that TCGA derived using GATK3
V2: Variants that are observed in the VCF that TCGA derived using GATK3 but not in the VCF that GATK4 derived
Overlap Rate = Overlap / (Overlap + V1 + V2)
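The Overlap Rate defined above (the shared calls divided by the union of both call sets) can be checked with a one-line awk computation. A minimal sketch follows; `overlap_rate` is a hypothetical helper name, and its three arguments are the Overlap, V1 and V2 counts.

```bash
#!/bin/bash
# Overlap Rate = Overlap / (Overlap + V1 + V2), printed as a percentage.
overlap_rate() {
    awk -v o="$1" -v v1="$2" -v v2="$3" \
        'BEGIN { printf "%.2f%%\n", 100 * o / (o + v1 + v2) }'
}

overlap_rate 2168 547 209   # prints: 74.15%
```

Unlike precision and sensitivity, this measure treats the two VCFs symmetrically, which suits a comparison in which neither call set is a true gold standard.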
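Based on these results, my GATK4 variant calling pipeline appears to be built broadly correctly. The differences from the VCF that TCGA produced with GATK3 are probably due to the different software versions, the different PoN, and similar factors.
Variant calling with GATK3
Because the Mutect2 used by TCGA is the old version (GATK3), I now try GATK3 as well to see whether the results can be reproduced.
GATK3 can also be downloaded from the GATK website; the download is a single jar file (unlike GATK4, which comes with many files).
Note that Mutect2 in GATK3 has filtering built in: the FILTER field of the VCF it produces is already set to PASS or to the reason(s) for filtering. With GATK3, FilterMutectCalls is therefore neither needed nor usable, and so GetPileupSummaries and CalculateContamination are not needed either. SelectVariants can be used at the end to keep only the PASS variants, or this step can be skipped.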
#!/bin/bash
# Test the somatic calling pipeline using data from TCGA database.
# BAM files from TCGA have already undergone MarkDuplicates, indel realignment and BQSR.
# set some paths
tool_path="$HOME/diploma_project/Tools"
data_path="$HOME/diploma_project/Data"
gatk4="${tool_path}/gatk-4.0.11.0/gatk"
gatk3="${tool_path}/GenomeAnalysisTK-3.8-1-0/GenomeAnalysisTK.jar"
# build_version="hg38"
reference="${data_path}/reference/hg38"
GATK_bundle="${data_path}/GATK_bundle/hg38_from_ck"
if [ $# -lt 4 ]; then
echo "usage: $0 sample_name normal_bam tumor_bam outdir"
exit 1
fi
sample=$1 #e.g. TCGA-A6-6650 (patient ID)
normal_bam=$2
tumor_bam=$3
outdir=$4
outdir=${outdir%/} #get rid of the last "/" if exist
if [ ! -d $outdir ]; then
mkdir $outdir
fi
if [ ! -d $outdir/chromosomes ]; then
mkdir $outdir/chromosomes
fi
echo -e "\n***Output to $outdir***\n"
echo -e "***Started at $(date +"%T %F")***\n"
chroms="chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY"
for i in $chroms; do
java -Xmx1g -jar $gatk3 \
-T MuTect2 \
-R $reference/Homo_sapiens_assembly38.fasta \
-L $i \
-I:tumor $tumor_bam \
-I:normal $normal_bam \
--normal_panel $GATK_bundle/somatic-hg38-1000g_pon.hg38.vcf.gz \
--cosmic $data_path/GATK_bundle/other_data/hg38_cosmic_data/Cosmic_coding_and_noncoding_hg38_sorted.vcf.gz \
--dbsnp $GATK_bundle/dbsnp_146.hg38.vcf.gz \
--contamination_fraction_to_filter 0.02 \
-o $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz \
--output_mode EMIT_VARIANTS_ONLY \
--disable_auto_index_creation_and_locking_when_reading_rods && \
echo -e "\n*** Mutect2 $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz done at $(date +"%T %F") ***\n" || exit 1 & #put the process background
done && wait #wait until all the process finish
merge_vcfs_cmd=""
for i in $chroms; do
merge_vcfs_cmd=${merge_vcfs_cmd}"-I $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz "
done && $gatk4 MergeVcfs ${merge_vcfs_cmd} -O $outdir/${sample}_somatic_filtered.vcf.gz && \
echo -e "\n*** MergeVcfs ${outdir}/${sample}_somatic_filtered.vcf.gz done at $(date +"%T %F") ***\n" || exit 1
## Select a subset of variants from a VCF file
$gatk4 SelectVariants \
-V $outdir/${sample}_somatic_filtered.vcf.gz \
-O $outdir/${sample}_somatic_filtered.PASS.vcf.gz \
--exclude-filtered && \
echo -e "\n*** SelectVariants: $outdir/${sample}_somatic_filtered.PASS.vcf.gz done at $(date +"%T %F") ***\n"
## Remove some useless files.
rm -r -f $outdir/chromosomes
Treating the VCF that TCGA produced with Mutect2 as the gold standard and comparing the variants found by my GATK3 script against it gives:
My GATK3 vs GATK3 in TCGA:
Sample | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|---|
LUSC | 974 | 55 | 1 | 0.9466 | 0.9990 | 0.9721 |
HNSC | 366 | 34 | 2 | 0.9150 | 0.9946 | 0.9531 |
SARC | 988 | 112 | 63 | 0.8982 | 0.9401 | 0.9186 |
Pooling the three samples gives the final result for my GATK3 pipeline (using the VCF that TCGA produced with GATK3 as the baseline):
True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|
2328 | 201 | 66 | 0.9205 | 0.9724 | 0.9457 |
Since the VCF that TCGA produced with GATK3 is not actually a gold standard, the results can be presented as follows:
Overlap | V1 | V2 | Overlap Rate |
---|---|---|---|
2328 | 201 | 66 | 89.71% |
Overlap: Variants that are observed both in the VCF that my GATK3 derived and in the VCF that TCGA derived using GATK3
V1: Variants that are observed in the VCF that my GATK3 derived but not in the VCF that TCGA derived using GATK3
V2: Variants that are observed in the VCF that TCGA derived using GATK3 but not in the VCF that my GATK3 derived
Overlap Rate = Overlap / (Overlap + V1 + V2)
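My GATK3 results are very close to TCGA's, although some difference is unavoidable because the PoN differs. This confirms that my GATK3 pipeline is essentially correct.
Building an approximate gold standard from the VCFs of the other three tools on TCGA, to compare GATK3 and GATK4
The results above suggest that both my GATK3 and my GATK4 pipelines are essentially correct. I now combine the VCFs produced by the other three tools on TCGA and treat variants found by at least two tools as high-confidence, which gives an approximate gold standard. With it, I compare the performance of TCGA's GATK3, my GATK3, and my GATK4.
Building the gold standard
VCF preprocessing:
The VarScan2 VCF header contains this line: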
“##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">”
and the SomaticSniper VCF header contains a similar line:
“##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">”
Merging the two VCFs directly with CombineVariants raises a conflict error, so I manually changed this header line in the VarScan2 VCF to the SomaticSniper form.
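The manual header edit can also be scripted. The sketch below is one possible way to do it with sed, under the assumption that only the Number and Type attributes of the DP4 line need to agree for the merge to proceed; it leaves the VarScan2 description text as it is, and `fix_dp4_header` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: rewrite the VarScan2 DP4 FORMAT header so that its
# Number/Type match the SomaticSniper definition (Number=4, Type=Integer).
# Reads a VCF on stdin and writes the modified VCF to stdout.
fix_dp4_header() {
    sed 's/^##FORMAT=<ID=DP4,Number=1,Type=String,/##FORMAT=<ID=DP4,Number=4,Type=Integer,/'
}

printf '%s\n' '##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts">' |
    fix_dp4_header
```

Only the header line is touched. Whether the DP4 values in the records are valid as four integers should be checked on the actual file before merging.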
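In addition, the SomaticSniper and VarScan2 VCFs contain only variants that have already been filtered, whereas the MuSE VCF contains both PASS and filtered variants. The PASS variants should be extracted so that the merged VCF contains only PASS variants, which makes it easier to pick out variants that appear in two or more VCFs.
## Example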
gatk SelectVariants \
-V '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/MuSE/488e3499-debf-4d33-8d09-4ce4e8c501a4.vcf' \
--exclude-filtered \
-O '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/MuSE/488e3499-debf-4d33-8d09-4ce4e8c501a4.PASS.vcf'
Merge the three VCFs with CombineVariants:
Note: CombineVariants is a GATK3 tool.
## Example
java -jar .../diploma_project/Tools/GenomeAnalysisTK-3.8-1-0/GenomeAnalysisTK.jar \
-T CombineVariants \
-R .../diploma_project/Data/reference/hg38/Homo_sapiens_assembly38.fasta \
--variant:VarScan2 .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/VarScan2/b25b354f-75a8-4309-9e9a-95b0cf7149af.vcf \
--variant:MuSE .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/MuSE/488e3499-debf-4d33-8d09-4ce4e8c501a4.PASS.vcf \
--variant:SomaticSniper .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SomaticSniper/14f0b05f-4b9f-4135-aa44-1f9bcc7fc59f.vcf \
-o .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SARC3_union3_PASS.vcf \
-genotypeMergeOptions PRIORITIZE \
-priority VarScan2,MuSE,SomaticSniper
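In PRIORITIZE mode, when a variant appears in several files the merged VCF keeps only one of the records; which one is kept is determined by the order given to -priority. VarScan2 gets the highest priority because the IMPACT database used VarScan2 to call variants, which suggests it is fairly reliable. The SomaticSniper VCF has "." in every FILTER field and contains the most records, which makes it seem less reliable, so it gets the lowest priority.
Select the high-confidence variants: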
gatk SelectVariants \
-V '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SARC3_union3_PASS.vcf' \
-select "set == 'VarScan2-SomaticSniper' || set == 'VarScan2-MuSE' || set == 'MuSE-SomaticSniper' || set == 'Intersection'" \
-O '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SARC3_union3_PASS_confident.vcf'
The -select argument of SelectVariants takes JEXL expressions; see the corresponding tutorial.
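For a quick cross-check of the JEXL selection, the same idea can be approximated with awk on the `set=` annotation that CombineVariants writes into the INFO column. This is only a sketch and not the JEXL evaluation itself: `select_confident` is a hypothetical helper, and it assumes that `set` is either `Intersection` or a `-`-separated list of caller names in which filtered inputs carry a `filterIn` prefix.

```bash
#!/bin/bash
# Hypothetical helper: keep header lines and records supported by at
# least two callers according to the set=... INFO annotation.
select_confident() {
    awk -F '\t' '
        /^#/ { print; next }
        {
            keep = 0
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^set=/) continue
                value = substr(info[i], 5)
                if (value == "Intersection") { keep = 1; break }
                callers = 0
                m = split(value, parts, "-")
                for (j = 1; j <= m; j++)
                    if (parts[j] !~ /^filterIn/) callers++
                if (callers >= 2) keep = 1
            }
            if (keep) print
        }'
}
```

Piping the merged VCF through `select_confident` and counting the non-header lines should give the same number of records as the SelectVariants output if the assumptions above hold for the file.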
GATK3 vs GATK4
Sample | Tool | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|---|---|
LUSC | GATK3_TCGA | 842 | 133 | 178 | 0.8636 | 0.8255 | 0.8441 |
LUSC | My_GATK3 | 880 | 149 | 140 | 0.8552 | 0.8627 | 0.8590 |
LUSC | My_GATK4 | 862 | 187 | 151 | 0.8217 | 0.8520 | 0.8366 |
LUSC | My_GATK4.1 | 862 | 186 | 151 | 0.8225 | 0.8520 | 0.8370 |
HNSC | GATK3_TCGA | 273 | 95 | 110 | 0.7418 | 0.7128 | 0.7270 |
HNSC | My_GATK3 | 283 | 117 | 100 | 0.7075 | 0.7389 | 0.7229 |
HNSC | My_GATK4 | 290 | 152 | 93 | 0.6561 | 0.7572 | 0.7030 |
HNSC | My_GATK4.1 | 290 | 151 | 93 | 0.6576 | 0.7572 | 0.7039 |
SARC | GATK3_TCGA | 554 | 497 | 238 | 0.5271 | 0.6995 | 0.6012 |
SARC | My_GATK3 | 596 | 504 | 196 | 0.5418 | 0.7525 | 0.6300 |
SARC | My_GATK4 | 610 | 614 | 179 | 0.4984 | 0.7740 | 0.6063 |
SARC | My_GATK4.1 | 610 | 616 | 179 | 0.4976 | 0.7740 | 0.6057 |
Pooled results over the three samples:
Tool | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
---|---|---|---|---|---|---|
GATK3_TCGA | 1669 | 725 | 526 | 0.6972 | 0.7604 | 0.7274 |
My_GATK3 | 1759 | 770 | 436 | 0.6955 | 0.8014 | 0.7447 |
My_GATK4 | 1762 | 953 | 423 | 0.6490 | 0.8064 | 0.7192 |
My_GATK4.1 | 1762 | 953 | 423 | 0.6490 | 0.8064 | 0.7192 |
Here My_GATK4 uses the original `small_exac_common_3.hg38.vcf.gz` in the GetPileupSummaries step, while My_GATK4.1 uses the self-built `af-only-gnomad.hg38.SNP_biallelic.selected.exome.vcf.gz`. As before, there is no notable difference between the two.
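The pooled rows are obtained by summing the per-sample counts and recomputing the metrics from the sums. A minimal sketch follows; `pool_counts` is a hypothetical helper that reads lines of the form `sample TP FP FN`, and it uses the same TP count for both precision and sensitivity, as the pooled tables in this post do.

```bash
#!/bin/bash
# Hypothetical helper: read "sample TP FP FN" lines on stdin and print
# pooled TP, FP, FN followed by precision, sensitivity and F-measure.
pool_counts() {
    awk '
        { tp += $2; fp += $3; fn += $4 }
        END {
            p = tp / (tp + fp)
            s = tp / (tp + fn)
            printf "%d %d %d %.4f %.4f %.4f\n", tp, fp, fn, p, s, 2 * p * s / (p + s)
        }'
}

# My_GATK3 rows from the per-sample table above.
printf '%s\n' 'LUSC 880 149 140' 'HNSC 283 117 100' 'SARC 596 504 196' | pool_counts
# prints: 1759 770 436 0.6955 0.8014 0.7447
```

Pooling counts in this way weights each sample by its number of variants, so a sample with many calls, such as the SARC sample with its many false positives, dominates the pooled figures.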
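My GATK4 pipeline performs worst of the three. This is not very surprising, since the GATK website states that many GATK4 features are still under testing and development and are not suitable for production use. Other possible explanations are that my GATK4 parameters are not yet optimal, or that the approximate gold standard itself is not fully correct.
Finally, my GATK3 pipeline gives better results than TCGA's GATK3.
The final mutation analysis pipeline for cancer patients can therefore be built with GATK3.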