输入文件格式转换
annovar目前支持以下格式的输入文件:
(1)Samtools genotype-calling pileup format, (2)Illumina CASAVA format,
(3)SOLiD GFF genotype-calling format,
(4)Complete Genomics variant format,
(5)SOAPsnp format,
(6)MAQ format,
(7)VCF format,这是最常用的类型。例如VCF_v4.0版本的格式:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
16 50745926 rs2066844 C T 80 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T G 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1230288 . T . 50 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
可以看出这个vcf是多个样本joint calling后的cvf,后续用annovar注释时需要根据是否注释每个位点而做不同处理。
使用如下命令转换格式:
[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf > ex2.avinput
WARNING to old ANNOVAR users: this program no longer does line-to-line conversion for multi-sample VCF files. If you want to include all variants in output, use '-format vcf4old' or use '-format vcf4 -allsample -withfreq' instead.
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 2 SNPs (1 transitions and 1 transversions) and 1 indels/substitutions for 1 sample (but input contains 3 samples)
WARNING: Skipped 1 invalid alternative alleles found in input file
[kaiwang@biocluster ~/]$ cat ex2.avinput
20 1110696 1110696 A G het 67 6
20 1110696 1110696 A T het 67 6
20 1234568 1234570 TCT - het 50 4
这样转换成annovar的input格式后,发现原先VCF中的INFO列不见了,如果需要保留的话,在命令行加上 -includeinfo 参数即可:
[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample -includeinfo
NOTICE: output files will be written to ex2.<samplename>.avinput
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 10 SNPs (6 transitions and 4 transversions) and 3 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file
[kaiwang@biocluster ~/]$ cat ex2.NA00001.avinput
20 1110696 1110696 A G 20 1110696 rs6040355 A G 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1110696 1110696 A T 20 1110696 rs6040355 A T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1234568 1234570 TCT - 20 1234567 microsat1 GTCT G 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
细心的你应该发现了问题,原来的VCF中有7个位点,转换后只剩下了3个位点,这是因为:这个vcf文件是3个样本的结果,默认情况下只会输出第一个样本的结果,但是第一个样本的第1、2、3、5、6行的基因型都是 0|0,即没有发生变异,因此不输出;而第4行的基因型是1|2,发生了2个变异(20 1110696 rs6040355 A G,T );第7行的基因型是 0/1,发生1次变异(20 1234567 microsat1 GTCT G,GTACT 说明一下,在vcf中发生deletion时会把前一个碱基也输出,因此这里的G实际上是不存在的)。
如果想要把3个样本的结果都输出,加上 -allsample 参数:
[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2 -allsample
NOTICE: output files will be written to ex2.<samplename>.avinput
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 10 SNPs (6 transitions and 4 transversions) and 3 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file
[kaiwang@biocluster ~/]$ cat ex2.NA00001.avinput
20 1110696 1110696 A G het 67 6
20 1110696 1110696 A T het 67 6
20 1234568 1234570 TCT - het 50 4
[kaiwang@biocluster ~/]$ cat ex2.NA00002.avinput
16 50745926 50745926 C T het 80 8
20 14370 14370 G A het 29 8
20 17330 17330 T A het 3 5
20 1110696 1110696 A G het 67 0
20 1110696 1110696 A T het 67 0
20 1234567 1234570 GTCT GTACT het 50 2
[kaiwang@biocluster ~/]$ cat ex2.NA00003.avinput
16 50745926 50745926 C T hom 80 5
20 14370 14370 G A hom 29 5
20 1110696 1110696 A T hom 67 4
20 1234568 1234570 TCT - hom 50 3
最后,如果想要最为全面的信息,如下:
[kaiwang@biocluster ~/]$ convert2annovar.pl -format vcf4 example/ex2.vcf -outfile ex2.avinput -allsample -withfreq -include
NOTICE: Finished reading 25 lines from VCF file
NOTICE: A total of 7 locus in VCF file passed QC threshold, representing 6 SNPs (3 transitions and 3 transversions) and 3 indels/substitutions
NOTICE: Finished writing 17 SNPs (8 transitions and 9 transversions) and 6 indels/substitutions for 3 samples
WARNING: Skipped 1 invalid alternative alleles found in input file
[kaiwang@biocluster ~/]$ cat ex2.avinput
16 50745926 50745926 C T 0.5 80 5 16 50745926 rs2066844 C T 80 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 14370 14370 G A 0.5 29 5 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 17330 T A 0.1667 3 3 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 1110696 A G 0.5 67 0 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1110696 1110696 A T 0.6667 67 4 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 1230237 T G 0 47 2 20 1230237 . T G 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1230288 1230288 T 0 0 50 2 20 1230288 . T . 50 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234568 1234570 TCT - 0.75 50 3 20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
20 1234567 1234570 GTCT GTACT 0.5 50 2 20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
原文传送:
http://annovar.openbioinformatics.org/en/latest/user-guide/input/