What is GWAS imputation?
- Genotype imputation uses the principle of linkage disequilibrium (LD) to fill in untyped variants in a study dataset based on a densely genotyped reference panel.
- Commonly used reference panels include the HapMap Consortium database (International HapMap 3 Consortium, 2010) and the 1000 Genomes Project (1KG; 1000 Genomes Project Consortium, 2012). The 1000 Genomes Project covers whole-genome data for 1,092 individuals from around the world (181 samples from admixed American, 246 from African, 286 from East Asian, and 379 from European ancestry groups).
The principle is illustrated in the figure below:
[Figure: schematic of genotype imputation against a reference haplotype panel]
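As a toy illustration of the idea (hypothetical alleles, not real data): suppose the array genotypes SNP1 and SNP3 but not SNP2, and the reference panel contains the following haplotypes:
hap1:  SNP1=A  SNP2=C  SNP3=G
hap2:  SNP1=A  SNP2=C  SNP3=G
hap3:  SNP1=T  SNP2=T  SNP3=A
A study chromosome observed as A and G at SNP1/SNP3 matches hap1 and hap2, so SNP2 is imputed as C with high posterior probability; the confidence of the call depends on how strong the LD between SNP2 and the typed SNPs is in the reference panel.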
Why perform GWAS imputation?
A typical GWAS genotyping array assays about 600,000 SNPs, of which only some 300,000 remain after quality control. Against the roughly 3 billion bases of the human genome, that covers only about one position in ten thousand, leaving large stretches of the genome without any typed variant. After genotype imputation the SNP density increases greatly: if the truly associated position carries no genotyped SNP, it shows no signal before imputation but may reach significance afterwards. Genotype imputation can therefore substantially increase the statistical power of a GWAS.
Commonly used GWAS imputation software
- Minimac, IMPUTE2, PLINK, BEAGLE, MaCH+minimac, fastPHASE
GWAS imputation workflow
1. Online services
- Many websites offer GWAS imputation as an online service; you simply upload your data (see the preparation sketch below), e.g. https://imputationserver.sph.umich.edu/index.html#!pages/help
https://imputation.sanger.ac.uk/
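The Michigan server, for example, expects bgzipped VCF files split by chromosome. A minimal preparation sketch, assuming the study data are in PLINK binary format and that plink and bgzip are on the PATH (file names are placeholders):
# export each autosome as a separate VCF and compress it with bgzip
for i in {1..22}
do
    plink --bfile mydata --chr $i --recode vcf --out mydata_chr$i
    bgzip mydata_chr$i.vcf
done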
2. Running locally
The general imputation workflow is:
- Liftover PED/MAP files from build 36 to build 37: SNP positions can differ between genome builds, so the study data must be converted to the same build as the reference panel before matching. This is usually called the pre-processing step (see the sketch after this list).
Specifically:
- Create a BED file from the MAP file.
- Map the BED file to build 37 using the UCSC liftOver tool.
- Create a list of unmapped SNPs.
- Create the PLINK mapping file.
- Create a new PLINK dataset without the unmapped SNPs using PLINK.
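A hedged sketch of these sub-steps, assuming build-36 (hg18) input, the UCSC liftOver binary and the hg18ToHg19 chain file; all file names are placeholders, and the rare case of a SNP moving to a different chromosome is ignored:
# 1) UCSC BED intervals (0-based, half-open) from the PLINK .map file
awk '{print "chr"$1"\t"$4-1"\t"$4"\t"$2}' study.map > study_b36.bed
# 2) lift the intervals over to build 37
liftOver study_b36.bed hg18ToHg19.over.chain.gz study_b37.bed study_unlifted.bed
# 3) IDs of SNPs that failed to map
grep -v '^#' study_unlifted.bed | cut -f 4 > unmapped_snps.txt
# 4) new build-37 positions in "SNP-ID  position" format for PLINK
awk '{print $4"\t"$3}' study_b37.bed > new_positions_b37.txt
# 5) drop unmapped SNPs and update the remaining positions
plink --file study --exclude unmapped_snps.txt \
      --update-map new_positions_b37.txt --make-bed --out study_b37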
- Quality control of study data: QC of the raw data improves imputation accuracy.
Specifically:
- Align the study data to the reference panel, check whether each SNP is reported on the forward or reverse strand, and filter out flipped SNPs.
- Require SNPs to satisfy Hardy-Weinberg equilibrium (p > 1e-4), MAF > 0.01 and call rate > 0.95; SNPs failing these thresholds are removed.
- Check whether each SNP is present in the reference panel; SNPs absent from the reference are also removed.
- Check whether the allele frequency of each SNP differs too much between the study data and the reference panel; SNPs whose frequencies differ by more than 25% are removed (see the sketch after this list).
- Compare the haplotype structure between the study data and the reference panel (for SNPs with r-squared > 0.1); SNPs with large discrepancies are removed.
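As an illustration of the frequency check, a rough sketch that compares study MAFs from plink --freq against a two-column reference table (ref_freq.txt, holding SNP ID and reference allele frequency, is a hypothetical file, and the alleles are assumed to be already harmonized):
plink --bfile study_b37 --freq --out study_b37
# flag SNPs whose frequency differs from the reference by more than 0.25
awk 'NR==FNR {ref[$1]=$2; next}
     FNR>1 && ($2 in ref) {d=$5-ref[$2]; if (d<0) d=-d; if (d>0.25) print $2}' \
    ref_freq.txt study_b37.frq > freq_mismatch.exclude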
- Phasing the study data
Specifically:
- Gather the map files as provided in the ALL_1000G_phase1integrated_v3_impute.tgz from the impute website.
- Phase the study data using the map files.
- Imputing the study data
- Split the data by chromosome and impute the phased data with IMPUTE2; imputation is typically run in chunks of about 5 Mb.
- Merge the imputed chunks back together.
Pipelines:
Because the workflow involves many steps, several automated pipelines have been developed to make imputation easier, e.g.
- https://github.com/CNSGenomics/impute-pipe
- https://github.com/pgxcentre/genipe
- http://www.arrayserver.com/wiki/index.php?title=Genetics_GwasImputePipeline.pdf
- https://github.com/molgenis/molgenis-pipelines
Step 1: QC
##QC1: Remove SNPs with missing genotype rate >5%
${plink} --bfile ${outpath4step}/${name} --missing --out ${outpath4step}/${name}.plink
awk 'NR > 1 && $5 > 0.05 {print $2}' ${outpath4step}/${name}.plink.lmiss > ${outpath4step}/${name}.plink.lmiss.exclude
${plink} --bfile ${outpath4step}/${name} \
--exclude ${outpath4step}/${name}.plink.lmiss.exclude \
--make-bed --out ${outpath4step}/${name}.plink_QC1
##QC2: Remove SNPs deviating from HWE (p < 1e-4)
${plink} --bfile ${outpath4step}/${name}.plink_QC1 \
--hardy --out ${outpath4step}/${name}.plink_QC1
awk '$3=="UNAFF" && $9 <0.0001 {print $2}' ${outpath4step}/${name}.plink_QC1.hwe \
> ${outpath4step}/${name}.plink_QC1.hwe.exclude
${plink} --bfile ${outpath4step}/${name}.plink_QC1 \
--exclude ${outpath4step}/${name}.plink_QC1.hwe.exclude \
--make-bed --out ${outpath4step}/${name}.plink_QC2
##QC3: Remove SNPs with MAF < 0.01
${plink} --bfile ${outpath4step}/${name}.plink_QC2 \
--maf 0.01 \
--make-bed \
--out ${outpath4step}/${name}.plink_QC3
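The three filters above can also be applied in a single PLINK call, which should give essentially the same result (--hwe in PLINK normally restricts the test to controls for case/control data, matching the UNAFF filter above):
${plink} --bfile ${outpath4step}/${name} \
    --geno 0.05 --hwe 1e-4 --maf 0.01 \
    --make-bed --out ${outpath4step}/${name}.plink_QC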
Step 2: Alignment of the SNPs
# make sure that the GWAS dataset is well aligned with the reference panel of haplotypes
# if not, use the UCSC liftOver tool to perform the conversion to build37 coordinates
# download the 1000 Genomes reference files
for i in {1..22}
do
nohup wget -c -q http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz &
nohup wget -c -q http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi &
done
# Combine all chromosomes in a single file
bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-Oz -o ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --threads 48
# Preparing a VCF file
bgzip -c example.vcf > example.vcf.gz
tabix -p vcf example.vcf.gz
# convert VCF to PLINK
# To build PLINK-compatible files from the VCF files, duplicate positions and SNP IDs need to be merged or removed.
# Remove all duplicate entries.
# Catalogue duplicate SNP IDs:
grep -v '^#' <(zcat ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | cut -f 3 | sort | uniq -d > ALL.autosomes.dups
# Using BCFtools, remove duplicate records and join biallelic records into multi-allelic sites (-m +any), then use PLINK to drop the duplicate SNP IDs found in the previous step:
bcftools norm -d both -m +any -Ob ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --threads 64 \
| plink --bcf /dev/stdin --make-bed --out ALL.autosomes --allow-extra-chr 0 --memory 60000 --exclude ALL.autosomes.dups
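# GenotypeHarmonizer aligns the strand and allele coding of the study genotypes to the
# reference panel (the 1000G VCF built above); --update-id replaces study variant IDs
# with the identifiers used in the reference.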
java -jar GenotypeHarmonizer.jar \
--inputType PLINK_BED \
--input /hapmap3CeuChr20B37Mb6RandomStrand \
--update-id \
--outputType PLINK_BED \
--output /all_chrs \
--refType VCF \
--ref /ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Step 3: Pre-phasing using SHAPEIT2
shapeit -B all_chrs \
-M /home/zhouwei/zhou_data/GWAS/imputation/impute-pipe/data/reference/ALL_1000G_phase1integrated_v3_impute/genetic_map_chr20_combined_b37.txt \
-O phased_all_chrs -T 64
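The command above follows the chr20 example data; for a genome-wide dataset phasing is normally run per chromosome with the matching genetic map, e.g. (a sketch with placeholder file names):
for i in {1..22}
do
    shapeit -B study_chr$i \
            -M genetic_map_chr${i}_combined_b37.txt \
            -O phased_chr$i -T 8
done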
Step 4: Imputation into pre-phased haplotypes
impute2 -use_prephased_g -Ne 20000 -iter 30 -align_by_maf_g -os 0 1 2 3 -seed 1000000 \
-o_gz -int 1 5e6 \
-h /ALL_1000G_phase1integrated_v3_chr20_impute.hap.gz \
-l /ALL_1000G_phase1integrated_v3_chr20_impute.legend.gz \
-m /genetic_map_chr20_combined_b37.txt \
-known_haps_g phased_all_chrs.haps \
-o /chr20.chunk1
impute2 \
-use_prephased_g -Ne 20000 -iter 30 -align_by_maf_g -os 0 1 2 3 -seed 1000000 -o_gz -int 5e6 10e6 \
-h /ALL_1000G_phase1integrated_v3_chr20_impute.hap.gz \
-l /ALL_1000G_phase1integrated_v3_chr20_impute.legend.gz \
-m /genetic_map_chr20_combined_b37.txt \
-known_haps_g phased_all_chrs.haps \
-o chr20.chunk2
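Rather than writing one command per chunk, the 5-Mb windows can be generated in a loop; a sketch for chromosome 20 (the 63-Mb chromosome length and the file paths are placeholders):
chr=20
chunk=1
for start in $(seq 1 5000000 63000000)
do
    end=$((start + 5000000 - 1))
    impute2 -use_prephased_g -Ne 20000 -iter 30 -align_by_maf_g -os 0 1 2 3 -seed 1000000 -o_gz \
        -int $start $end \
        -h ALL_1000G_phase1integrated_v3_chr${chr}_impute.hap.gz \
        -l ALL_1000G_phase1integrated_v3_chr${chr}_impute.legend.gz \
        -m genetic_map_chr${chr}_combined_b37.txt \
        -known_haps_g phased_all_chrs.haps \
        -o chr${chr}.chunk${chunk}
    chunk=$((chunk + 1))
done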
Step 5: Combine all the chunks
cat chr20.chunk1.gz chr20.chunk2.gz > chr20_chunkAll.gen.gz
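Concatenated gzip streams form a valid gzip file, so cat on the per-chunk outputs works directly. If a combined *_info summary is also wanted, keep only the first header line (this assumes the per-chunk info files are plain text; zcat them first if -o_gz compressed them as well):
awk 'FNR > 1 || NR == 1' chr20.chunk1_info chr20.chunk2_info > chr20_chunkAll_info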
To be completed.
References:
http://www.bbmriwiki.nl/wiki/Impute2Pipeline
http://databeauty.com/blog/tutorial/2017/02/20/GWAS-prephasing-and-imputation.html