Ensemble Tools for Predicting Variant Deleteriousness

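Given the candidate variants called from sequencing data, how do we decide whether a variant is harmful? Accurately separating neutral variants from pathogenic ones matters a great deal for clinical testing of genetic disease. Studies show that a single exome still carries roughly 400 rare nonsynonymous variants even after filtering on population frequency (below 1%) and functional consequence [1,2]. Accurate deleteriousness prediction that picks the pathogenic variants out of this large candidate pool would therefore go a long way toward supporting definitive diagnosis and early intervention for genetic disease.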
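The frequency-and-function filter described above can be sketched in a few lines of Python. This is only an illustration: the `Variant` record, the set of consequence labels treated as nonsynonymous, and the toy calls are all assumptions made for the example, not the output format of any particular annotation pipeline.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    consequence: str    # e.g. "missense", "synonymous" (labels are hypothetical)
    allele_freq: float  # population allele frequency; 0.0 if never observed


# Assumption: consequence labels counted as nonsynonymous in this sketch.
NONSYNONYMOUS = {"missense", "stop_gained", "stop_lost", "start_lost"}


def rare_nonsynonymous(variants, max_af=0.01):
    """Keep variants that are nonsynonymous and rarer than max_af."""
    return [
        v for v in variants
        if v.consequence in NONSYNONYMOUS and v.allele_freq < max_af
    ]


# Toy calls, invented for illustration only.
calls = [
    Variant("GENE_A", "missense", 0.0004),
    Variant("GENE_B", "synonymous", 0.0002),
    Variant("GENE_C", "missense", 0.12),
    Variant("GENE_D", "stop_gained", 0.0),
]
kept = rare_nonsynonymous(calls)
print([v.gene for v in kept])  # ['GENE_A', 'GENE_D']
```

On a real exome this step leaves the few hundred candidates mentioned above, which is the pool that deleteriousness predictors are then asked to rank.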

Many deleteriousness predictors have been published. dbNSFP is a continually updated annotation resource for human nonsynonymous single-nucleotide variants (nsSNVs); it currently covers 84,013,490 nsSNVs and splicing-site SNVs (ssSNVs). The latest release, dbNSFP v4.0, includes 29 deleteriousness predictors (SIFT, SIFT4G, Polyphen2-HDIV, Polyphen2-HVAR, LRT, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, CADD, VEST4, PROVEAN, FATHMM-MKL coding, FATHMM-XF coding, fitCons, LINSIGHT, DANN, GenoCanyon, Eigen, Eigen-PC, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, ALoFT) and 9 conservation scores (PhyloP x 3, phastCons x 3, GERP++, SiPhy, bStatistic). Other annotations include population frequencies from 1000 Genomes Project phase 3, the UK10K cohorts, the ExAC consortium, gnomAD, and ESP6500, together with a number of gene-level annotations. dbNSFP makes site-level annotation convenient, and it also shows that more than 40 site-level deleteriousness predictors are now available.
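As a rough idea of what site-level annotation from a dbNSFP-style table looks like in practice, here is a minimal Python sketch that reads a tab-delimited excerpt and pulls out a few scores. The column names, the use of `.` for missing values, and `;` as the separator between per-transcript values are assumptions based on how dbNSFP files are commonly laid out; check them against the header and README of the release you actually download. The two sample rows are invented.

```python
import csv
import io

# Invented excerpt; column names are assumed, verify against the real header.
SAMPLE = (
    "#chr\tpos(1-based)\tref\talt\tSIFT_score\tREVEL_score\tCADD_phred\n"
    "1\t12345\tA\tG\t0.02;0.10;.\t0.81\t26.4\n"
    "2\t67890\tC\tT\t.\t.\t3.1\n"
)


def parse_scores(field):
    """Split a ';'-separated score field and drop '.' placeholders."""
    return [float(x) for x in field.split(";") if x not in (".", "")]


def read_rows(handle):
    """Yield one summary dict per variant row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        sift = parse_scores(row["SIFT_score"])
        revel = parse_scores(row["REVEL_score"])
        cadd = parse_scores(row["CADD_phred"])
        yield {
            "key": (row["#chr"], int(row["pos(1-based)"]), row["ref"], row["alt"]),
            # Lower SIFT scores are more damaging, so keep the minimum.
            "sift_min": min(sift) if sift else None,
            # Higher REVEL and CADD scores are more damaging, so keep the maximum.
            "revel_max": max(revel) if revel else None,
            "cadd_phred": max(cadd) if cadd else None,
        }


records = list(read_rows(io.StringIO(SAMPLE)))
for record in records:
    print(record)
```

Taking the most damaging value across transcripts is one possible choice for this sketch; matching scores to a specific transcript is the more careful option when the transcript of interest is known.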

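Following the classification by Kai Wang and Xiaoming Liu [3] (Liu is also the author of dbNSFP), predictors fall into three groups according to their underlying principle and method: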

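  • Change in protein function: the variant alters the protein's three-dimensional conformation and, in turn, damages its physiological function. Examples: PolyPhen-2, SIFT, MutationTaster, MutationAssessor, FATHMM, LRT.
  • Evolutionary conservation: nucleotide or protein sequences from many species are multiply aligned, and the variability of the homologous sequences is analyzed. Examples: GERP++, SiPhy, PhyloP.
  • Ensemble tools: these combine the output of several predictors with additional features and train a model on the resulting multidimensional description of each variant, using machine learning or related algorithms. Examples: CADD, DANN, MetaSVM, MetaLR, CONDEL, M-CAP, REVEL.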
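To make the ensemble idea concrete, the sketch below trains a small logistic regression on the scores of three component predictors, in the spirit of tools such as MetaLR. It is not the implementation of any published tool: the training data are synthetic, the three features are assumed to be rescaled so that higher means more damaging, and the learning rate and epoch count are arbitrary choices for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the ensemble probability that a variant is deleterious."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def toy_variant(label):
    """Synthetic scores from three hypothetical component predictors."""
    centre = 0.7 if label == 1 else 0.3
    return [min(1.0, max(0.0, random.gauss(centre, 0.15))) for _ in range(3)]


random.seed(0)
labels = [1] * 50 + [0] * 50  # 1 = deleterious, 0 = neutral (made-up labels)
X = [toy_variant(label) for label in labels]
w, b = train_logistic(X, labels)

p_high = predict(w, b, [0.9, 0.8, 0.85])  # all three predictors say damaging
p_low = predict(w, b, [0.1, 0.2, 0.15])   # all three predictors say benign
print(round(p_high, 3), round(p_low, 3))
```

Published tools differ mainly in what goes into the feature vector and in the model fitted on top of it (SVM, random forest, gradient boosting, neural networks), but the overall shape of the pipeline is the same: component scores in, one combined score out.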

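Because ensemble tools combine the results of several predictors and add their own algorithms and features, they improve the accuracy and sensitivity of pathogenicity calls. Many such tools have been published in recent years; they are summarized below: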

| Name | Website | Published | Features | Training set | Algorithm |
| --- | --- | --- | --- | --- | --- |
| VEST | http://karchinlab.org/apps/appVest.html | 28-May-13 | Full set of 86 features used to build the VEST classifier | ~45,000 disease mutations from the latest Human Gene Mutation Database release and ~45,000 high-frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project | Supervised machine learning (Random Forest) |
| CADD | http://cadd.gs.washington.edu/ | 2-Feb-14 | 63 annotations (949 features) | 13,141,299 SNVs, 627,071 insertions and 926,968 deletions from both the simulated and the observed variant data sets | Support vector machine (SVM) |
| DANN | https://cbcl.ics.uci.edu/public_data/DANN/ | 22-Oct-14 | Same as CADD | Same as CADD | Deep neural network (DNN) |
| MetaSVM | doi: 10.1093/hmg/ddu733 | 22-Dec-14 | Nine scores (SIFT, PolyPhen-2, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy and PhyloP), plus allele frequency observed in diverse populations of the 1000 Genomes Project | 14,191 deleterious mutations annotated as causing Mendelian disease and 22,001 neutral mutations annotated as not known to be associated with any phenotype, all based on UniProt annotation | Support vector machine (SVM) |
| MetaLR | doi: 10.1093/hmg/ddu733 | 22-Dec-14 | Same nine scores and allele frequency as MetaSVM | Same as MetaSVM | Logistic regression (LR) |
| Eigen | http://www.columbia.edu/~ii2135/eigen.html | 4-Jan-16 | Protein function scores (SIFT, PolyPhen, Mutation Assessor); evolutionary conservation scores (GERP_NR and GERP_RS; PhyloP primate (PhyloPri), placental mammal (PhyloPla) and vertebrate (PhyloVer)); allele frequencies in four populations (African (1-AF_AFR), European (1-AF_EUR), East Asian (1-AF_ASN) and admixed American (1-AF_AMR)) from the 1000 Genomes Project (November 2014) | ~76.7 million coding nonsynonymous variants | Unsupervised integration of the annotations into one measure of functional importance |
| IMHOTEP | http://www.uni-kiel.de/medinfo/cgi-bin/predictor/ | 26-Sep-16 | Nine popular prediction tools (PolyPhen-2, SNPs&GO, MutPred, SIFT, MutationTaster2, Mutation Assessor and FATHMM, plus the conservation-based Grantham Score and PhyloP) integrated into a single predictor | 10,029 disease-causing SNVs from the Human Gene Mutation Database and 10,002 putatively benign nonsynonymous SNVs from UCSC | Random forest, decision tree or logistic regression |
| REVEL | https://sites.google.com/site/revelgenomics/ | 6-Oct-16 | 18 individual pathogenicity prediction scores from 13 tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP and phastCons | Human Gene Mutation Database (HGMD) version 2015.2; Exome Sequencing Project (ESP) European-American and African-American populations; Atherosclerosis Risk in Communities (ARIC) study European-American and African-American populations; 1000 Genomes Project (KGP) European, Yoruban and Asian populations. Final training set: 6,182 HGMD disease variants and 123,706 rare neutral ESVs | Random Forest |
| M-CAP | http://bejerano.stanford.edu/MCAP/ | 24-Oct-16 | Nine established pathogenicity likelihood scores (SIFT, PolyPhen-2, CADD, MutationTaster, MutationAssessor, FATHMM, LRT, MetaLR and MetaSVM); seven established measures of base-pair, amino acid, genomic region and gene conservation (RVIS, PhyloP, PhastCons, PAM250, BLOSUM62, SIPHY and GERP); 298 new features derived from multiple-sequence alignment of 99 primate, mammalian and vertebrate genomes to the human genome | HGMD Pro 2015.2 (pathogenic) and ExAC v3 (benign): 12,418 rare missense pathogenic variants and 3,137,919 rare missense benign variants | Gradient boosting tree |
| DEOGEN2 | https://deogen2.mutaframe.com/ | 26-Apr-17 | PROVEAN score, conservation index, mutant/wild-type log-odds ratio, early folding predictions (EF), PFAM log-odds score (PF), interaction patch annotation (INT), RVIS, GDI, recessiveness index (REC), gene essentiality (ESS), pathway log-odds score | February 2016 version of Humsavar; 27,606 deleterious SNVs and 38,285 neutral SNVs retained | scikit-learn Random Forest classifier with 200 trees |
| MutPred | http://mutpred.mutdb.org/ | 9-May-17 | 1,345 features (including 20 optional) in six groups: (1) sequence-based, (2) substitution-based, (3) position-specific scoring matrix-based, (4) conservation-based, (5) homolog profiles (optional because of the time needed to compute them), (6) changes in predicted structural and functional properties | 53,180 pathogenic and 206,946 unlabeled (putatively neutral) variants from the Human Gene Mutation Database (HGMD), SwissVar, dbSNP and inter-species pairwise alignment | Bagged ensemble of 30 feed-forward neural networks |
| ALoFT | http://aloft.gersteinlab.org/ | 29-Aug-17 | 108 features; the main ones are (1) functional domain annotations, (2) evolutionary conservation, (3) biological networks | Three classes of premature stop variants: benign, dominant disease-causing and recessive disease-causing. The benign set consists of homozygous premature stop variants found in a cohort of 1,092 healthy people (1000 Genomes Phase 1). The two disease classes are homozygous premature stop mutations from HGMD that lead to recessive disease and heterozygous premature stop variants in haploinsufficient genes that lead to dominant disease | Random forest |
| MVP | https://github.com/ShenLab/missense | 2-Feb-18 | 38 features in the constrained model, 21 features in the non-constrained model | Positives: 22,390 missense mutations from HGMD Pro version 2013 under the disease mutation (DM) category, 12,875 deleterious variants from UniProt and 4,424 pathogenic variants from ClinVar; 32,074 unique positive training variants in total. Negatives: 5,190 neutral variants from UniProt, 42,415 randomly selected rare variants from the DiscovEHR database and 39,593 observed human-derived variants; 86,620 unique negative training variants in total | Deep residual neural network (ResNet) |
| ClinPred | https://sites.google.com/site/clinpred/ | 13-Sep-18 | 16 individual prediction scores from SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, LRT, MutationAssessor, PROVEAN, CADD, GERP, DANN, PhastCons, fitCons, PhyloP and SiPhy; allele frequencies (AFs) of each variant in different populations from the gnomAD database | ClinVar database dated January 2016; 11,082 variants, of which 7,059 labeled benign and 4,023 labeled pathogenic | Random forest (cforest) and gradient boosted decision tree (xgboost) |
| PrimateAI | https://github.com/Illumina/PrimateAI | 17-Dec-18 | Network of 36 convolutional layers in total, with protein structure included, and roughly 400,000 trainable parameters | Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD); ~380,000 common missense variants from humans and six non-human primate species, using a semi-supervised benign vs unlabeled training regimen | Deep neural networks |

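The summary shows ensemble tools moving from traditional machine learning toward the now-popular deep learning, with new tools built on different features and training sets reported every year. It also shows that the accuracy of these predictors fluctuates, and a number of papers have benchmarked them [4,5,6]. One likely cause of the fluctuation is heterogeneity among variant sites. To reduce that heterogeneity and raise accuracy, one current direction is to build predictors around a specific disease, gene, or pathway; the next section covers a recently published disease-specific predictor.

Whichever tool is used, keep the ACMG standards and guidelines for variant classification in mind. They state that, in interpretation, the combined predictions of different in silico tools count as a single piece of evidence rather than as mutually independent evidence. Because each tool has its own strengths and weaknesses depending on its algorithm, using several tools for sequence variant interpretation is still recommended, and in many cases predictive performance varies with the gene and protein sequence. These results are only predictions, so they should be applied to variant interpretation with care, and the guidelines advise against using them as the sole source of evidence for a clinical judgment.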
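The "single piece of evidence" rule is easy to get wrong when tallying criteria, so here is a small Python sketch of one way to encode it. PP3 and BP4 are the computational-evidence codes of the ACMG/AMP framework; everything else is an assumption made for illustration: the `"damaging"`/`"benign"` labels, the requirement of at least two tools, and the choice to count evidence only when every tool agrees.

```python
def in_silico_evidence(calls):
    """Collapse per-tool calls into at most one ACMG evidence code.

    calls maps a tool name to "damaging" or "benign"; tools that gave
    no call should be left out by the caller.
    """
    if len(calls) < 2:
        return None  # a single tool is not "multiple lines" of evidence
    verdicts = set(calls.values())
    if verdicts == {"damaging"}:
        return "PP3"  # supporting evidence for pathogenicity
    if verdicts == {"benign"}:
        return "BP4"  # supporting evidence for a benign effect
    return None  # tools disagree: count no computational evidence


agree = {"SIFT": "damaging", "PolyPhen-2": "damaging", "REVEL": "damaging"}
mixed = {"SIFT": "damaging", "PolyPhen-2": "benign", "REVEL": "damaging"}

print(in_silico_evidence(agree))  # PP3
print(in_silico_evidence(mixed))  # None
```

However many tools agree, the function returns one code at most, so three concordant predictors never add up to three separate pieces of evidence.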

References
  1. Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T., and McVean, G.A.; 1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65.
  2. Tennessen, J.A., Bigham, A.W., O'Connor, T.D., Fu, W., Kenny, E.E., Gravel, S., McGee, S., Do, R., Liu, X., Jun, G., et al.; Broad GO; Seattle GO; NHLBI Exome Sequencing Project (2012). Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64-69.
  3. Dong, C., Wei, P., Jian, X., Gibbs, R., Boerwinkle, E., Wang, K., and Liu, X. (2015). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human Molecular Genetics 24(8), 2125-2137.
  4. Korvigo, I., Afanasyev, A., Romashchenko, N., et al. (2017). Generalising better: applying deep learning to integrate deleteriousness prediction scores for whole-exome SNV studies. bioRxiv, 126532.
  5. Mahmood, K., Jung, C., Philip, G., et al. (2017). Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics. Human Genomics 11(1), 10.
  6. Zhou, Y., Fujikura, K., Mkrtchian, S., et al. (2018). Computational methods for the pharmacogenetic interpretation of next generation sequencing data. Frontiers in Pharmacology 9, 1437.