The paper selects 12 features for classification learning and prediction. These features fall into four categories: read depth, base quality, mapping/alignment quality, and strand bias.
Category 1: Read Depth
Features in this category measure the absolute depth and the depth ratio of reads that are “effective” for a specific candidate variant. A read is “effective” if it carries the same base as the candidate variant at the candidate’s locus.
Effective Base Depth (EBD)
Effective Base Depth (EBD) is the sum of the depths contributed by effective reads. Each indel read contributes its mapping quality, while each SNV read contributes its mapping quality multiplied by its base quality.
Effective Base Depth Ratio (EBD ratio)
The EBD ratio is the EBD of one candidate variant divided by the sum of the EBDs of all candidate variants at that locus. If this ratio is very low, the candidate variant is likely to be a random error.
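As a minimal sketch of the two definitions above, the following computes EBD and the EBD ratio for the candidates at one locus. The read representation (a tuple of allele, indel flag, mapping quality, base quality) and the function names are assumptions made for illustration; they are not taken from Fuwa's code.

```python
from collections import defaultdict

def effective_base_depth(reads):
    """Sum per-read contributions for each candidate allele at one locus.

    Each read is a hypothetical tuple (allele, is_indel, mapq, baseq).
    """
    ebd = defaultdict(float)
    for allele, is_indel, mapq, baseq in reads:
        # Indel reads contribute their mapping quality; SNV reads contribute
        # mapping quality multiplied by base quality.
        ebd[allele] += mapq if is_indel else mapq * baseq
    return dict(ebd)

def ebd_ratios(ebd):
    """Divide each candidate's EBD by the total EBD at the locus."""
    total = sum(ebd.values())
    return {allele: (value / total if total else 0.0) for allele, value in ebd.items()}

# Two well-supported reads for "A" and one weak read for "T".
reads = [("A", False, 60, 30), ("A", False, 60, 30), ("T", False, 20, 10)]
ebd = effective_base_depth(reads)
ratios = ebd_ratios(ebd)
```

In this toy input the candidate "T" ends up with a small share of the total EBD, which is the situation the text associates with random errors.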
DeltaL
DeltaL is a statistic describing the difference between the optimal and suboptimal genotypes. Fuwa first hypothesizes that the variant is true, so the reads covering this locus obey an almost ideal variant model: 0/1 or 1/1. The log-likelihoods under these two ideal models are calculated separately, and the larger one is selected as L1. Then, Fuwa calculates a second log-likelihood, L2, under the alternative hypothesis that the variant is false and that reads covering this locus follow the binomial distribution model. Thus, L1 - L2, or DeltaL, is the logarithm of the ratio of the first and second likelihoods. If DeltaL is close to 0, which means the likelihoods of the ideal model and the binomial model are nearly equal, the variant is empirically judged to be a false positive; otherwise, the variant tends to be true.
Category 2: Base Quality
This category focuses on the accuracy of a base sequenced by the sequencing machine, which has considerable impact on variant calling.
Sum of Base Quality (SumBQ)
This feature is the sum of the base quality of effective reads for one candidate variant. For indel reads, this value is set to 30 empirically.
Average Base Quality (AveBQ)
By dividing SumBQ by the number of effective reads, we obtain the average base quality.
Variance of Position (VarPos)
Here, “position” means the offset of the pile-up site from the 3′ end of a read. This statistic is used because sequencing quality generally declines towards the end of a read; thus, candidate variants that are close to the 3′ end are more likely to be sequencing errors.
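The three base-quality features can be sketched as follows. The tuple layout, the function name, and the choice of population variance for VarPos are assumptions for illustration; the text does not say which variance estimator is used.

```python
from statistics import pvariance

INDEL_BASE_QUALITY = 30  # empirical value used for indel reads, per the text

def base_quality_features(reads):
    """Return (SumBQ, AveBQ, VarPos) for the effective reads of one candidate.

    Each read is a hypothetical tuple (is_indel, baseq, offset_from_3prime).
    Assumes at least one effective read.
    """
    quals = [INDEL_BASE_QUALITY if is_indel else baseq for is_indel, baseq, _ in reads]
    sum_bq = sum(quals)
    ave_bq = sum_bq / len(reads)
    # Assumed: population variance of the offsets from the 3' end.
    var_pos = pvariance([offset for _, _, offset in reads])
    return sum_bq, ave_bq, var_pos

sum_bq, ave_bq, var_pos = base_quality_features(
    [(False, 20, 10), (False, 40, 30), (True, 0, 50)]
)
```

The indel read's own base quality is ignored and replaced by the fixed value of 30, as the SumBQ definition describes.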
Category 3: Mapping/Alignment Quality
This category considers how well a read is mapped and aligned to its current locus. Mismatches lead to a higher possibility of false positives.
Average Mapping Quality (AveMQ)
The average of the mapping quality of effective reads at the candidate variant’s locus.
Worst Mapping Quality (WorMQ)
The worst mapping quality of all reads at the candidate variant’s locus.
Poor Mapping Quality Ratio (PoorMQR)
The ratio of reads with mapping quality lower than 15 at the candidate variant’s locus.
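A short sketch of the three mapping-quality features defined above, assuming the mapping qualities have already been collected into two plain lists (one for effective reads, one for all reads at the locus). The function and parameter names are hypothetical.

```python
def mapping_quality_features(effective_mapqs, all_mapqs, poor_threshold=15):
    """Return (AveMQ, WorMQ, PoorMQR) for one candidate variant's locus.

    effective_mapqs: mapping qualities of the effective reads.
    all_mapqs: mapping qualities of all reads at the locus.
    Both lists are assumed to be non-empty.
    """
    ave_mq = sum(effective_mapqs) / len(effective_mapqs)
    # Higher mapping quality is better, so the worst value is the minimum.
    wor_mq = min(all_mapqs)
    poor_mqr = sum(1 for q in all_mapqs if q < poor_threshold) / len(all_mapqs)
    return ave_mq, wor_mq, poor_mqr

ave_mq, wor_mq, poor_mqr = mapping_quality_features([60, 40], [60, 40, 10, 0])
```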
Average Alignment Score (AveAS)
The alignment score is a different metric than mapping quality, and its computing methods vary from aligner to aligner. Briefly speaking, the alignment score measures the similarity between a read and the reference genome, while mapping quality reflects the specificity that a read tends to be mapped to its current locus instead of other loci. AveAS is the average of the alignment scores of all reads at the candidate variant’s locus.
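Note: Alignment Score is a different concept from Mapping Quality. MQ is a probability-like metric computed from the quality values of the mismatched bases when a read is aligned to its current position (see the wiki entry on MQ for details), whereas the Alignment Score is a parameter that evaluates how similar the read is to the reference genome.
AveAS is then the mean Alignment Score of all reads supporting this variant site.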
Category 4: Strand Bias
This category assumes that, for a true positive, the numbers of effective reads from the positive and negative DNA strands should be approximately equal.
Variance of Strands (VarStr)
Assuming that the numbers of effective reads from positive/negative strands obey the binomial distribution, the variance can be calculated through the formula D(n) = np(1-p). If VarStr is small, it means that reads of the candidate variant cluster in one direction, suggesting a sequencing error or other false positive situations.
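Note: if sequencing does not distinguish between the positive and negative strands, each strand should have probability p = 0.5. In D(n) = np(1-p), p is the probability of the positive (or negative) strand, since the two are mutually exclusive, and D(n) is the variance of the reads supporting a given variant site. D(n) is clearly largest when both strands are equally likely to be sequenced (p = 0.5), which indicates no machine-induced strand difference. When reads from one strand clearly dominate, D(n) becomes very small; this is taken as strand bias, and the variant is more likely to be a false positive.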
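A minimal sketch of VarStr follows. It assumes that p is estimated from the observed strand counts of the effective reads (with a fixed p = 0.5 the variance would depend only on n and could not reflect strand imbalance); the text does not state how p is obtained, and the function name is hypothetical.

```python
def var_str(forward, reverse):
    """Binomial variance n * p * (1 - p) of the effective reads' strand counts."""
    n = forward + reverse
    if n == 0:
        return 0.0
    p = forward / n  # assumed: p is the observed forward-strand fraction
    return n * p * (1 - p)

balanced = var_str(10, 10)  # reads split evenly between strands
skewed = var_str(19, 1)     # reads cluster on one strand
one_sided = var_str(20, 0)  # all reads on one strand
```

With the same total of 20 reads, the balanced case gives the largest variance and the one-sided case gives zero, matching the interpretation in the text.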
Bias of Strands (BiasStr)
BiasStr is a χ² value measuring the significance of the correlation between “whether a read is effective” and the direction of the strand that the read comes from. It is calculated by using a 2 × 2 contingency table.
[Images not preserved: the 2 × 2 contingency table with cells a, b, c, d and the χ² formula] where n = a + b + c + d.
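If BiasStr is too high, which means the effective reads of the candidate variant cluster on one strand, the candidate tends to be caused by sequencing error.
Note: this feature uses a chi-square test to examine the relationship between whether a read is an “effective read” and whether it comes from the positive or negative strand. If the χ² value is large, the variant is more likely to be a false positive.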
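The formula itself is not reproduced in the text here, so the sketch below assumes the standard Pearson χ² statistic for a 2 × 2 table without continuity correction, χ² = n(ad - bc)² / ((a + b)(c + d)(a + c)(b + d)). The cell layout (rows: effective vs. non-effective reads; columns: positive vs. negative strand) is also an assumption, and Fuwa's exact definition may differ.

```python
def bias_str(a, b, c, d):
    """Pearson chi-square for an assumed 2 x 2 table:

                      positive strand   negative strand
    effective reads          a                 b
    other reads              c                 d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the statistic is undefined; return 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

no_bias = bias_str(10, 10, 10, 10)    # effective reads split evenly
strong_bias = bias_str(10, 0, 10, 20) # effective reads only on one strand
```

A value near 0 indicates that effectiveness and strand look independent, while a large value indicates that effective reads concentrate on one strand.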
A study on fast calling variants from next generation sequencing data using decision tree.