https://broadinstitute.github.io/picard/picard-metric-definitions.html
https://broadinstitute.github.io/picard/index.html
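Picard is a set of command-line tools for manipulating high-throughput sequencing (HTS) data and related formats such as SAM/BAM and VCF. For the file formats, see the Hts-specs repository, the SAM specification, and the VCF specification.
Usage: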
java jvm-args -jar picard.jar PicardToolName OPTION1=value1 OPTION2=value2...
All metric classes
- AlignmentSummaryMetrics: Alignment summary statistics for a SAM/BAM file, produced by CollectAlignmentSummaryMetrics and usually written to a file with the extension .alignment_summary_metrics.
- ClusteredCrosscheckMetric: Results of clustered fingerprint crosschecking.
- CollectHiSeqXPfFailMetrics.PFFailDetailedMetric: A metric class for describing PF-failing reads from an Illumina HiSeqX lane.
- CollectHiSeqXPfFailMetrics.PFFailSummaryMetric: Metrics produced by the GetHiSeqXPFFailMetrics program.
- CollectOxoGMetrics.CpcgMetrics: Metrics class for outputs.
- CollectQualityYieldMetrics.QualityYieldMetrics: A set of metrics describing the overall quality of a BAM file.
- CollectVariantCallingMetrics.VariantCallingDetailMetrics: SNP- and indel-related metrics for a given VCF file.
- CollectVariantCallingMetrics.VariantCallingSummaryMetrics: Same as above.
- CollectWgsMetrics.WgsMetrics: Metrics for evaluating whole-genome sequencing results.
- CollectWgsMetricsWithNonZeroCoverage.WgsMetricsWithNonZeroCoverage: Same as above.
- CrosscheckMetric: Results of fingerprint crosschecking.
- DuplicationMetrics: Metrics computed while marking duplicates in a SAM file.
- ErrorSummaryMetrics: Summary metrics computed by CollectSequencingArtifactMetrics, giving the error rate for each type of base change.
- ExtractIlluminaBarcodes.BarcodeMetric: Metrics computed by ExtractIlluminaBarcodes, which analyzes the data in the Basecalling directory to determine which barcode each read belongs to.
- FingerprintingDetailMetrics: Detailed metrics about the comparison of an individual SNP/haplotype within a fingerprint.
- FingerprintingSummaryMetrics: Summary fingerprinting metrics and statistics about the comparison of the sequence data.
- GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
- GcBiasMetrics:
- GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
- GenotypeConcordanceContingencyMetrics: Class that holds metrics about the Genotype Concordance contingency tables.
- GenotypeConcordanceDetailMetrics: Class that holds detail metrics about Genotype Concordance.
- GenotypeConcordanceSummaryMetrics: Class that holds summary metrics about Genotype Concordance.
- HsMetrics: Metrics generated by CollectHsMetrics for the analysis of target-capture sequencing experiments.
- IlluminaBasecallingMetrics: Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
- IlluminaLaneMetrics: Embodies characteristics that describe a lane.
- IlluminaPhasingMetrics: Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
- IndependentReplicateMetric: A class to store information relevant for biological rate estimation.
- InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
- JumpingLibraryMetrics: High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
- MendelianViolationMetrics: Describes the type and number of mendelian violations found within a Trio.
- MergeableMetricBase: An extension of MetricBase that knows how to merge-by-adding fields that are appropriately annotated.
- MultilevelMetrics:
- RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
- RrbsCpgDetailMetrics: Holds information about CpG sites encountered for RRBS processing QC.
- RrbsSummaryMetrics: Holds summary statistics from RRBS processing QC.
- SequencingArtifactMetrics.BaitBiasDetailMetrics: Bait bias artifacts broken down by context.
- SequencingArtifactMetrics.BaitBiasSummaryMetrics: Summary analysis of a single bait bias artifact, also known as a reference bias artifact.
- SequencingArtifactMetrics.PreAdapterDetailMetrics: Pre-adapter artifacts broken down by context.
- SequencingArtifactMetrics.PreAdapterSummaryMetrics: Summary analysis of a single pre-adapter artifact.
- TargetedPcrMetrics: Metrics class for the analysis of reads obtained from targeted pcr experiments e.g.
- UmiMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords using the UmiAwareDuplicateSetIterator.
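Detailed usage
CollectHsMetrics:
Collects metrics for hybrid-selection (HS) targeted sequencing.
The tool reads a SAM/BAM file. Hybrid selection (hybrid capture) is a widely used technique for targeted sequencing, for example exome capture; see the GATK Dictionary entry for more information.
The tool requires:
1) The alignments (SAM/BAM).
2) The capture interval information (provided by the capture kit manufacturer). If the intervals are in BED format, convert them with the BedToIntervalList tool to the interval_list format that Picard expects.
3) Optionally, a reference sequence. If one is supplied, the AT_DROPOUT and GC_DROPOUT metrics are also computed.
These two metrics exist because regions with very high or very low GC content have higher sequencing error rates, so fewer reads align to them and coverage drops.
Use PER_TARGET_COVERAGE to obtain per-target information such as GC content and sequencing depth.
Metrics labeled pct are expressed as fractions, not percentages.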
java -jar picard.jar CollectHsMetrics \
I=input.bam \
O=hs_metrics.txt \
R=reference_sequence.fasta \
BAIT_INTERVALS=bait.interval_list \
TARGET_INTERVALS=target.interval_list
# BAIT_INTERVALS may be the same file as TARGET_INTERVALS (though I do not fully understand this yet).
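Both interval files in the command above use the interval_list format. As a rough illustration of what the BED conversion involves, the sketch below converts a single BED record into an interval_list body record. It assumes the usual conventions (BED is 0-based and half-open, interval_list is 1-based and closed, with columns chromosome, start, end, strand, name). The function name `bed_to_interval_line` and the placeholder name "." are made up for this example, and the sketch does not write the SAM-style sequence dictionary header that a real interval_list file needs, so it is not a substitute for BedToIntervalList.

```python
def bed_to_interval_line(bed_line: str) -> str:
    """Convert one tab-separated BED record to one interval_list body record.

    Hypothetical helper for illustration only; the header is not handled.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Optional BED columns: name is column 4, strand is column 6.
    name = fields[3] if len(fields) > 3 else "."  # "." is an assumed placeholder
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "+"
    # 0-based half-open [start, end) becomes 1-based closed [start + 1, end].
    return "\t".join([chrom, str(start + 1), str(end), strand, name])


converted = bed_to_interval_line("chr1\t99\t200\texon1\t0\t-")
print(converted)
```

Only the start coordinate shifts; the end stays the same because an exclusive 0-based end and an inclusive 1-based end refer to the same base.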
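Bait vs. target:
Bait coverage metrics are computed with very few reads filtered out, so they give a direct view of how well the wet-lab protocol performed. Target coverage metrics, in contrast, exclude many bases because those bases contribute little to variant calling. The descriptions of the PCT_EXC_* metrics explain why so many bases are filtered out when computing target coverage. Most of the filters can be adjusted through command-line parameters.
For detailed descriptions of the output, see the CollectHsMetrics documentation.
CollectHsMetrics reports three categories of metrics.
1) Basic sequencing metrics, which are used to derive the other metrics, such as the genome size, the total number of reads, and the number of aligned reads.
bait_set: name of the bait set used in the hybrid selection.
bait_territory: number of bases that lie on one or more baits.
target_territory: number of unique bases covered by the target regions.
bait_design_efficiency: design efficiency, target_territory / bait_territory. A value of 1 indicates a perfect design; 0.5 means that half of the baited bases fall outside the target regions.
PF_READS: total number of reads that pass the vendor's filter.
PF_BASES_ALIGNED: number of PF unique bases aligned to the genome with mapping score > 0.
on_bait_bases: number of PF_BASES_ALIGNED bases that map to a baited region of the genome.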
genome_size
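total_reads: total number of reads in the SAM file.
pf_reads: total number of reads that pass the platform/vendor quality filter.
pf_bases: number of bases in the PF_READS.
pf_unique_reads: number of PF reads that are not marked as duplicates.
pf_uq_reads_aligned: number of PF unique reads that align to the reference (a count, not a fraction).
pf_bases_aligned: total number of aligned bases.
pf_uq_bases_aligned: total number of bases in the aligned unique reads.
on_target_bases: total number of aligned bases that map to a target region.
pct_pf_reads: fraction of all sequenced reads that pass the quality filter.
pct_pf_uq_reads: fraction of all sequenced reads that pass the quality filter and are not duplicates.
pct_pf_uq_reads_aligned: fraction of the reads passing the quality filter that are non-duplicate and aligned to the reference.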
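The ratios in this group can be recomputed from the raw counts in the metrics file. The sketch below is a minimal reader, assuming the usual layout of a Picard metrics file as I understand it: comment lines starting with `#`, a `## METRICS CLASS` line, a tab-separated header row (column names are upper-case in the file), then one or more value rows ended by a blank line. The function name `read_picard_metrics` and all numbers in the toy file are invented for illustration.

```python
import io


def read_picard_metrics(handle):
    """Return the metrics rows of a Picard-style metrics file as a list of dicts."""
    lines = [line.rstrip("\n") for line in handle]
    # The metrics table starts right after the "## METRICS CLASS" line.
    start = next(i for i, line in enumerate(lines)
                 if line.startswith("## METRICS CLASS")) + 1
    header = lines[start].split("\t")
    rows = []
    for line in lines[start + 1:]:
        if not line.strip():  # a blank line ends the table
            break
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


# Toy file with made-up values; a real file has many more columns.
example = io.StringIO(
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# (command line omitted)\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_TERRITORY\tTARGET_TERRITORY\tTOTAL_READS\tPF_READS\n"
    "2000\t1500\t1000\t950\n"
    "\n"
)
row = read_picard_metrics(example)[0]
pct_pf_reads = int(row["PF_READS"]) / int(row["TOTAL_READS"])
bait_design_efficiency = int(row["TARGET_TERRITORY"]) / int(row["BAIT_TERRITORY"])
print(pct_pf_reads, bait_design_efficiency)
```

Values are kept as strings in the returned dicts because a metrics row mixes integers, fractions, and text fields such as the bait set name.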
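2) Experiment quality metrics, such as the number or fraction of bases that map on, near, or off the baits, the fold-80 base penalty, the HS library size, and the HS penalties. These metrics are computed before filtering for low mapping quality, low base quality, and duplicate reads.
near_bait_bases: number of aligned bases that fall within a fixed-size window around a bait but not on the bait itself.
off_bait_bases: number of aligned bases that map neither on nor near a bait.
pct_selected_bases: (near_bait_bases + on_bait_bases) / PF_BASES_ALIGNED.
pct_off_bait: off_bait_bases / PF_BASES_ALIGNED.
on_bait_vs_selected: fraction of the bases on or near baits that fall on the baits themselves, on_bait_bases / (on_bait_bases + near_bait_bases).
fold_80_base_penalty: a coverage uniformity metric. It is the fold of additional sequencing needed to raise 80% of the bases in non-zero-coverage targets to the mean coverage of those targets. Lower is better; the ideal value is 1.
hs_library_size: estimated number of unique molecules in the captured part of the library.
hs_penalty_10x: the hybrid-selection penalty incurred to get 80% of the target bases to 10X. For example, with a design of 10 Mb of target, reaching 10X coverage requires sequencing until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
hs_penalty_20x: the same penalty for getting 80% of the target bases to 20X coverage.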
hs_penalty_30x
hs_penalty_40x
hs_penalty_50x
hs_penalty_100x
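The selection ratios and the hs_penalty interpretation above are plain arithmetic, sketched below. The on_bait_vs_selected formula, on_bait_bases / (on_bait_bases + near_bait_bases), is my reading of the Picard definition. The function names and every number (base counts, a penalty of 4.5) are hypothetical values chosen for the example.

```python
def selection_ratios(on_bait_bases, near_bait_bases, off_bait_bases, pf_bases_aligned):
    """Recompute the bait selection fractions from raw base counts."""
    selected = on_bait_bases + near_bait_bases
    return {
        "pct_selected_bases": selected / pf_bases_aligned,
        "pct_off_bait": off_bait_bases / pf_bases_aligned,
        "on_bait_vs_selected": on_bait_bases / selected,
    }


def required_aligned_bases(target_territory, coverage, hs_penalty):
    """Aligned PF bases needed to bring 80% of target bases to `coverage`."""
    return target_territory * coverage * hs_penalty


ratios = selection_ratios(on_bait_bases=600, near_bait_bases=200,
                          off_bait_bases=200, pf_bases_aligned=1000)
print(ratios)

# 10 Mb of target, 10X goal, assumed HS_PENALTY_10X of 4.5 (made-up value).
needed = required_aligned_bases(10**7, 10, 4.5)
print(needed)
```

With these toy numbers, 80% of aligned bases are selected and the 10X goal would need 4.5 x 10^8 aligned bases, 4.5 times what a perfectly uniform capture would need.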
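3) Target coverage metrics, which assess how reliable the data are for downstream analysis, such as the mean target coverage, the fraction of bases at each coverage level, and the fraction of bases excluded by each filter. These metrics are computed after all filters have been applied.
mean_bait_coverage: mean coverage over all baited bases.
pct_usable_bases_on_bait: fraction of the available PF bases that are aligned, de-duplicated, and on bait.
pct_usable_bases_on_target: fraction of the available PF bases that are aligned, de-duplicated, and on target.
fold_enrichment: fold by which the baited region has been enriched above the genomic background.
mean_target_coverage: mean coverage of the target regions.
median_target_coverage: median coverage of the target regions.
max_target_coverage: maximum coverage of the target regions.
min_target_coverage: minimum coverage of the target regions.
zero_cvg_targets_pct: fraction of targets in which no base reached 1X coverage.
Fraction of bases excluded by each filter:
pct_exc_dupe: bases in reads marked as duplicates.
pct_exc_adapter: bases in reads that are adapter sequence.
pct_exc_mapq: bases in reads with low mapping quality.
pct_exc_baseq: bases with low base quality.
pct_exc_overlap: bases that were the second observation from an insert with overlapping reads (where the two reads of a pair overlap, each base is counted only once).
pct_exc_off_target: bases that map outside the target regions.
Fraction of bases at each coverage level:
pct_target_bases_1x: fraction of target bases with coverage of at least 1X.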
pct_target_bases_2x
pct_target_bases_10x
pct_target_bases_20x
pct_target_bases_30x
pct_target_bases_40x
pct_target_bases_50x
pct_target_bases_100x
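The pct_target_bases_Nx metrics are cumulative: each one is the fraction of target bases whose depth is at least N. A minimal sketch, assuming a list of per-base depths over the target that has already been through the filters (Picard derives these internally; the function name `coverage_summary` and the depth values are invented for illustration):

```python
def coverage_summary(depths, thresholds=(1, 2, 10, 20, 30, 40, 50, 100)):
    """Mean coverage and fraction of bases at or above each depth threshold."""
    n = len(depths)
    summary = {"mean_target_coverage": sum(depths) / n}
    for t in thresholds:
        # Count the bases whose depth is at least t.
        summary[f"pct_target_bases_{t}x"] = sum(d >= t for d in depths) / n
    return summary


# Ten target bases with made-up depths.
depths = [0, 1, 5, 12, 25, 25, 40, 100, 3, 9]
summary = coverage_summary(depths)
print(summary)
```

Because the thresholds are cumulative, the fractions can only stay equal or decrease as N grows, which is a quick sanity check on a real metrics file.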
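at_dropout: a measure of how under-covered the AT-rich regions (GC <= 50%) are relative to the mean coverage. For example, a value of 5% means that 5% of the total reads that should have mapped to regions with GC <= 50% mapped elsewhere.
gc_dropout: the same measure for GC-rich regions (GC >= 50%).
het_snp_sensitivity: theoretical sensitivity for detecting heterozygous (HET) SNPs.
het_snp_q: Phred-scaled Q score of the theoretical HET SNP sensitivity.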
sample
library
read_group
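As a rough illustration of the at_dropout and gc_dropout idea, the sketch below sums, over GC bins, the shortfall between each bin's share of target territory and its share of aligned reads, counting only bins where reads fall short. This is my reading of the definition, not Picard's actual code; the function name `dropout`, the dictionary inputs keyed by GC percentage, and all numbers are assumptions made for the example.

```python
def dropout(territory_by_gc, reads_by_gc, gc_bins):
    """Sum of under-coverage (in percent) over the given GC bins."""
    total_territory = sum(territory_by_gc.values())
    total_reads = sum(reads_by_gc.values())
    shortfall = 0.0
    for gc in gc_bins:
        territory_pct = 100 * territory_by_gc.get(gc, 0) / total_territory
        reads_pct = 100 * reads_by_gc.get(gc, 0) / total_reads
        # Only bins that received fewer reads than their territory share count.
        if reads_pct < territory_pct:
            shortfall += territory_pct - reads_pct
    return shortfall


# Made-up data: target bases and aligned reads per GC percentage bin.
territory_by_gc = {30: 250, 50: 500, 70: 250}
reads_by_gc = {30: 200, 50: 600, 70: 200}

at_dropout = dropout(territory_by_gc, reads_by_gc, range(0, 51))     # GC <= 50%
gc_dropout = dropout(territory_by_gc, reads_by_gc, range(50, 101))   # GC >= 50%
print(at_dropout, gc_dropout)
```

In this toy data the 30% and 70% GC bins each hold 25% of the territory but receive only 20% of the reads, so both dropout values come out at 5 (percent).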