Using Picard

https://broadinstitute.github.io/picard/picard-metric-definitions.html
https://broadinstitute.github.io/picard/index.html
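Picard is a set of command-line tools for working with high-throughput sequencing data and related formats such as SAM/BAM and VCF. For the file formats, see the hts-specs repository, in particular the SAM specification and the VCF specification.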

Usage:

java jvm-args -jar picard.jar PicardToolName OPTION1=value1 OPTION2=value2...

All tools (metrics classes)

AlignmentSummaryMetrics: alignment statistics for a SAM/BAM file, produced by CollectAlignmentSummaryMetrics and written to a file with the extension ".alignment_summary_metrics".
BaseDistributionByCycleMetrics: (no description)
ClusteredCrosscheckMetric: results of clustered fingerprint crosschecking.
CollectHiSeqXPfFailMetrics.PFFailDetailedMetric: a metric class for describing PF-failing reads from an Illumina HiSeqX lane.
CollectHiSeqXPfFailMetrics.PFFailSummaryMetric: Metrics produced by the GetHiSeqXPFFailMetrics program.
CollectOxoGMetrics.CpcgMetrics: Metrics class for outputs.
CollectQualityYieldMetrics.QualityYieldMetrics: a set of metrics describing the general quality of a BAM file.
CollectRawWgsMetrics.RawWgsMetrics: (no description)
CollectVariantCallingMetrics.VariantCallingDetailMetrics: SNP- and indel-related metrics for a VCF file, reported per sample.
CollectVariantCallingMetrics.VariantCallingSummaryMetrics: the same SNP- and indel-related metrics, summarized over the whole VCF file.
CollectWgsMetrics.WgsMetrics: metrics for evaluating whole-genome sequencing results.
CollectWgsMetricsWithNonZeroCoverage.WgsMetricsWithNonZeroCoverage: same as above.
CrosscheckMetric: results of fingerprint crosschecking.
DuplicationMetrics: metrics computed while marking duplicates in a SAM file.
ErrorSummaryMetrics: summary metrics computed by CollectSequencingArtifactMetrics, giving the error rate for each type of base change.
ExtractIlluminaBarcodes.BarcodeMetric: metrics computed by ExtractIlluminaBarcodes, which analyzes the data in the Basecalling directory to determine which barcode each read belongs to.
FingerprintingDetailMetrics: detailed metrics about an individual SNP/haplotype comparison within a fingerprint comparison.
FingerprintingSummaryMetrics: summary fingerprinting metrics and statistics from the comparison of sequence data.
GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
GcBiasMetrics: (no description)
GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
GenotypeConcordanceContingencyMetrics: Class that holds metrics about the Genotype Concordance contingency tables.
GenotypeConcordanceDetailMetrics: Class that holds detail metrics about Genotype Concordance.
GenotypeConcordanceSummaryMetrics: Class that holds summary metrics about Genotype Concordance.
HsMetrics: Metrics generated by CollectHsMetrics for the analysis of target-capture sequencing experiments.
IlluminaBasecallingMetrics: Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
IlluminaLaneMetrics: Embodies characteristics that describe a lane.
IlluminaPhasingMetrics: Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
IndependentReplicateMetric: A class to store information relevant for biological rate estimation.
InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
JumpingLibraryMetrics: High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
MendelianViolationMetrics: Describes the type and number of mendelian violations found within a Trio.
MergeableMetricBase: An extension of MetricBase that knows how to merge-by-adding fields that are appropriately annotated.
MultilevelMetrics: (no description)
RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
RrbsCpgDetailMetrics: Holds information about CpG sites encountered for RRBS processing QC.
RrbsSummaryMetrics: Holds summary statistics from RRBS processing QC.
SequencingArtifactMetrics.BaitBiasDetailMetrics: Bait bias artifacts broken down by context.
SequencingArtifactMetrics.BaitBiasSummaryMetrics: Summary analysis of a single bait bias artifact, also known as a reference bias artifact.
SequencingArtifactMetrics.PreAdapterDetailMetrics: Pre-adapter artifacts broken down by context.
SequencingArtifactMetrics.PreAdapterSummaryMetrics: Summary analysis of a single pre-adapter artifact.
TargetedPcrMetrics: Metrics class for the analysis of reads obtained from targeted PCR experiments e.g.
UmiMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords using the UmiAwareDuplicateSetIterator.
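
All of these metrics classes end up in the same kind of text file: comment lines starting with "#", a "## METRICS CLASS" line naming the class, then a tab-separated header row followed by one or more data rows, and optionally a "## HISTOGRAM" section. Below is a minimal sketch of a reader for the metrics table, assuming that layout. The function name read_picard_metrics and the sample values are made up for illustration.

```python
import io


def read_picard_metrics(handle):
    """Return (metrics_class, rows) for the metrics table of a Picard metrics file.

    rows is a list of dicts mapping column name -> string value.
    """
    metrics_class = None
    header = None
    rows = []
    in_metrics = False
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("## METRICS CLASS"):
            # The class name follows the marker, separated by a tab.
            metrics_class = line.split("\t")[1]
            in_metrics = True
            continue
        if not in_metrics:
            continue  # still in the leading comment lines
        if not line.strip() or line.startswith("#"):
            break  # a blank line or the next section ends the table
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, fields)))
    return metrics_class, rows


# Illustrative file content only; real files have many more columns.
example = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# CollectHsMetrics INPUT=input.bam\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_SET\tPF_READS\tPCT_SELECTED_BASES\n"
    "bait\t1000\t0.85\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Integer\n"
    "coverage\tcount\n"
    "0\t12\n"
)
metrics_class, rows = read_picard_metrics(io.StringIO(example))
print(metrics_class, rows[0]["PCT_SELECTED_BASES"])
```

Values come back as strings, so convert them with int() or float() before doing arithmetic.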

Detailed functionality

CollectHsMetrics:

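Collects metrics for targeted (hybrid-selection) sequencing.

This command reads a SAM/BAM file. Hybrid selection (HS) is a commonly used technique for targeted sequencing, for example exome capture. For more information, see the GATK Dictionary entry.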

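The command requires:
1) The alignment results (SAM/BAM).
2) The capture intervals (provided by the manufacturer of the capture kit). If the intervals are in BED format, convert them with the BedToIntervalList tool to the interval_list format that Picard expects.
3) Optionally, a reference sequence. If one is given, the AT_DROPOUT and GC_DROPOUT metrics are also calculated.
These two metrics matter because regions with very high or very low GC content are harder to sequence accurately, so fewer reads align to them and both alignment efficiency and coverage drop.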
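
To make the BED to interval_list step in 2) concrete, here is a rough sketch of the coordinate change. It is not a replacement for BedToIntervalList: a real interval_list also starts with a SAM-style sequence dictionary header, which this sketch leaves out. BED uses 0-based, half-open coordinates, while interval_list rows are 1-based and inclusive, with the columns sequence, start, end, strand, name. The function name bed_to_interval_rows is made up for illustration.

```python
def bed_to_interval_rows(bed_lines):
    """Convert BED lines to interval_list body rows (no header).

    Returns tuples of (sequence, start, end, strand, name) with
    1-based inclusive coordinates.
    """
    rows = []
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # BED start is 0-based; interval_list start is 1-based. The end stays the same.
        one_based_start = start + 1
        name = fields[3] if len(fields) > 3 else f"{chrom}:{one_based_start}-{end}"
        strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "+"
        rows.append((chrom, one_based_start, end, strand, name))
    return rows


for row in bed_to_interval_rows(["chr1\t99\t200\texon1\t0\t-", "chr2\t0\t50"]):
    print("\t".join(str(value) for value in row))
```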

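You can use PER_TARGET_COVERAGE to get per-target information such as GC content and sequencing depth for each capture interval.
Metrics labeled pct are expressed as fractions (0 to 1), not percentages.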

java -jar picard.jar CollectHsMetrics \
      I=input.bam \
      O=hs_metrics.txt \
      R=reference_sequence.fasta \
      BAIT_INTERVALS=bait.interval_list \
      TARGET_INTERVALS=target.interval_list
 # BAIT_INTERVALS may be the same file as TARGET_INTERVALS, e.g. when no separate bait coordinates are available (I am not fully clear on this yet)
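
If you run this from a pipeline script, the same command can be assembled as an argument list. This is only a sketch: the helper name collect_hs_metrics_cmd is made up, the option names are the ones used in the command above, and picard.jar is assumed to be in the working directory.

```python
def collect_hs_metrics_cmd(bam, out, bait, target, reference=None, jar="picard.jar"):
    """Build the CollectHsMetrics command line as a list of arguments."""
    cmd = [
        "java", "-jar", jar, "CollectHsMetrics",
        f"I={bam}",
        f"O={out}",
        f"BAIT_INTERVALS={bait}",
        f"TARGET_INTERVALS={target}",
    ]
    if reference is not None:
        # The reference is optional; it is needed for AT_DROPOUT and GC_DROPOUT.
        cmd.append(f"R={reference}")
    return cmd


cmd = collect_hs_metrics_cmd(
    "input.bam", "hs_metrics.txt",
    "bait.interval_list", "target.interval_list",
    reference="reference_sequence.fasta",
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```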

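Difference between bait and target:
Bait coverage is computed with almost no reads removed, so it gives a direct picture of how well the wet-lab capture worked. Target coverage is computed after discarding many bases, because those bases contribute little to variant calling. See the descriptions of the PCT_EXC_* metrics for why so much is filtered out when computing target metrics. Most of the filters can be adjusted through parameters.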

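For a detailed description of the output, see the CollectHsMetrics documentation.
The metrics reported by CollectHsMetrics fall into three categories.
1) Basic sequencing metrics, which are used to compute the other metrics, e.g. genome size, total number of reads, and number of aligned reads.
bait_set: name of the bait set used for the hybrid selection.
bait_territory: number of bases covered by one or more baits.
target_territory: number of unique bases in the target regions.
bait_design_efficiency: design efficiency, i.e. target_territory / bait_territory. A value of 1 means a perfectly efficient design; 0.5 means half of the bait bases lie outside the target regions.
PF_READS: total number of reads that pass the vendor's filter.
PF_BASES_ALIGNED: number of unique bases that pass the filter (PF) and align to the genome with mapping score > 0.
on_bait_bases: number of PF_BASES_ALIGNED bases that map to a bait region of the genome.
genome_size: number of bases in the reference genome used for alignment.
total_reads: total number of reads in the SAM file.
pf_reads: total number of reads that pass the platform/vendor quality filter.
pf_bases: number of bases in the PF_READS.
pf_unique_reads: number of PF reads that are not marked as duplicates.
pf_uq_reads_aligned: number of unique PF reads that align to the reference (a count, not a fraction).
pf_bases_aligned: total number of aligned bases.
pf_uq_bases_aligned: total number of bases in the unique aligned reads.
on_target_bases: total number of bases that align to a target region.
pct_pf_reads: fraction of reads off the sequencer that pass the filter.
pct_pf_uq_reads: fraction of reads off the sequencer that pass the filter and are not duplicates.
pct_pf_uq_reads_aligned: fraction of PF reads that are non-duplicate and align to the reference.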
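
Several of the fractions above are simple ratios of the counts, which is handy for sanity-checking a metrics file. A small sketch, using the upper-case column names that Picard writes in the file header. The helper name basic_ratios and the input numbers are made up for illustration.

```python
def basic_ratios(m):
    """Recompute a few ratio metrics from the raw counts in a metrics row."""
    return {
        # 1.0 means every bait base lies inside a target region.
        "BAIT_DESIGN_EFFICIENCY": m["TARGET_TERRITORY"] / m["BAIT_TERRITORY"],
        "PCT_PF_READS": m["PF_READS"] / m["TOTAL_READS"],
        "PCT_PF_UQ_READS": m["PF_UNIQUE_READS"] / m["TOTAL_READS"],
    }


# Hypothetical counts, not taken from a real run.
counts = {
    "TARGET_TERRITORY": 500,
    "BAIT_TERRITORY": 1000,
    "TOTAL_READS": 1000,
    "PF_READS": 900,
    "PF_UNIQUE_READS": 800,
}
print(basic_ratios(counts))
```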

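2) Experiment quality metrics, e.g. the number or fraction of bases aligned on, near, and off the baits, the fold-80 base penalty, the estimated capture library size, and the HS penalties. These are computed before filters such as low mapping quality, low base quality, and duplicate reads are applied.
near_bait_bases: number of aligned bases that fall within a fixed distance of a bait but not on the bait itself.
off_bait_bases: number of aligned bases that fall neither on nor near a bait.
pct_selected_bases: (near_bait_bases + on_bait_bases) / PF_BASES_ALIGNED.
pct_off_bait: off_bait_bases / PF_BASES_ALIGNED.
on_bait_vs_selected: fraction of the on-or-near-bait bases that fall on a bait, i.e. on_bait_bases / (on_bait_bases + near_bait_bases).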
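
The three bait ratios can be written out directly from the counts. A minimal sketch, assuming on_bait_vs_selected is ON_BAIT_BASES / (ON_BAIT_BASES + NEAR_BAIT_BASES) as in the Picard metric definitions. The function name selection_ratios and the numbers are made up.

```python
def selection_ratios(on_bait, near_bait, off_bait, pf_bases_aligned):
    """Return the bait selection ratios as fractions between 0 and 1."""
    selected = on_bait + near_bait
    return {
        "PCT_SELECTED_BASES": selected / pf_bases_aligned,
        "PCT_OFF_BAIT": off_bait / pf_bases_aligned,
        "ON_BAIT_VS_SELECTED": on_bait / selected,
    }


# Hypothetical counts: 600 on-bait, 200 near-bait, 200 off-bait bases.
print(selection_ratios(600, 200, 200, 1000))
```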

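fold_80_base_penalty: a measure of coverage uniformity. It is the fold of additional sequencing needed to bring 80% of the bases in targets with non-zero coverage up to the mean coverage. Lower is better; the ideal value is 1.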
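
One common way to think about this value is mean coverage divided by the coverage that 80% of bases reach (the 20th percentile). The sketch below follows that reading. It is an approximation for building intuition, not Picard's exact implementation, and the function name and depths are made up.

```python
def fold_80_base_penalty(depths):
    """Mean depth divided by the depth reached by at least 80% of bases.

    depths: per-base coverage values over targets with non-zero coverage.
    """
    ordered = sorted(depths)
    mean_cov = sum(ordered) / len(ordered)
    # At least 80% of bases have a depth >= the value at this index.
    p20 = ordered[len(ordered) * 20 // 100]
    if p20 == 0:
        return float("inf")  # no amount of extra sequencing is implied by a zero depth
    return mean_cov / p20


print(fold_80_base_penalty([5] * 10))  # perfectly uniform coverage
print(fold_80_base_penalty([2, 4, 6, 8, 10, 10, 10, 10, 10, 30]))  # uneven coverage
```

With perfectly uniform coverage the mean equals the 20th percentile, so the penalty is 1. The more uneven the coverage, the larger the value.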

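hs_library_size: estimated number of unique library fragments that were captured.
hs_penalty_10x: the capture penalty for bringing 80% of target bases to 10X. In other words, with a 10 Mb target design, to reach 10X coverage you need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
hs_penalty_20x: the same penalty for bringing 80% of target bases to 20X coverage.
hs_penalty_30x: same, for 30X.
hs_penalty_40x: same, for 40X.
hs_penalty_50x: same, for 50X.
hs_penalty_100x: same, for 100X.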
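
The formula above is easy to turn into a sequencing budget. A small sketch; the function name is made up and the penalty value 4.5 is a made-up example, not a typical or recommended number.

```python
def required_aligned_bases(target_size, desired_coverage, hs_penalty):
    """PF aligned bases needed so that 80% of target bases reach the desired coverage."""
    return target_size * desired_coverage * hs_penalty


# 10 Mb of target, 10X desired, hypothetical HS_PENALTY_10X of 4.5
needed = required_aligned_bases(10**7, 10, 4.5)
print(f"{needed:.0f} aligned bases")
```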

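3) Target coverage metrics, which assess how reliable the data will be for downstream analysis, e.g. mean coverage of the target regions, the fraction of bases at various coverage levels, and the fraction of bases removed by each filter. These are computed after all filters have been applied.
mean_bait_coverage: mean coverage over all bait positions.
pct_usable_bases_on_bait: fraction of the available PF bases that are aligned, de-duplicated, and on a bait.
pct_usable_bases_on_target: fraction of the available PF bases that are aligned, de-duplicated, and on a target.
fold_enrichment: the fold by which the baited region has been enriched above the genomic background.
mean_target_coverage: mean coverage of the target regions.
median_target_coverage: median coverage of the target regions.
max_target_coverage: maximum coverage of the target regions.
min_target_coverage: minimum coverage of the target regions.
zero_cvg_targets_pct: fraction of targets in which no base reached a coverage of 1.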
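
To see how these summary values relate to each other, here is a sketch that derives them from per-base depths grouped by target. It assumes the statistics are taken over all target bases, which is my reading of the definitions rather than something checked against the Picard source. The function name and the depths are made up.

```python
from statistics import mean, median


def target_coverage_summary(targets):
    """targets: dict mapping target name -> list of per-base depths."""
    all_depths = [d for depths in targets.values() for d in depths]
    # A target counts as zero-coverage when none of its bases reached depth 1.
    zero_targets = sum(1 for depths in targets.values() if max(depths) < 1)
    return {
        "MEAN_TARGET_COVERAGE": mean(all_depths),
        "MEDIAN_TARGET_COVERAGE": median(all_depths),
        "MAX_TARGET_COVERAGE": max(all_depths),
        "MIN_TARGET_COVERAGE": min(all_depths),
        "ZERO_CVG_TARGETS_PCT": zero_targets / len(targets),
    }


print(target_coverage_summary({"t1": [10, 20, 30], "t2": [0, 0, 0]}))
```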

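Fraction of aligned bases removed by each filter:
pct_exc_dupe: bases in reads marked as duplicates.
pct_exc_adapter: bases excluded as adapter sequence.
pct_exc_mapq: bases in reads with low mapping quality.
pct_exc_baseq: bases with low base quality.
pct_exc_overlap: bases that are the second observation from an insert with overlapping reads. When the two mates of a pair overlap, the overlapping bases are counted only once.
pct_exc_off_target: bases that align outside the target regions.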

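Fraction of target bases at each coverage level:
pct_target_bases_1x: fraction of all target bases with coverage of at least 1X.
pct_target_bases_2x: same, for at least 2X.
pct_target_bases_10x: same, for at least 10X.
pct_target_bases_20x: same, for at least 20X.
pct_target_bases_30x: same, for at least 30X.
pct_target_bases_40x: same, for at least 40X.
pct_target_bases_50x: same, for at least 50X.
pct_target_bases_100x: same, for at least 100X.
at_dropout: how under-covered the low-GC regions (GC <= 50%) are relative to the mean. For each GC bin, the share of target territory is compared with the share of aligned reads, and the shortfalls are summed. A value of 5% means that 5% of the total reads that should have mapped to GC <= 50% regions mapped elsewhere.
gc_dropout: the same measure for the high-GC regions (GC >= 50%).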
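
Both groups of metrics can be illustrated with a few lines of code. The first function computes the fraction of target bases at or above a depth threshold. The second follows the dropout description in the Picard metric definitions: for each GC bin, compare the share of target territory with the share of aligned bases and sum the shortfalls. This is a sketch of that description, not Picard's own code, and the function names and numbers are made up.

```python
def pct_target_bases_at(depths, min_depth):
    """Fraction of target bases with coverage >= min_depth."""
    return sum(1 for d in depths if d >= min_depth) / len(depths)


def at_gc_dropout(target_bases_by_gc, aligned_bases_by_gc):
    """Return (AT_DROPOUT, GC_DROPOUT) in percent.

    Both arguments map a GC percentage bin (0..100) to a base count.
    """
    total_target = sum(target_bases_by_gc.values())
    total_aligned = sum(aligned_bases_by_gc.values())
    at_dropout = gc_dropout = 0.0
    for gc_bin, target_bases in target_bases_by_gc.items():
        target_share = target_bases / total_target
        aligned_share = aligned_bases_by_gc.get(gc_bin, 0) / total_aligned
        shortfall = (target_share - aligned_share) * 100
        if shortfall > 0:  # this bin received fewer reads than its territory share
            if gc_bin <= 50:
                at_dropout += shortfall
            if gc_bin >= 50:
                gc_dropout += shortfall
    return at_dropout, gc_dropout


depths = [0, 1, 5, 10, 20]
print(pct_target_bases_at(depths, 1), pct_target_bases_at(depths, 10))

# Half the territory is at 30% GC but it only received 40% of the aligned bases.
print(at_gc_dropout({30: 500, 70: 500}, {30: 400, 70: 600}))
```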

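het_snp_sensitivity: theoretical sensitivity for detecting HET SNPs.
het_snp_q: Phred-scaled Q score of the theoretical HET SNP sensitivity.
sample: the sample these metrics apply to.
library: the library these metrics apply to.
read_group: the read group these metrics apply to.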

Last edited: 2020.11.26 10:33:13
