ArchR官网教程学习笔记15:ArchR的整合分析

系列回顾:
ArchR官网教程学习笔记1:Getting Started with ArchR
ArchR官网教程学习笔记2:基于ArchR推测Doublet
ArchR官网教程学习笔记3:创建ArchRProject
ArchR官网教程学习笔记4:ArchR的降维
ArchR官网教程学习笔记5:ArchR的聚类
ArchR官网教程学习笔记6:单细胞嵌入(Single-cell Embeddings)
ArchR官网教程学习笔记7:ArchR的基因评分和Marker基因
ArchR官网教程学习笔记8:定义与scRNA-seq一致的聚类
ArchR官网教程学习笔记9:ArchR的伪批量重复
ArchR官网教程学习笔记10:ArchR的call peak
ArchR官网教程学习笔记11:鉴定Marker峰
ArchR官网教程学习笔记12:Motif和Feature富集
ArchR官网教程学习笔记13:ChromVAR偏差富集
ArchR官网教程学习笔记14:ArchR的Footprinting分析

在开始之前,先要对本章笔记说明一下。本章的中间“Peak2GeneLinkage”部分里的代码,因为按照官网的代码运行总是报错,所以我去ArchR的github上直接问的developer(见https://github.com/GreenleafLab/ArchR/issues/442)。根据developer说的,是代码本身的bug,所以他们在新版ArchR里修复了这个bug。(如何安装新版的ArchR下文里有提到)所以这就是对于任何我们想学习的R包或者软件,一定要自己亲自运行一下,否则即便你使用的官网示例数据和官方代码,仍然有可能报错。

ArchR的主要优势之一是它能够集成多层次的信息,从而提供新的角度。这可以采取ATAC-seq分析的形式,如识别峰共可接近性以预测调控相互作用,或整合scRNA-seq数据的分析,如通过peak-to-gene连锁分析预测增强子活性。在任何一种情况下,ArchR都可以轻松地从你的scATAC-seq数据中获得更深入的见解。

(一)创建低重叠的细胞聚集(Creating Low-Overlapping Aggregates of Cells)

ArchR方便了许多涉及特征相关性的综合分析。用稀疏的单细胞数据进行这些计算会导致相关性分析中大量的背景信号。为了规避这一挑战,我们采用了Cicero的方法,在这些分析之前创建低重叠的单细胞聚集。为了减少偏差,我们过滤与任何其他聚合重叠度超过80%的聚合( aggregates )。为了提高这种方法的速度,我们开发了一个优化的迭代重叠检查过程的实现,并使用“Rcpp”包在c++中实现了快速的特征相关性。这些优化方法在ArchR中用于计算峰共可接近性、peak-to-gene的连锁和其他连锁分析。这些低重叠聚合的使用是在后台进行的,但为了清楚起见,我们在这里提到它。

(二)共可接近性分析

协同可接近性是指跨多个单细胞的两个峰之间的可接近性相关性。换句话说,当A峰在单细胞内可接近时,B峰通常也可接近。下面我们将直观地说明这一概念,显示enhancer E3通常与promoter P可共同接近。

关于共可接近性分析需要注意的一件事是,它通常将特定细胞类型的峰标识为共可接近的。这是因为这些峰通常在单个细胞类型中都可以接近,而在所有其他细胞类型中通常都不能被接近。这产生了很强的相关性,但并不一定意味着这些峰之间存在调控关系。

为了计算ArchR中的协同可接近性,我们使用储存在ArchRProject中协同可接近性峰信息的addCoAccessibility()函数。

> projHeme5 <- addCoAccessibility(
    ArchRProj = projHeme5,
    reducedDims = "IterativeLSI"
)

我们可以通过getCoAccessibility()函数从ArchRProject检索这个协同可接近性信息,如果returnLoops = FALSE,该函数将返回一个DataFrame对象:

> cA <- getCoAccessibility(
    ArchRProj = projHeme5,
    corCutOff = 0.5,
    resolution = 1,
    returnLoops = FALSE
)

DataFrame包含了一些重要的信息。queryHitssubjectHits列表示是相关的两个峰的索引。correlation列给出了这两个峰之间的可接近性的数值相关性:

> cA
DataFrame with 66069 rows and 11 columns
      queryHits subjectHits seqnames       correlation         Variability1         Variability2
      <integer>   <integer>    <Rle>         <numeric>            <numeric>            <numeric>
1             7          11     chr1 0.674068131604548  0.00454332740253749   0.0306386713792158
2            11           7     chr1 0.674068131604548   0.0306386713792158  0.00454332740253749
3            20          21     chr1 0.555556855508682   0.0263183019286269  0.00550013127461839
4            21          20     chr1 0.555556855508682  0.00550013127461839   0.0263183019286269
5            44          55     chr1 0.556849339818455   0.0328129362145881   0.0276923110094824
...         ...         ...      ...               ...                  ...                  ...
66065    140763      140761     chrX 0.544862329728945  0.00105524816663243  0.00103018596537361
66066    140819      140825     chrX  0.52077899590599   0.0082391986288855  0.00123800124356212
66067    140825      140819     chrX  0.52077899590599  0.00123800124356212   0.0082391986288855
66068    140825      140826     chrX 0.536143354146812  0.00123800124356212 0.000926987273521345
66069    140826      140825     chrX 0.536143354146812 0.000926987273521345  0.00123800124356212
                 TStat                 Pval                  FDR       VarQuantile1      VarQuantile2
             <numeric>            <numeric>            <numeric>          <numeric>         <numeric>
1     14.8753870882021 1.57568490812292e-41 2.25369520385051e-39  0.497425136231847 0.922787369881207
2     14.8753870882021 1.57568490812297e-41 2.25369520385051e-39  0.922786833059823 0.497426746695999
3     12.2600711823064 2.64734422862833e-30 1.16921566549878e-28  0.897581458618855 0.553941154713533
4     12.2600711823064 2.64734422862833e-30 1.16921566549878e-28  0.553940081070765 0.897582532261623
5     12.2885938249175 2.02269876477295e-30 9.03837469990896e-29  0.933085751311052 0.906216767401199
...                ...                  ...                  ...                ...               ...
66065 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28  0.113733662512206 0.109280192310893
66066 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25  0.664993931234254  0.14330607891167
66067 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25  0.143306615733054  0.66499607851979
66068    11.8316525493  1.4493708075588e-28 5.40902049408846e-27  0.143306615733054 0.092152369234337
66069    11.8316525493 1.44937080755876e-28 5.40902049408846e-27 0.0921534428771049  0.14330607891167

这个共可接近性数据DataFrame还具有一个metadata组件,该metadata组件包含相关峰的GRanges对象。上面提到的queryHitssubject的索引适用于这个GRanges对象:

> metadata(cA)[[1]]
GRanges object with 140865 ranges and 0 metadata columns:
        seqnames              ranges strand
           <Rle>           <IRanges>  <Rle>
   Mono     chr1       752495-752995      *
  CD4.N     chr1       757871-758371      *
  CD4.N     chr1       762690-763190      *
    GMP     chr1       773670-774170      *
      B     chr1       801006-801506      *
    ...      ...                 ...    ...
     NK     chrX 154807254-154807754      *
   Mono     chrX 154840907-154841407      *
   Mono     chrX 154841994-154842494      *
     NK     chrX 154862043-154862543      *
    GMP     chrX 154996991-154997491      *
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

上面的代码里,returnLoops = FALSE。如果我们设置returnLoops = TRUE,那么getCoAccessibility()将以循环track的形式返回co-accessibility信息。在这个GRanges对象中,IRanges的开始和结束映射到相互作用的两个不同的可共同接近的峰。分辨率参数设置这些循环的碱基对分辨率。当resolution = 1时,会创建连接每个峰中心的loop:

> cA <- getCoAccessibility(
    ArchRProj = projHeme5,
    corCutOff = 0.5,
    resolution = 1,
    returnLoops = TRUE
)

> cA[[1]]
GRanges object with 33035 ranges and 9 metadata columns:
          seqnames              ranges strand |       correlation         Variability1
             <Rle>           <IRanges>  <Rle> |         <numeric>            <numeric>
      [1]     chr1       845576-856623      * | 0.674068131604548  0.00454332740253749
      [2]     chr1       894703-895398      * | 0.555556855508682   0.0263183019286269
      [3]     chr1      968510-1004214      * | 0.556849339818455   0.0328129362145881
      [4]     chr1       974306-975108      * | 0.572146747607589   0.0093636446649745
      [5]     chr1     1072037-1107432      * | 0.541661438734924  0.00679983990004832
      ...      ...                 ...    ... .               ...                  ...
  [33031]     chrX 153276031-153306080      * | 0.782117525371192   0.0140283037671134
  [33032]     chrX 153306080-153342756      * | 0.538505319617638  0.00154044262912208
  [33033]     chrX 153591527-153594212      * | 0.544862329728945  0.00103018596537361
  [33034]     chrX 153941234-153959650      * |  0.52077899590599   0.0082391986288855
  [33035]     chrX 153959650-153960343      * | 0.536143354146812 0.000926987273521345
                 Variability2            TStat                 Pval                  FDR
                    <numeric>        <numeric>            <numeric>            <numeric>
      [1]  0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
      [2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
      [3]  0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
      [4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
      [5]   0.013335574910477 11.9534260620727 4.68165878439767e-29 1.83531999911093e-27
      ...                 ...              ...                  ...                  ...
  [33031]   0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
  [33032]   0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
  [33033] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
  [33034] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
  [33035] 0.00123800124356212    11.8316525493 1.44937080755876e-28 5.40902049408846e-27
                VarQuantile1      VarQuantile2             value
                   <numeric>         <numeric>         <numeric>
      [1]  0.497425136231847 0.922787369881207 0.674068131604548
      [2]  0.897581458618855 0.553941154713533 0.555556855508682
      [3]  0.933085751311052 0.906216767401199 0.556849339818455
      [4]  0.696956276435098 0.581836004288129 0.572146747607589
      [5]  0.613238444785505 0.773817288547399 0.541661438734924
      ...                ...               ...               ...
  [33031]  0.783751168257537 0.763801275165515 0.782117525371192
  [33032]  0.190906567848586 0.763801275165515 0.538505319617638
  [33033]  0.109280729132277 0.113733125690822 0.544862329728945
  [33034]  0.664993931234254  0.14330607891167  0.52077899590599
  [33035] 0.0921534428771049  0.14330607891167 0.536143354146812
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

相反,如果我们将loop的分辨率降低到resolution = 1000,这将有助于过度绘制协同可接近性相互作用。下面,我们看到GRanges对象中的条目总数比上面的少:

> cA <- getCoAccessibility(
    ArchRProj = projHeme5,
    corCutOff = 0.5,
    resolution = 1000,
    returnLoops = TRUE
)

> cA[[1]]
GRanges object with 31594 ranges and 9 metadata columns:
          seqnames              ranges strand |       correlation         Variability1
             <Rle>           <IRanges>  <Rle> |         <numeric>            <numeric>
      [1]     chr1       845500-856500      * | 0.674068131604548  0.00454332740253749
      [2]     chr1       894500-895500      * | 0.555556855508682   0.0263183019286269
      [3]     chr1      968500-1004500      * | 0.556849339818455   0.0328129362145881
      [4]     chr1       974500-975500      * | 0.572146747607589   0.0093636446649745
      [5]     chr1     1072500-1079500      * | 0.653852889650734   0.0171476615215405
      ...      ...                 ...    ... .               ...                  ...
  [31590]     chrX 153276500-153306500      * | 0.782117525371192   0.0140283037671134
  [31591]     chrX 153306500-153342500      * | 0.538505319617638  0.00154044262912208
  [31592]     chrX 153591500-153594500      * | 0.544862329728945  0.00103018596537361
  [31593]     chrX 153941500-153959500      * |  0.52077899590599   0.0082391986288855
  [31594]     chrX 153959500-153960500      * | 0.536143354146812 0.000926987273521345
                 Variability2            TStat                 Pval                  FDR
                    <numeric>        <numeric>            <numeric>            <numeric>
      [1]  0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
      [2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
      [3]  0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
      [4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
      [5]  0.0405524614744011 14.4292755824871 1.50202844679026e-39 1.72121316754705e-37
      ...                 ...              ...                  ...                  ...
  [31590]   0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
  [31591]   0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
  [31592] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
  [31593] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
  [31594] 0.00123800124356212    11.8316525493 1.44937080755876e-28 5.40902049408846e-27
                VarQuantile1      VarQuantile2             value
                   <numeric>         <numeric>         <numeric>
      [1]  0.497425136231847 0.922787369881207 0.674068131604548
      [2]  0.897581458618855 0.553941154713533 0.555556855508682
      [3]  0.933085751311052 0.906216767401199 0.556849339818455
      [4]  0.696956276435098 0.581836004288129 0.572146747607589
      [5]   0.82168833546183 0.961734834930109 0.653852889650734
      ...                ...               ...               ...
  [31590]  0.783751168257537 0.763801275165515 0.782117525371192
  [31591]  0.190906567848586 0.763801275165515 0.538505319617638
  [31592]  0.109280729132277 0.113733125690822 0.544862329728945
  [31593]  0.664993931234254  0.14330607891167  0.52077899590599
  [31594] 0.0921534428771049  0.14330607891167 0.536143354146812
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

类似地,如果我们进一步降低分辨率,使resolution = 10000,我们将识别出更少的协同可接近性相互作用:

> cA <- getCoAccessibility(
    ArchRProj = projHeme5,
    corCutOff = 0.5,
    resolution = 10000,
    returnLoops = TRUE
)

> cA[[1]]
GRanges object with 21257 ranges and 9 metadata columns:
          seqnames              ranges strand |       correlation         Variability1
             <Rle>           <IRanges>  <Rle> |         <numeric>            <numeric>
      [1]     chr1       845000-855000      * | 0.674068131604548  0.00454332740253749
      [2]     chr1              895000      * | 0.555556855508682   0.0263183019286269
      [3]     chr1      965000-1005000      * | 0.556849339818455   0.0328129362145881
      [4]     chr1              975000      * | 0.572146747607589   0.0093636446649745
      [5]     chr1             1075000      * | 0.653852889650734   0.0171476615215405
      ...      ...                 ...    ... .               ...                  ...
  [21253]     chrX 153275000-153305000      * | 0.782117525371192   0.0140283037671134
  [21254]     chrX 153305000-153345000      * | 0.538505319617638  0.00154044262912208
  [21255]     chrX           153595000      * | 0.544862329728945  0.00103018596537361
  [21256]     chrX 153945000-153955000      * |  0.52077899590599   0.0082391986288855
  [21257]     chrX 153955000-153965000      * | 0.536143354146812 0.000926987273521345
                 Variability2            TStat                 Pval                  FDR
                    <numeric>        <numeric>            <numeric>            <numeric>
      [1]  0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
      [2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
      [3]  0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
      [4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
      [5]  0.0405524614744011 14.4292755824871 1.50202844679026e-39 1.72121316754705e-37
      ...                 ...              ...                  ...                  ...
  [21253]   0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
  [21254]   0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
  [21255] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
  [21256] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
  [21257] 0.00123800124356212    11.8316525493 1.44937080755876e-28 5.40902049408846e-27
                VarQuantile1      VarQuantile2             value
                   <numeric>         <numeric>         <numeric>
      [1]  0.497425136231847 0.922787369881207 0.674068131604548
      [2]  0.897581458618855 0.553941154713533 0.555556855508682
      [3]  0.933085751311052 0.906216767401199 0.556849339818455
      [4]  0.696956276435098 0.581836004288129 0.572146747607589
      [5]   0.82168833546183 0.961734834930109 0.653852889650734
      ...                ...               ...               ...
  [21253]  0.783751168257537 0.763801275165515 0.782117525371192
  [21254]  0.190906567848586 0.763801275165515 0.538505319617638
  [21255]  0.109280729132277 0.113733125690822 0.544862329728945
  [21256]  0.664993931234254  0.14330607891167  0.52077899590599
  [21257] 0.0921534428771049  0.14330607891167 0.536143354146812
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

一旦我们在ArchRProject中添加了可接近性信息,我们就可以在绘制browser tracks时使用它作为loop track。我们通过plotBrowserTrack()函数的loop参数来实现这一点。这里,我们使用getCoAccessibility()的默认参数,包括corCutOff = 0.5resolution = 1000returnLoops = TRUE:

> markerGenes  <- c(
    "CD34", #Early Progenitor
    "GATA1", #Erythroid
    "PAX5", "MS4A1", #B-Cell Trajectory
    "CD14", #Monocytes
    "CD3D", "CD8A", "TBX21", "IL7R" #TCells
  )

> p <- plotBrowserTrack(
    ArchRProj = projHeme5, 
    groupBy = "Clusters2", 
    geneSymbol = markerGenes, 
    upstream = 50000,
    downstream = 50000,
    loops = getCoAccessibility(projHeme5)
)

画图:

> grid::grid.newpage()
> grid::grid.draw(p$CD14)

保存图片:

> plotPDF(plotList = p, 
    name = "Plot-Tracks-Marker-Genes-with-CoAccessibility.pdf", 
    ArchRProj = projHeme5, 
    addDOC = FALSE, width = 5, height = 5)

(三)ArchR的Peak2GeneLinkage分析

这一部分请注意!你需要安装最新版的ArchR,如果你安装的是1.0.0,这一部分是要报错的!安装新版的代码:
devtools::install_github("GreenleafLab/ArchR", ref="release_1.0.1", repos = BiocManager::repositories())
安装后为了确定确实是新版了,检查一下:
packageVersion("ArchR")
[1] ‘1.0.1’

与共可接近性相似,ArchR也可以鉴定所谓的“peak-to-gene links”。peak-to-gene links和共可接近性之间的主要区别是,共可接近性是一种仅ATAC-seq分析,寻找可接近性在两个峰之间的相关性,而peak-to-gene linkage利用整合的scRNA-seq数据来寻找峰可接近性和基因表达之间的相关性。这些代表了一个类似问题的正交方法。然而,由于peak-to-gene链接与scATAC-seq和scRNA-seq数据相关,我们通常认为这些links与基因调控相互作用更相关。

为了识别ArchR中的peak-to-gene links,我们使用addPeak2GeneLinks()函数:

> projHeme5 <- addPeak2GeneLinks(
    ArchRProj = projHeme5,
    reducedDims = "IterativeLSI"
)

然后,我们可以检索这些peak-to-gene links,其方式类似于使用getPeak2GeneLinks()函数检索协同可接近性相互作用。正如我们在前面看到的,这个函数允许用户指定相关性的cutoff和linkages的分辨率,首先我们尝试使用returnLoops = FALSE:

> p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1,
    returnLoops = FALSE
)

returnLoops设置为false时,该函数将返回一个DataFrame对象,类似于getCoAccessibility()返回的DataFrame对象。主要区别在于scATAC-seq峰的索引存储在idxATAC列中,而scRNA-seq基因的索引存储在idxRNA列中。

这个peak-to-gene linkage的dataframe还有一个metadata组件,该组件包含相关峰的GRanges对象。上述idxATAC的各项指标均适用于该对象。

> metadata(p2g)[[1]]
GRanges object with 140865 ranges and 0 metadata columns:
           seqnames              ranges strand
              <Rle>           <IRanges>  <Rle>
       [1]     chr1       752495-752995      *
       [2]     chr1       757871-758371      *
       [3]     chr1       762690-763190      *
       [4]     chr1       773670-774170      *
       [5]     chr1       801006-801506      *
       ...      ...                 ...    ...
  [140861]     chrX 154807254-154807754      *
  [140862]     chrX 154840907-154841407      *
  [140863]     chrX 154841994-154842494      *
  [140864]     chrX 154862043-154862543      *
  [140865]     chrX 154996991-154997491      *
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

如果我们设置returnLoops = TRUE,那么getPeak2GeneLinks()将返回一个loop track GRanges对象,该对象连接peak和gene。在共可接近性方面,IRanges对象的开始和结束代表了峰和被连接基因的位置。当resolution = 1时,它将峰的中心连接到基因的单碱基TSS上。

> p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1,
    returnLoops = TRUE
)

> p2g[[1]]
GRanges object with 151425 ranges and 2 metadata columns:
           seqnames              ranges strand |             value
              <Rle>           <IRanges>  <Rle> |         <numeric>
       [1]     chr1       752745-901877      * |  0.45482431317995
       [2]     chr1       752745-935552      * | 0.323287626544222
       [3]     chr1       752745-948847      * | 0.452241013027625
       [4]     chr1       762902-845576      * | 0.399410417826805
       [5]     chr1       762902-846795      * | 0.277657378194161
       ...      ...                 ...    ... .               ...
  [151421]     chrX 154299547-154322956      * | 0.339960030862308
  [151422]     chrX 154299547-154417246      * | 0.184827972853627
  [151423]     chrX 154299547-154445180      * | 0.234457969710443
  [151424]     chrX 154322956-154444701      * | 0.272574364121956
  [151425]     chrX 154444701-154493813      * | 0.262058933222826
                            FDR
                      <numeric>
       [1] 3.64903693825748e-25
       [2] 1.36105300270992e-12
       [3]  7.3782745193054e-25
       [4] 3.72871411563353e-19
       [5] 1.80208567788527e-09
       ...                  ...
  [151421] 7.03940344317308e-14
  [151422] 9.55454380815343e-05
  [151423]  5.1496329871895e-07
  [151424] 3.70925182517263e-09
  [151425] 1.57206595271312e-08
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

我们也可以通过设置resolution = 1000来降低这些links的分辨率。这主要用于在browser tracks时绘制links,因为在某些情况下,许多峰的附近都链接到同一基因,这可能很难可视化。

> p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1000,
    returnLoops = TRUE
)

> p2g[[1]]
GRanges object with 37320 ranges and 2 metadata columns:
          seqnames              ranges strand |             value
             <Rle>           <IRanges>  <Rle> |         <numeric>
      [1]     chr1       752500-901500      * |  0.45482431317995
      [2]     chr1       752500-948500      * | 0.452241013027625
      [3]     chr1       762500-856500      * | 0.510133486947831
      [4]     chr1       762500-901500      * | 0.465953111794918
      [5]     chr1       762500-999500      * | 0.622693553092798
      ...      ...                 ...    ... .               ...
  [37316]     chrX 153881500-153980500      * | 0.686663730276686
  [37317]     chrX 153941500-153979500      * | 0.461993816785105
  [37318]     chrX 153946500-153991500      * | 0.493712263022568
  [37319]     chrX 153980500-153991500      * |  0.54404072487661
  [37320]     chrX 153991500-154012500      * | 0.454090683118072
                           FDR
                     <numeric>
      [1] 3.64903693825748e-25
      [2]  7.3782745193054e-25
      [3] 2.15908329887624e-32
      [4] 1.63066517085523e-26
      [5] 4.50860678157444e-52
      ...                  ...
  [37316] 2.08980297150101e-67
  [37317] 4.99062691501854e-26
  [37318] 4.19132973592781e-30
  [37319] 1.54059545203608e-37
  [37320] 4.45896733710535e-25
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

继续降低分辨率,甚至可能会减少总的peak-to-gene links鉴定数量。

> p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 10000,
    returnLoops = TRUE
)

> p2g[[1]]
GRanges object with 30123 ranges and 2 metadata columns:
          seqnames              ranges strand |             value
             <Rle>           <IRanges>  <Rle> |         <numeric>
      [1]     chr1       755000-905000      * |  0.45482431317995
      [2]     chr1       755000-945000      * | 0.452241013027625
      [3]     chr1       765000-855000      * | 0.510133486947831
      [4]     chr1       765000-905000      * | 0.465953111794918
      [5]     chr1       765000-995000      * | 0.622693553092798
      ...      ...                 ...    ... .               ...
  [30119]     chrX 153885000-153985000      * | 0.686663730276686
  [30120]     chrX 153945000-153975000      * | 0.461993816785105
  [30121]     chrX 153945000-153995000      * | 0.493712263022568
  [30122]     chrX 153985000-153995000      * |  0.54404072487661
  [30123]     chrX 153995000-154015000      * | 0.454090683118072
                           FDR
                     <numeric>
      [1] 3.64903693825748e-25
      [2]  7.3782745193054e-25
      [3] 2.15908329887624e-32
      [4] 1.63066517085523e-26
      [5] 4.50860678157444e-52
      ...                  ...
  [30119] 2.08980297150101e-67
  [30120] 4.99062691501854e-26
  [30121] 4.19132973592781e-30
  [30122] 1.54059545203608e-37
  [30123] 4.45896733710535e-25
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths
(1)用browser tracks绘制peak-to-gene links

为了将这些peak-to-gene链接绘制成browser track,我们使用了与前一节中显示的可接近性相同的分析流程。这里我们使用plotBrowserTrack()函数:

> markerGenes  <- c(
    "CD34", #Early Progenitor
    "GATA1", #Erythroid
    "PAX5", "MS4A1", #B-Cell Trajectory
    "CD14", #Monocytes
    "CD3D", "CD8A", "TBX21", "IL7R" #TCells
  )

> p <- plotBrowserTrack(
    ArchRProj = projHeme5, 
    groupBy = "Clusters2", 
    geneSymbol = markerGenes, 
    upstream = 50000,
    downstream = 50000,
    loops = getPeak2GeneLinks(projHeme5)
)

画某一个基因的peak-to-gene links:

> grid::grid.newpage()
> grid::grid.draw(p$CD14)

保存图片:

> plotPDF(plotList = p, 
    name = "Plot-Tracks-Marker-Genes-with-Peak2GeneLinks.pdf", 
    ArchRProj = projHeme5, 
    addDOC = FALSE, width = 5, height = 5)
(2)peak-to-gene links分析的热图

为了可视化所有peak-to-gene links的对应关系,我们可以绘制一个peak-to-gene热图,其中包含两个并排的热图,一个用于我们的scATAC-seq数据,另一个用于我们的scRNA-seq数据。为此,我们使用plotPeak2GeneHeatmap():

> p <- plotPeak2GeneHeatmap(ArchRProj = projHeme5, groupBy = "Clusters2")
> p

根据传递给参数“k”的值,使用k-means聚类对heatmap行进行聚类,参数“k”默认为25,如下图所示:

(四)鉴定正向调控分子

ATAC-seq允许对在含有DNA结合motif的位点上染色质可接近性发生巨大变化的TFs进行无偏差的鉴定。然而,当通过位置权重矩阵(PWMs)进行聚合时,TFs家族(例如GATA因子)在其结合motif上具有相似的特征。

这种motif相似性使得识别可能在预测结合位点上引起染色质可接近性变化的特定TFs具有挑战性。为了规避这一难题,我们之前用ATAC-seq和RNA-seq来鉴定TFs的基因表达与其相应motif可接近性的改变呈正相关。我们称这些TFs为“正向调控分子”。然而,这种分析依赖于匹配的基因表达数据,这可能在所有的实验中并不容易得到。为了克服这种依赖性,ArchR可以识别出推断的基因评分与其chromVAR TF偏差z-score相关的TFs。为了实现这一点,ArchR将低重叠细胞聚集的TF motif的染色质偏差z分数与TF基因活性分数关联起来。当使用与ArchR结合的scRNA-seq整合时,可以使用TF的基因表达,而不是推断的基因活性评分。

(1)鉴定偏移的TF motif

识别正向调控的TF的第一部分是识别偏移的TF motif。我们在前一章中进行了此分析,为所有的motifs创建了一个chromVAR偏差和偏差z-score的MotifMatrix。我们可以通过使用getGroupSE()函数来获得这些按cluster平均的数据,该函数返回一个SummarizedExperiment:

> seGroupMotif <- getGroupSE(ArchRProj = projHeme5, useMatrix = "MotifMatrix", groupBy = "Clusters2")

因为这个SummarizedExperiment对象来自MotifMatrix,所以有两个seqname -“deviations”和“z”-对应于原始偏离和来自chromVAR的偏离z-score。

> seGroupMotif
class: SummarizedExperiment 
dim: 1740 11 
metadata(0):
assays(1): MotifMatrix
rownames(1740): f1 f2 ... f1739 f1740
rowData names(3): seqnames idx name
colnames(11): B CD4.M ... PreB Progenitor
colData names(20): TSSEnrichment ReadsInTSS ... FRIP nCells

我们可以把这个SummarizedExperiment取子集化,只取偏差z-score:

> seZ <- seGroupMotif[rowData(seGroupMotif)$seqnames=="z",]

然后我们可以确定所有cluster之间z-score的最大delta。这将有助于分层motifs的(基于不同cluster之间的偏差程度):

> rowData(seZ)$maxDelta <- lapply(seq_len(ncol(seZ)), function(x){
  rowMaxs(assay(seZ) - assay(seZ)[,x])
}) %>% Reduce("cbind", .) %>% rowMaxs
(2)鉴定TF motifs和TF基因分数/表达的相关性

为了识别TFs motif可接近性与其自身基因活性(通过基因得分或基因表达)的相关性,我们使用correlateMatrices()函数并提供我们感兴趣的两个矩阵,在本例中为GeneScoreMatrixMotifMatrix。如前所述,这些相关性是在reducedDims参数中指定的低维空间中确定的许多低重叠的细胞集合中确定的:

> corGSM_MM <- correlateMatrices(
    ArchRProj = projHeme5,
    useMatrix1 = "GeneScoreMatrix",
    useMatrix2 = "MotifMatrix",
    reducedDims = "IterativeLSI"
)

此函数返回一个DataFrame对象,该对象包含来自每个矩阵的元素以及跨低重叠细胞聚合的相关性:

> corGSM_MM
DataFrame with 825 rows and 14 columns
    GeneScoreMatrix_name MotifMatrix_name                 cor
                 <array>          <array>           <numeric>
1                   HES4          HES4_95  0.0964222952027804
2                   HES5          HES5_98   0.389610876712439
3                 PRDM16       PRDM16_211   0.510854729692758
4                   TP73         TP73_705   0.528529747990005
5               TP73-AS1         TP73_705  -0.122421749585615
...                  ...              ...                 ...
821                TFDP3        TFDP3_309  -0.114905134028301
822               ZNF75D       ZNF75D_272 -0.0780911768193981
823                 ZIC3         ZIC3_215   0.319287832283537
824                 SOX3         SOX3_759 -0.0830413458998048
825                MECP2        MECP2_645   0.143645914605916
                    padj                 pval GeneScoreMatrix_seqnames
               <numeric>            <numeric>                    <Rle>
1                      1   0.0330289085440269                     chr1
2   2.91738527794111e-16 3.56648566985466e-19                     chr1
3   6.00864881004139e-31 7.34553644259338e-34                     chr1
4   1.26528717256258e-33 1.54680583442858e-36                     chr1
5                      1   0.0067196460862283                     chr1
...                  ...                  ...                      ...
821                    1   0.0109948840206202                     chrX
822                    1   0.0845143633274378                     chrX
823 3.87543709985036e-10 4.73769816607624e-13                     chrX
824                    1   0.0665364665390267                     chrX
825                    1  0.00144806861367396                     chrX
    GeneScoreMatrix_start GeneScoreMatrix_end GeneScoreMatrix_strand
                  <array>             <array>                <array>
1                  935552              934342                      2
2                 2461684             2460184                      2
3                 2985742             3355185                      1
4                 3569129             3652765                      1
5                 3663937             3652548                      2
...                   ...                 ...                    ...
821             132352376           132350697                      2
822             134429965           134419723                      2
823             136648346           136654259                      1
824             139587225           139585152                      2
825             153363188           153287264                      2
    GeneScoreMatrix_idx GeneScoreMatrix_matchName MotifMatrix_seqnames
                <array>                   <array>                <Rle>
1                    15                      HES4                    z
2                    74                      HES5                    z
3                    82                    PRDM16                    z
4                    89                      TP73                    z
5                    90                      TP73                    z
...                 ...                       ...                  ...
821                 697                     TFDP3                    z
822                 728                    ZNF75D                    z
823                 753                      ZIC3                    z
824                 765                      SOX3                    z
825                 874                     MECP2                    z
    MotifMatrix_idx MotifMatrix_matchName
            <array>               <array>
1                95                  HES4
2                98                  HES5
3               211                PRDM16
4               705                  TP73
5               705                  TP73
...             ...                   ...
821             309                 TFDP3
822             272                ZNF75D
823             215                  ZIC3
824             759                  SOX3
825             645                 MECP2

我们可以使用GeneIntegrationMatrix代替GeneScoreMatrix进行同样的分析:

> corGIM_MM <- correlateMatrices(
  ArchRProj = projHeme5,
  useMatrix1 = "GeneIntegrationMatrix",
  useMatrix2 = "MotifMatrix",
  reducedDims = "IterativeLSI"
)
> corGIM_MM
DataFrame with 798 rows and 14 columns
    GeneIntegrationMatrix_name MotifMatrix_name                cor
                       <array>          <array>          <numeric>
1                         HES4          HES4_95 -0.171384934997311
2                         HES5          HES5_98 -0.169840367432083
3                       PRDM16       PRDM16_211  0.252222396399267
4                         TP73         TP73_705 -0.254496075318124
5                         HES2          HES2_19 -0.320907300946354
...                        ...              ...                ...
794                      TFDP3        TFDP3_309                 NA
795                     ZNF75D       ZNF75D_272  0.281512276845826
796                       ZIC3         ZIC3_215                 NA
797                       SOX3         SOX3_759                 NA
798                      MECP2        MECP2_645  0.189446439244001
                    padj                 pval GeneIntegrationMatrix_seqnames
               <numeric>            <numeric>                          <Rle>
1     0.0745474147720114 0.000139863817583511                           chr1
2     0.0857722946054064 0.000160923629653671                           chr1
3   8.31213157194671e-06 1.55949935683803e-08                           chr1
4   6.10714657063681e-06 1.14580611081366e-08                           chr1
5   1.89680123127737e-10 3.55872651271552e-13                           chr1
...                  ...                  ...                            ...
794                   NA                   NA                           chrX
795 1.24170853148747e-07 2.32965953374759e-10                           chrX
796                   NA                   NA                           chrX
797                   NA                   NA                           chrX
798   0.0132063045846304 2.47773069130027e-05                           chrX
    GeneIntegrationMatrix_start GeneIntegrationMatrix_end
                        <array>                   <array>
1                        935552                    934342
2                       2461684                   2460184
3                       2985742                   3355185
4                       3569129                   3652765
5                       6484730                   6472498
...                         ...                       ...
794                   132352376                 132350697
795                   134429965                 134419723
796                   136648346                 136654259
797                   139587225                 139585152
798                   153363188                 153287264
    GeneIntegrationMatrix_strand GeneIntegrationMatrix_idx
                         <array>                   <array>
1                              2                         8
2                              2                        53
3                              1                        59
4                              1                        64
5                              2                        81
...                          ...                       ...
794                            2                       562
795                            2                       576
796                            1                       595
797                            2                       602
798                            2                       680
    GeneIntegrationMatrix_matchName MotifMatrix_seqnames MotifMatrix_idx
                            <array>                <Rle>         <array>
1                              HES4                    z              95
2                              HES5                    z              98
3                            PRDM16                    z             211
4                              TP73                    z             705
5                              HES2                    z              19
...                             ...                  ...             ...
794                           TFDP3                    z             309
795                          ZNF75D                    z             272
796                            ZIC3                    z             215
797                            SOX3                    z             759
798                           MECP2                    z             645
    MotifMatrix_matchName
                  <array>
1                    HES4
2                    HES5
3                  PRDM16
4                    TP73
5                    HES2
...                   ...
794                 TFDP3
795                ZNF75D
796                  ZIC3
797                  SOX3
798                 MECP2
(3)添加Maximum Delta Deviation到相关性Dataframe

对于每一个相关分析,我们可以用我们在步骤1中计算出的cluster间观察到的最大delta注释每个motif。

> corGSM_MM$maxDelta <- rowData(seZ)[match(corGSM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
> corGIM_MM$maxDelta <- rowData(seZ)[match(corGIM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
(4)鉴定正向调控TF

我们可以利用所有这些信息来鉴定正向调控的TF。在下面的例子中,我们认为motif与基因评分(或基因表达)的相关性大于0.5,且调整后的p值小于0.01,且偏差z-score的聚类间最大差异位于前四分位数的TFs为正调控因子。

我们应用这些标准来进行选择,并做一些文本处理来提取TF名称:

> corGSM_MM <- corGSM_MM[order(abs(corGSM_MM$cor), decreasing = TRUE), ]
> corGSM_MM <- corGSM_MM[which(!duplicated(gsub("\\-.*","",corGSM_MM[,"MotifMatrix_name"]))), ]
> corGSM_MM$TFRegulator <- "NO"
> corGSM_MM$TFRegulator[which(corGSM_MM$cor > 0.5 & corGSM_MM$padj < 0.01 & corGSM_MM$maxDelta > quantile(corGSM_MM$maxDelta, 0.75))] <- "YES"
> sort(corGSM_MM[corGSM_MM$TFRegulator=="YES",1])
 [1] "ATOH1"    "BCL11A"   "CEBPA-DT" "CEBPB"    "CEBPD"    "CREB1"   
 [7] "CREB3L4"  "EBF1"     "EGR2"     "EOMES"    "ERF"      "ESR1"    
[13] "ETS1"     "FUBP1"    "GABPA"    "GATA1"    "GATA2"    "GATA5"   
[19] "GATA6"    "IRF1"     "JDP2"     "KLF11"    "KLF2"     "LYL1"    
[25] "MECOM"    "MITF"     "MYOD1"    "NFE2"     "NFIA"     "NFIC"    
[31] "NFIX"     "PAX5"     "POU2F1"   "PRDM16"   "RUNX2"    "SIX5"    
[37] "SMAD1"    "SP4"      "SPI1"     "SPIB"     "TAL1"     "TCF15"   
[43] "TCF23"    "TFAP2C"   "TWIST1"   "YY1"  

从基因分数和motif偏差z-score中识别出这些正向调控的TF,我们可以在点图中突出它们。

> p <- ggplot(data.frame(corGSM_MM), aes(cor, maxDelta, color = TFRegulator)) +
  geom_point() + 
  theme_ArchR() +
  geom_vline(xintercept = 0, lty = "dashed") + 
  scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
  xlab("Correlation To Gene Score") +
  ylab("Max TF Motif Delta") +
  scale_y_continuous(
    expand = c(0,0), 
    limits = c(0, max(corGSM_MM$maxDelta)*1.05)
  )

> p

我们可以对从GeneIntegrationMatrix中得到的相关进行同样的分析:

> corGIM_MM <- corGIM_MM[order(abs(corGIM_MM$cor), decreasing = TRUE), ]
> corGIM_MM <- corGIM_MM[which(!duplicated(gsub("\\-.*","",corGIM_MM[,"MotifMatrix_name"]))), ]
> corGIM_MM$TFRegulator <- "NO"
> corGIM_MM$TFRegulator[which(corGIM_MM$cor > 0.5 & corGIM_MM$padj < 0.01 & corGIM_MM$maxDelta > quantile(corGIM_MM$maxDelta, 0.75))] <- "YES"
> sort(corGIM_MM[corGIM_MM$TFRegulator=="YES",1])
 [1] "BACH1" "CEBPA" "CEBPB" "CEBPD" "CTCF"  "EBF1"  "EOMES"
 [8] "ETS1"  "FOS"   "FOSB"  "FOSL1" "FOSL2" "GATA1" "GATA2"
[15] "IRF1"  "IRF2"  "IRF9"  "JDP2"  "KLF2"  "MITF"  "NFE2" 
[22] "NFIA"  "NFIB"  "NFIC"  "NFIX"  "NFKB2" "NR4A1" "NRF1" 
[29] "PAX5"  "RARA"  "RUNX1" "SP4"   "SPI1"  "STAT2" "TCF3" 
[36] "TCF4"  "TGIF2"
> p <- ggplot(data.frame(corGIM_MM), aes(cor, maxDelta, color = TFRegulator)) +
  geom_point() + 
  theme_ArchR() +
  geom_vline(xintercept = 0, lty = "dashed") + 
  scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
  xlab("Correlation To Gene Expression") +
  ylab("Max TF Motif Delta") +
  scale_y_continuous(
    expand = c(0,0), 
    limits = c(0, max(corGIM_MM$maxDelta)*1.05)
  )

> p
©著作权归作者所有,转载或内容合作请联系作者
禁止转载,如需转载请通过简信或评论联系作者。
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 206,723评论 6 481
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 88,485评论 2 382
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 152,998评论 0 344
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 55,323评论 1 279
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 64,355评论 5 374
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,079评论 1 285
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,389评论 3 400
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,019评论 0 259
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 43,519评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,971评论 2 325
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,100评论 1 333
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,738评论 4 324
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,293评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,289评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,517评论 1 262
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,547评论 2 354
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,834评论 2 345

推荐阅读更多精彩内容