Compiled: 2021-03-18
Source: vignettes/multimodal_vignette.Rmd
教程链接:https://satijalab.org/seurat/articles/multimodal_vignette.html
对同一个细胞同时测量多种数据类型,称为multimodal analysis。例如,
- CITE-seq可以同时测量同一细胞的转录组和细胞表面蛋白。
- 10x multiome试剂盒允许成对测量细胞转录组和染色质可及性(即scRNA-seq+scATAC-seq)
我们设计了Seurat4以实现对多种多模态单细胞数据集的无缝存储、分析和探索。
1.数据介绍和加载
此篇教程,我们使用8,617 cord blood mononuclear cells (CBMCs)脐带血单个核细胞,配对的单细胞转录组数据和11个表面蛋白。
首先,我们加载进来两个count矩阵,一个是RNA,一个是antibody-derived tags (ADT)。数据可以从这里下载:
- RNA:ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100866/suppl/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz
- ADT:ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100866/suppl/GSE100866_CBMC_8K_13AB_10X-ADT_umi.csv.gz
# Note that this dataset also contains ~5% of mouse cells, which we can use as negative controls
# for the protein measurements. For this reason, the gene expression matrix has HUMAN_ or MOUSE_
# appended to the beginning of each gene.
cbmc.rna <- as.sparse(read.csv(file = "data/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz", sep = ",", header = TRUE, row.names = 1))
# To make life a bit easier going forward, we're going to discard all but the top 100 most
# highly expressed mouse genes, and remove the 'HUMAN_' from the CITE-seq prefix
cbmc.rna <- CollapseSpeciesExpressionMatrix(cbmc.rna)
# Load in the ADT UMI matrix
cbmc.adt <- as.sparse(read.csv(file = "data/GSE100866_CBMC_8K_13AB_10X-ADT_umi.csv.gz", sep = ",", header = TRUE, row.names = 1))
# Note that since measurements were made in the same cells, the two matrices have identical
# column names
all.equal(colnames(cbmc.rna), colnames(cbmc.adt))
2.构造Seurat对象
现在,我们构造一个Seurat对象,然后将ADT数据添加为第二个assay
# creates a Seurat object based on the scRNA-seq data
cbmc <- CreateSeuratObject(counts = cbmc.rna)
# We can see that by default, the cbmc object contains an assay storing RNA measurement
Assays(cbmc)
[1] "RNA"
# create a new assay to store ADT information
adt_assay <- CreateAssayObject(counts = cbmc.adt)
# add this assay to the previously created Seurat object
cbmc[["ADT"]] <- adt_assay
# Validate that the object now contains multiple assays
Assays(cbmc)
[1] "RNA" "ADT"
提取ADT中测量的基因
# Extract a list of features measured in the ADT assay
rownames(cbmc[["ADT"]])
[1] "CD3" "CD4" "CD8" "CD45RA" "CD56" "CD16" "CD10" "CD11c" "CD14" "CD19" "CD34" "CCR5" "CCR7"
对于assay,我们可以在RNA和ADT之间来回切换以便于后续的分析和可视化
# List the current default assay
DefaultAssay(cbmc)
# Switch the default to ADT
DefaultAssay(cbmc) <- "ADT"
DefaultAssay(cbmc)
3.基于单细胞数据对细胞进行聚类
这个地方使用的是一个简版必要步骤的代码,省略了很多参数和可视化的解读,如果要看详细的,还是需要去看那篇入门教程:https://satijalab.org/seurat/articles/pbmc3k_tutorial.html,中文版://www.greatytc.com/p/2b5c5c849ec0
# Note that all operations below are performed on the RNA assay Set and verify that the default assay is RNA
DefaultAssay(cbmc) <- "RNA"
DefaultAssay(cbmc)
# perform visualization and clustering steps
cbmc <- NormalizeData(cbmc)
cbmc <- FindVariableFeatures(cbmc)
cbmc <- ScaleData(cbmc)
cbmc <- RunPCA(cbmc, verbose = FALSE)
cbmc <- FindNeighbors(cbmc, dims = 1:30)
cbmc <- FindClusters(cbmc, resolution = 0.8, verbose = FALSE)
cbmc <- RunUMAP(cbmc, dims = 1:30)
DimPlot(cbmc, label = TRUE)
4.并排可视化多种模式
现在,经过前面的处理,我们有了单细胞数据的聚类信息,我们就可以对RNA或蛋白的表达进行可视化。值得注意的是:Seurat提供了好几种方法在不同模态数据之间进行切换,这一点尤其重要,因为在某些情况下,相同的特征可以以多种形式出现——例如,该数据集包含B细胞标记物CD19(蛋白质和RNA水平)的独立测量。
# Normalize ADT data,
DefaultAssay(cbmc) <- "ADT"
cbmc <- NormalizeData(cbmc, normalization.method = "CLR", margin = 2)
DefaultAssay(cbmc) <- "RNA"
# Note that the following command is an alternative but returns the same result
cbmc <- NormalizeData(cbmc, normalization.method = "CLR", margin = 2, assay = "ADT")
# Now, we will visualize CD14 levels for RNA and protein By setting the default assay, we can
# visualize one or the other
DefaultAssay(cbmc) <- "ADT"
p1 <- FeaturePlot(cbmc, "CD19", cols = c("lightgrey", "darkgreen")) + ggtitle("CD19 protein")
DefaultAssay(cbmc) <- "RNA"
p2 <- FeaturePlot(cbmc, "CD19") + ggtitle("CD19 RNA")
# place plots side-by-side
p1 | p2
或者,我们可以使用特定的assay关键词来指定特定的方式识别RNA和蛋白质assay关键词
# Alternately, we can use specific assay keys to specify a specific modality Identify the key for the RNA and protein assays
Key(cbmc[["RNA"]])
[1] "rna_"
Key(cbmc[["ADT"]])
[1] "adt_"
现在,我们可以在基因名字中加上key进行可视化
# Now, we can include the key in the feature name, which overrides the default assay
p1 <- FeaturePlot(cbmc, "adt_CD19", cols = c("lightgrey", "darkgreen")) + ggtitle("CD19 protein")
p2 <- FeaturePlot(cbmc, "rna_CD19") + ggtitle("CD19 RNA")
p1 | p2
5.识别scRNA-seq簇的细胞表面标记物
我们可以利用我们配对的CITE-seq测量来帮助注释来自scRNA-seq的cluster,并识别蛋白质和RNA marker。
# as we know that CD19 is a B cell marker, we can identify cluster 5 as expressing CD19 on the surface
VlnPlot(cbmc, "adt_CD19")
我们也可以通过差异表达来筛选候选的cluster的蛋白和RNA marker
adt_markers <- FindMarkers(cbmc, ident.1 = 5, assay = "ADT")
rna_markers <- FindMarkers(cbmc, ident.1 = 5, assay = "RNA")
head(adt_markers)
p_val avg_logFC pct.1 pct.2 p_val_adj
CD19 7.164366e-218 2.0582747 1 1 9.313675e-217
CD45RA 7.330397e-110 0.8163007 1 1 9.529515e-109
CD4 1.736704e-108 -1.1652553 1 1 2.257715e-107
CD14 9.016660e-106 -0.7940111 1 1 1.172166e-104
CD3 9.578480e-89 -1.1131282 1 1 1.245202e-87
CD8 1.218701e-18 -0.8828616 1 1 1.584311e-17
head(rna_markers)
p_val avg_logFC pct.1 pct.2 p_val_adj
IGHM 0 3.260649 0.971 0.044 0
TCL1A 0 2.917187 0.902 0.028 0
CD79A 0 2.888065 0.957 0.045 0
CD79B 0 2.615201 0.945 0.089 0
IGLC2 0 2.591397 0.286 0.005 0
MS4A1 0 2.493754 0.850 0.016 0
6.多模态数据的额外可视化
绘制ADT散点图(如FACS的双轴图)。请注意,如果需要,您甚至可以使用HoverLocator和FeatureLocator来“gate”细胞。
FeatureScatter(cbmc, feature1 = "adt_CD19", feature2 = "adt_CD3")
# view relationship between protein and RNA
FeatureScatter(cbmc, feature1 = "adt_CD3", feature2 = "rna_CD3E")
与RNA数据相比,蛋白的原始count要高很多。
# Let's look at the raw (non-normalized) ADT counts. You can see the values are quite high, particularly in comparison to RNA values. This is due to the significantly higher protein copy number in cells, which significantly reduces 'drop-out' in ADT data
FeatureScatter(cbmc, feature1 = "adt_CD4", feature2 = "adt_CD8", slot = "counts")
7.加载来自10X多模态实验的数据
Seurat还能够分析CellRanger v3处理的多模态10X实验数据;例如,我们使用10X Genomics免费提供的7,900个外周血单个核细胞(PBMC)数据集重建上面的图。
pbmc10k.data <- Read10X(data.dir = "../data/pbmc10k/filtered_feature_bc_matrix/")
rownames(x = pbmc10k.data[["Antibody Capture"]]) <- gsub(pattern = "_[control_]*TotalSeqB", replacement = "", x = rownames(x = pbmc10k.data[["Antibody Capture"]]))
pbmc10k <- CreateSeuratObject(counts = pbmc10k.data[["Gene Expression"]], min.cells = 3, min.features = 200)
pbmc10k <- NormalizeData(pbmc10k)
pbmc10k[["ADT"]] <- CreateAssayObject(pbmc10k.data[["Antibody Capture"]][, colnames(x = pbmc10k)])
pbmc10k <- NormalizeData(pbmc10k, assay = "ADT", normalization.method = "CLR")
plot1 <- FeatureScatter(pbmc10k, feature1 = "adt_CD19", feature2 = "adt_CD3", pt.size = 1)
plot2 <- FeatureScatter(pbmc10k, feature1 = "adt_CD4", feature2 = "adt_CD8a", pt.size = 1)
plot3 <- FeatureScatter(pbmc10k, feature1 = "adt_CD3", feature2 = "CD3E", pt.size = 1)
(plot1 + plot2 + plot3) & NoLegend()
8.Seurat多模态数据的附加功能
Seurat v4还包括用于分析、可视化和集成多模态数据集的附加功能。欲了解更多信息,请浏览以下资源:
- Defining cellular identity from multimodal data using WNN analysis in Seurat v4 vignette link
- Mapping scRNA-seq data onto CITE-seq references [vignette]
- Introduction to the analysis of spatial transcriptomics analysis [vignette]
- Analysis of 10x multiome (paired scRNA-seq + ATAC) using WNN analysis [vignette]
- Signac: Analysis, interpretation, and exploration of single-cell chromatin datasets [package]
- Mixscape: an analytical toolkit for pooled single-cell genetic screens [vignette]