单细胞自动注释之Garnett

在使用Seurat包得到降维聚类分群的结果后，就可以对数据进行注释了。Garnett是Cole Trapnell 实验室实验室开发的自动注释R包，使用机器学习算法来预测细胞类型，可以支持训练自己的分类器。

本文的步骤是：加载数据，创建CDS对象，数据预处理，根据自己的marker基因训练分类器，对细胞进行注释，并可视化结果。

代码主要来自官网https://cole-trapnell-lab.github.io/garnett/

1.R包和数据

library(Seurat)
library(tidyverse)
library(patchwork)
library(org.Hs.eg.db)
library(garnett)

这是一个Seurat对象，是先前由Seurat包执行标准流程所得到的，我使用的的示例数据可以在生信星球回复"953anno"获取。

rm(list=ls())
load("obj.Rdata")

2.创建CDS对象

CDS是monocle3包可以处理的对象，因为是同一个团队研发的，所以是同一个对象

data <- GetAssayData(sce.all, assay = 'RNA', layer = 'counts')
cell_metadata <- sce.all@meta.data
gene_annotation <- data.frame(gene_short_name = rownames(data))
rownames(gene_annotation) <- rownames(data)
cds <- new_cell_data_set(data,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)

3.预处理CDS对象

这一步相当于Seurat中的NormalizeData+ScaleData+RunPCA步骤。

现在使用的数据GSE146026，是log(TPM/10+1),所以在Seurat分析流程中，虽然是存在“counts”里面，但它是lognormalize之后的数据，所以在这里加了一个参数norm_method = “none”

经过我的一番深入搜索，米有找到该如何用harmony之后的结果来衔接，但是monocle3里面有个去除批次效应的函数align_cds，如果是多样本数据要整合那就要运行一下这一句。

cds <- preprocess_cds(cds, 
                      num_dim = 30,
                      norm_method = "none")
cds <- align_cds(cds, alignment_group = "orig.ident")

4.训练细胞分类器

要先检查下自己准备的marker基因

根据官网的要求，需要整理成这样的格式：

所以我的test.txt文件是这样的:

cat(readLines("test.txt"),sep = "\n")

## 
## >Bcells
## expressed: CD19, CD79A, CD79B 
## 
## >CAFs
## expressed: PDPN, DCN, THY1 
## 
## >DC
## expressed: CD1C, CD1E, CCR7, CD83 
## 
## >epithelial
## expressed: EPCAM 
## 
## >erythrocytes
## expressed: GATA1 
## 
## >macrophages
## expressed: CD14, AIF1, CSF1R, CD163 
## 
## >Tcells
## expressed: CD2, CD3D, CD3E, CD3G

marker_check <- check_markers(cds, marker_file = "test.txt",
                              db=org.Hs.eg.db,
                              cds_gene_id_type = "SYMBOL",
                              marker_file_gene_id_type = "SYMBOL")
plot_markers(marker_check)

这一步运行超级的慢,作用是根据自己的细胞与marker基因对应关系文件训练分类器。所以设置了一个存在即跳过。

class_file <- "my_classifier.Rdata"
if(!file.exists(class_file)){
  my_classifier <- train_cell_classifier(cds = cds,
                                           marker_file = "test.txt",
                                           db = org.Hs.eg.db,
                                           cds_gene_id_type = "SYMBOL",
                                           num_unknown = 50,
                                           marker_file_gene_id_type = "SYMBOL")
  save(my_classifier, file = class_file)
} else {
  load(class_file)
}

也可以在官网上面查查有适合直接用的就直接用

5.使用分类器对细胞进行分类

我们将使用训练好的分类器来对细胞进行分类，并使用cluster_extend = TRUE这个参数来试图对每个seurat做出来的细胞亚群分配一个类型。

看帮助文档里面有说：

Logical. When TRUE, the classifier provides a secondary cluster-extended classification, which assigns type for the entire cluster based on the assignments of the cluster members.

If the colData table of the input CDS has a column called “garnett_cluster”, this will be used for cluster-extended assignments.

所以我们在pdata中添加garnett_cluster列，内容就是seurat降维结果，这样就可以按照亚群来注释了，省的有些乱。

我试了一下只是这样设置还是会有一些亚群会被注释成多种类型，还有两个相关的参数

cluster_extend_max_frac_unknown: 默认为0.95，意味着允许一个cluster中最多95%的细胞是未知类型，仍然可以为整个cluster分配一个类型。 cluster_extend_max_frac_incorrect: 默认为0.1，意味着允许一个cluster中最多10%的细胞被错误分类，仍然可以为整个cluster分配一个类型。

而可以自行调整，如果两个都调成1就100%分配一个类型了。

pData(cds)$garnett_cluster <- pData(cds)$seurat_clusters
cds <- classify_cells(cds,
                      my_classifier, 
                      db = org.Hs.eg.db,
                      cluster_extend = TRUE,
                      cluster_extend_max_frac_incorrect = 1,
                      cds_gene_id_type = "SYMBOL")

6.可视化结果

可以用Seurat的DimPlot函数

cds.meta <- subset(pData(cds), select = c("cell_type", "cluster_ext_type")) %>% as.data.frame()
sce.all <- AddMetaData(sce.all, metadata = cds.meta)
p1 = DimPlot(sce.all, reduction = "umap", group.by = "cell_type") + theme_bw()
p2 = DimPlot(sce.all, reduction = "umap", group.by = "cluster_ext_type") + theme_bw()
p1+p2

也可以提取数据用ggplot2的函数来画

umap <- as.data.frame(sce.all@reductions[["umap"]]@cell.embeddings)
pd <- as.data.frame(pData(cds))
identical(rownames(umap),rownames(pd))

## [1] TRUE

data.umap <-cbind(umap,pd)

p1 <- ggplot(data.umap, aes(x = umap_1, y = umap_2, color = cell_type)) + 
  geom_point() + 
  scale_color_brewer(palette = "Set1")+
  theme_bw() + 
  labs(title = "Garnett Cell Type Allocation")

p2 <- ggplot(data.umap, aes(x = umap_1, y = umap_2, color = cluster_ext_type)) + 
  geom_point() + 
  scale_color_brewer(palette = "Set1")+
  theme_bw() + 
  labs(title = "Garnett Cluster Extended Type Allocation")

p1 + p2

通过上述步骤，我们完成了从数据加载到细胞类型注释的整个分析流程。这个流程可以帮助我们理解每个细胞亚群的生物学特性。