1. matrix.mtx
做过10xGenomics单细胞转录组的人都知道cellranger的标准输出是三个文件(barcodes.tsv genes.tsv matrix.mtx),由这三个文件组成表达矩阵,其实这么做的原因就是为了节省空间以及加快导入速度,matrix是一种十分有效的数据存储方式,一般可以通过Matrix::readMM读取matrix矩阵,一直以来,我以为cellranger输出的matrix.mtx矩阵与其他的矩阵没有区别的,但是最近在做另外的一个软件kallistobustools比对的时候发现有点不对劲,由于其输出结果也是三个矩阵,但是有个前缀,我想着对Seurat的Read10X的函数进行改写一下,增加前缀、后缀的参数就可以了,结果改完了以后,始终报如下的错误:
Error in dimnamesGets(x, value) :
invalid dimnames given for “dgTMatrix” object
开始以为是我的输入文件的问题,我检查了genes.tsv文件,发现kallistobustools只有一列,因此我有添加了一列,结果还发现报错,然后想看看matrix.mtx文件,发现其行数和列数也能对上,怎么就不对呢?这个问题弄了很久,后面实在没法了,就一行一行的运行,后面发现在给矩阵列名的时候有问题。
既然这里报错了,那就看看这两个文件的区别,后我通过dim(data),发现行列数目对不上,我这才去认真的检测两个matrix.mtx的区别。
cell.names <- readLines(barcode.loc)
data <- readMM(file = matrix.loc)
colnames(x = data) <- cell.names
dim(data)
#[1] 1421455 27623
length(cell.names)
#[1] 1421455
通过上面的检查,发现是行和列不一致造成的报错,但是我试了一下cellranger标准的输出结果,结果没有任何报错,那么现在出现报错,那么就只能是matrix.mtx有问题了。下面弄了一个标准的10x的cellranger输出的matrix.mtx文件进行测试:
cell.names_10x <- readLines(barcode.loc)
data_10x <- readMM(file = matrix.loc)
dim(data_10x)
#[1] 28692 2270
length(cell.names_10x)
#[1] 2270
colnames(x = data_10x) <- cell.names_10x
这个就完全没有报错,然后再仔细对比一下cellranger输出的matrix.mtx和kallistobustools输出的matrix.mtx,注意看一下图片:
发现了上面的不同之处不?cellranger输出的matrix.mtx除了%以外的第一行的三个数分别代表:基因、barcode、matrix.mtx矩阵的行数;而kallistobustools输出的matrix.mtx的这三列分别代表:barcode、基因、matrix.mtx矩阵的行数,这里是有巨大的不同,因此这个才会造成前面的报错。其实这个不同二者可以通过转置进行转换,因此如果要利用Read10X进行读入此数据的话,这里需要转置一下。之所以不同,估计是cellranger输出的时候,对数据已经提前转置过,然后后续不要再转置了。
2. 修改后读入的函数
下面是对Seurat::Read10X函数进行改写,支持了三个矩阵的前缀、后缀参数,比如三个文件如下所示:
通过改造后的函数进行读取数据,由于gene.txt文件只有一列,因此设置基因名称为gene.txt的第一列,具体的命令如下:
y<-read10x_ymc(indir,gene.column =1,prefix='gene.',suffix='txt')
read10x_ymc <- function(
data.dir = NULL,
gene.column = 2,
unique.features = TRUE,
strip.suffix = FALSE,
prefix='',
suffix="tsv"
) {
full.data <- list()
for (i in seq_along(along.with = data.dir)) {
run <- data.dir[i]
if (!dir.exists(paths = run)) {
stop("Directory provided does not exist")
}
barcode.loc <- file.path(run, paste(prefix,'barcodes.',suffix,sep=''))
gene.loc <- file.path(run, paste(prefix,'genes.',suffix,sep=''))
features.loc <- file.path(run, paste(prefix,'features.',suffix,'.gz',sep=''))
if( file.exists(file.path(run,paste(prefix,'matrix.mtx',sep='')))){matrix.loc <- file.path(run,prefix,paste(prefix,'matrix.mtx',sep=''))}else{matrix.loc <- file.path(run,paste(prefix,'mtx',sep=''))}
# Flag to indicate if this data is from CellRanger >= 3.0
pre_ver_3 <- file.exists(gene.loc)
if (!pre_ver_3) {
addgz <- function(s) {
return(paste0(s, ".gz"))
}
barcode.loc <- addgz(s = barcode.loc)
matrix.loc <- addgz(s = matrix.loc)
}
if (!file.exists(barcode.loc)) {
stop("Barcode file missing. Expecting ", basename(path = barcode.loc))
}
if (!pre_ver_3 && !file.exists(features.loc) ) {
stop("Gene name or features file missing. Expecting ", basename(path = features.loc))
}
if (!file.exists(matrix.loc)) {
print(matrix.loc)
stop("Expression matrix file missing. Expecting ", basename(path = matrix.loc))
}
data <- readMM(file = matrix.loc)
cell.names <- readLines(barcode.loc)
if(!dim(data)[2]==length(cell.names)){data <- t(readMM(file = matrix.loc))}###这是与cellranger输出结果最大的不同
if (all(grepl(pattern = "\\-1$", x = cell.names)) & strip.suffix) {
cell.names <- as.vector(x = as.character(x = sapply(
X = cell.names,
FUN = ExtractField,
field = 1,
delim = "-"
)))
}
if (is.null(x = names(x = data.dir))) {
if (i < 2) {
colnames(x = data) <- cell.names
} else {
colnames(x = data) <- paste0(i, "_", cell.names)
}
} else {
colnames(x = data) <- paste0(names(x = data.dir)[i], "_", cell.names)
}
feature.names <- read.delim(
file = ifelse(test = pre_ver_3, yes = gene.loc, no = features.loc),
header = FALSE,
stringsAsFactors = FALSE
)
if (any(is.na(x = feature.names[, gene.column]))) {
warning(
'Some features names are NA. Replacing NA names with ID from the opposite column requested',
call. = FALSE,
immediate. = TRUE
)
na.features <- which(x = is.na(x = feature.names[, gene.column]))
replacement.column <- ifelse(test = gene.column == 2, yes = 1, no = 2)
feature.names[na.features, gene.column] <- feature.names[na.features, replacement.column]
}
if (unique.features) {
fcols = ncol(x = feature.names)
if (fcols < gene.column) {
stop(paste0("gene.column was set to ", gene.column,
" but feature.tsv.gz (or genes.tsv) only has ", fcols, " columns.",
" Try setting the gene.column argument to a value <= to ", fcols, "."))
}
rownames(x = data) <- make.unique(names = feature.names[, gene.column])
}
# In cell ranger 3.0, a third column specifying the type of data was added
# and we will return each type of data as a separate matrix
if (ncol(x = feature.names) > 2) {
data_types <- factor(x = feature.names$V3)
lvls <- levels(x = data_types)
if (length(x = lvls) > 1 && length(x = full.data) == 0) {
message("10X data contains more than one type and is being returned as a list containing matrices of each type.")
}
expr_name <- "Gene Expression"
if (expr_name %in% lvls) { # Return Gene Expression first
lvls <- c(expr_name, lvls[-which(x = lvls == expr_name)])
}
data <- lapply(
X = lvls,
FUN = function(l) {
return(data[data_types == l, , drop = FALSE])
}
)
names(x = data) <- lvls
} else{
data <- list(data)
}
full.data[[length(x = full.data) + 1]] <- data
}
# Combine all the data from different directories into one big matrix, note this
# assumes that all data directories essentially have the same features files
list_of_data <- list()
for (j in 1:length(x = full.data[[1]])) {
list_of_data[[j]] <- do.call(cbind, lapply(X = full.data, FUN = `[[`, j))
# Fix for Issue #913
print('yaomengcheng')
list_of_data[[j]] <- as(object = list_of_data[[j]], Class = "dgCMatrix")
}
names(x = list_of_data) <- names(x = full.data[[1]])
# If multiple features, will return a list, otherwise
# a matrix.
if (length(x = list_of_data) == 1) {
return(list_of_data[[1]])
} else {
return(list_of_data)
}
}