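There are several ways to download data from the TCGA database. The common ones are:
1. Download directly from the TCGA website (workable for small datasets, impractical for large ones)
2. Use gdc-client, the official TCGA download tool
3. Use the TCGAbiolinks package in R
When I first downloaded clinical data, I used gdc-client. Later, when I tried to download expression data, the connection kept dropping, and after several days the download still had not completed, so I switched to TCGAbiolinks. It is a real gem of a package: the files I had failed to fetch in several days were downloaded within minutes.
The rest of this post uses TCGAbiolinks to download the data.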
> library(TCGAbiolinks)
> setwd("D:/breast_cancer/TCGA/Biolinks/expFPKM")
> query <- GDCquery(project = "TCGA-BRCA",
+ legacy = FALSE, # FALSE (the default) queries hg38 data; TRUE queries hg19 data
+ experimental.strategy = "RNA-Seq",
+ data.category = "Transcriptome Profiling",
+ data.type = "Gene Expression Quantification",
+ workflow.type = "HTSeq - FPKM")
# Output like the following means the query succeeded; if you do not see it, run the query again
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-BRCA
--------------------
oo Filtering results
--------------------
ooo By experimental.strategy
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
# Once the query has succeeded, start the download
> GDCdownload(query)
# Output of a successful download
Downloading data for project TCGA-BRCA
GDCdownload will download 1222 files. A total of 635.710654 MB
Downloading as: Fri_Apr_16_11_16_59_2021.tar.gz
Downloading: 640 MB
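Once the download finishes, the files still have to be read into R. A minimal sketch, assuming the `query` object built above, a TCGAbiolinks version that still accepts that query, and an installed SummarizedExperiment package. `load_expression` is a hypothetical helper name used only for illustration; it is not part of TCGAbiolinks.

```r
# Hypothetical helper: read the files fetched by GDCdownload() into a matrix.
# Assumes TCGAbiolinks and SummarizedExperiment are installed.
load_expression <- function(query) {
  # GDCprepare() reads the downloaded files that belong to this query;
  # with its default settings it returns a SummarizedExperiment object.
  se <- TCGAbiolinks::GDCprepare(query)
  # assay() extracts the genes-by-samples expression matrix.
  SummarizedExperiment::assay(se)
}

# Usage, after GDCdownload(query) has finished:
# expr <- load_expression(query)
# expr[1:4, 1:4]
```

Run it from the same working directory that was used for GDCdownload, since that is where the downloaded files are looked up by default.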
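Additional notes:
workflow.type takes one of three values:
HTSeq - FPKM-UQ: FPKM values with upper-quartile normalization
HTSeq - FPKM: FPKM values (normalized expression levels)
HTSeq - Counts: raw read counts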
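To make the difference between the three value types concrete, here is a toy calculation for a single sample. It is a sketch based on the commonly cited formulas: FPKM = count × 10^9 / (gene length × total reads), with FPKM-UQ replacing the total by the sample's 75th percentile count. The counts and gene lengths are made up, and the real pipeline defines its denominators more narrowly (for example, FPKM uses reads mapped to protein-coding genes), which this sketch ignores.

```r
# Toy data for one sample: raw counts and gene lengths in bases (made-up values)
counts <- c(geneA = 100, geneB = 400, geneC = 0, geneD = 1500)
len    <- c(geneA = 1000, geneB = 2000, geneC = 500, geneD = 3000)

# FPKM: scale by gene length and by the sample's total read count
fpkm <- counts * 1e9 / (len * sum(counts))

# FPKM-UQ: scale by the 75th percentile count instead of the total
uq <- unname(quantile(counts, 0.75))
fpkm_uq <- counts * 1e9 / (len * uq)

# Side-by-side view of the three value types
round(cbind(counts, fpkm, fpkm_uq), 1)
```

Raw counts are what count-based differential expression tools expect, while the FPKM variants are already normalized for gene length and sequencing depth.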
Downloading the counts matrix from TCGA-xena (UCSC Xena)
dat = read.table("counts.tsv.gz",check.names = FALSE,row.names = 1,header = TRUE) # check.names = FALSE keeps the column names exactly as written, so the '-' in sample barcodes is not converted to '.'
# Undo the log2(count + 1) transform
dat = as.matrix(2^dat - 1)
dat[1:4,1:4]
as.character(dat[1:100,1:10]) # some values are not exact integers
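# Convert to an integer matrix with apply; round first, because as.integer() truncates (16.9997 would become 16)
exp = apply(dat, 2, function(x) as.integer(round(x)))
exp[1:4,1:4] # the row names are lost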
rownames(exp) = rownames(dat)
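The steps above can be tried on a small self-contained example. This is a sketch under stated assumptions: the gene names, the barcode-like column names, and the values are made up, and the log2(count + 1) values are assumed to be stored rounded to four decimals. It shows why rounding before the integer conversion matters.

```r
# Toy stand-in for counts.tsv.gz (made-up values): log2(count + 1), 4 decimals.
# The underlying counts are g1: 0, 120; g2: 5, 1; g3: 17, 33.
toy <- "gene TCGA-AA-0001 TCGA-AA-0002
g1 0.0000 6.9189
g2 2.5850 1.0000
g3 4.1699 5.0875"

# check.names = FALSE keeps the hyphens in the column names
dat <- read.table(text = toy, check.names = FALSE, row.names = 1, header = TRUE)
colnames(dat)

# Undo the log2(count + 1) transform; values come back slightly off
back <- as.matrix(2^dat - 1)

# Truncating: g3 in the first sample comes back just below 17 and becomes 16
truncated <- apply(back, 2, as.integer)

# Rounding first recovers the original counts
rounded <- apply(back, 2, function(x) as.integer(round(x)))
rownames(rounded) <- rownames(back)  # apply() dropped the row names
rounded
```

For large counts, four-decimal log values may not recover the exact original count even with rounding, so treat the result as an approximation of the raw counts.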