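TCGAbiolinks is an all-in-one R package for analyzing TCGA data: it covers the whole workflow of downloading, analyzing, and visualizing TCGA data. This series of notes follows the TCGAbiolinks documentation to relearn the TCGA data-mining workflow.
- Official documentation: https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
- Paper: TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data https://pubmed.ncbi.nlm.nih.gov/26704973/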
I. Finding the TCGA data of interest
GDCquery()
GDCquery(
project,
data.category,
data.type,
workflow.type,
legacy = FALSE,
access,
platform,
file.type,
barcode,
data.format,
experimental.strategy,
sample.type
)
1. Configurable parameters
1.1 By tumor type
- project: one or more TCGA project names of interest.
- As the code below shows, there are 33 TCGA cancer types in total.
projects = TCGAbiolinks:::getGDCprojects()$project_id
TCGAs = grep("TCGA", projects, value = T)
sort(TCGAs)
# [1] "TCGA-ACC" "TCGA-BLCA" "TCGA-BRCA" "TCGA-CESC" "TCGA-CHOL" "TCGA-COAD"
# [7] "TCGA-DLBC" "TCGA-ESCA" "TCGA-GBM" "TCGA-HNSC" "TCGA-KICH" "TCGA-KIRC"
# [13] "TCGA-KIRP" "TCGA-LAML" "TCGA-LGG" "TCGA-LIHC" "TCGA-LUAD" "TCGA-LUSC"
# [19] "TCGA-MESO" "TCGA-OV" "TCGA-PAAD" "TCGA-PCPG" "TCGA-PRAD" "TCGA-READ"
# [25] "TCGA-SARC" "TCGA-SKCM" "TCGA-STAD" "TCGA-TGCT" "TCGA-THCA" "TCGA-THYM"
# [31] "TCGA-UCEC" "TCGA-UCS" "TCGA-UVM"
Study Abbreviation | Study Name | Chinese Name |
---|---|---|
ACC | Adrenocortical carcinoma | 肾上腺皮质癌 |
BLCA | Bladder Urothelial Carcinoma | 膀胱尿路上皮癌 |
BRCA | Breast invasive carcinoma | 浸润性乳腺癌 |
CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 宫颈鳞状细胞癌和宫颈内腺癌 |
CHOL | Cholangiocarcinoma | 胆管癌 |
COAD | Colon adenocarcinoma | 结肠腺癌 |
DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 淋巴样肿瘤弥漫大B细胞淋巴瘤 |
ESCA | Esophageal carcinoma | 食管癌 |
GBM | Glioblastoma multiforme | 多形性成胶质细胞瘤 |
HNSC | Head and Neck squamous cell carcinoma | 头颈部鳞状细胞癌 |
KICH | Kidney Chromophobe | 肾嫌色细胞癌 |
KIRC | Kidney renal clear cell carcinoma | 肾透明细胞癌 |
KIRP | Kidney renal papillary cell carcinoma | 肾乳头状细胞癌 |
LAML | Acute Myeloid Leukemia | 急性髓系白血病 |
LGG | Brain Lower Grade Glioma | 脑低级别胶质瘤 |
LIHC | Liver hepatocellular carcinoma | 肝脏肝细胞癌 |
LUAD | Lung adenocarcinoma | 肺腺癌 |
LUSC | Lung squamous cell carcinoma | 肺鳞癌 |
MESO | Mesothelioma | 间皮瘤 |
OV | Ovarian serous cystadenocarcinoma | 卵巢浆液性囊腺癌 |
PAAD | Pancreatic adenocarcinoma | 胰腺腺癌 |
PCPG | Pheochromocytoma and Paraganglioma | 嗜铬细胞瘤和副神经节瘤 |
PRAD | Prostate adenocarcinoma | 前列腺腺癌 |
READ | Rectum adenocarcinoma | 直肠腺癌 |
SARC | Sarcoma | 肉瘤 |
SKCM | Skin Cutaneous Melanoma | 皮肤黑色素瘤 |
STAD | Stomach adenocarcinoma | 胃腺癌 |
TGCT | Testicular Germ Cell Tumors | 睾丸生殖细胞肿瘤 |
THCA | Thyroid carcinoma | 甲状腺癌 |
THYM | Thymoma | 胸腺瘤 |
UCEC | Uterine Corpus Endometrial Carcinoma | 子宫内膜癌 |
UCS | Uterine Carcinosarcoma | 子宫癌肉瘤 |
UVM | Uveal Melanoma | 葡萄膜黑色素瘤 |
1.2 hg19/hg38
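- Depending on the reference genome, there are two sets of data: the GDC Legacy Archive (mainly GRCh37/hg19) and the GDC harmonized database (GRCh38/hg38).
- The legacy parameter chooses between them: the default, FALSE, queries the harmonized database (hg38); TRUE queries the Legacy Archive (hg19).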
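1.3 Data types to download
On top of the parameters above, the following parameters specify the target data type.
- data.category = : which category of data to download, e.g. omics data or clinical data.
# List the data categories available for one tumor type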
TCGAbiolinks:::getProjectSummary("TCGA-BRCA")$data_categories
# file_count case_count data_category
# 1 4679 1098 Sequencing Reads
# 2 1183 1098 Clinical
# 3 6627 1098 Copy Number Variation
# 4 5315 1098 Biospecimen
# 5 1234 1095 DNA Methylation
# 6 6080 1097 Transcriptome Profiling
# 7 8648 1044 Simple Nucleotide Variation
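- data.type = : a finer-grained choice of data type (optional)
- workflow.type = : the same sequencing data may have been processed by different pipelines (optional, harmonized database only)
- platform = : the sequencing platform (optional)
- file.type = : the specific kind of data file (optional, Legacy Archive only)
If you don't know these values for your target data, refer to the overview tables below.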
GDC harmonized database
Data.category | Data.type | Workflow.Type | Platform |
---|---|---|---|
Transcriptome Profiling | Gene Expression Quantification | HTSeq - Counts | |
Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM | |
Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM-UQ | |
Transcriptome Profiling | Gene Expression Quantification | STAR - Counts | |
Transcriptome Profiling | Isoform Expression Quantification | - | |
Transcriptome Profiling | miRNA Expression Quantification | - | |
Transcriptome Profiling | Splice Junction Quantification | ||
Copy number variation | Copy Number Segment | ||
Copy number variation | Masked Copy Number Segment | ||
Copy number variation | Gene Level Copy Number Scores | ||
Simple Nucleotide Variation | Masked Somatic Mutation | MuSE Variant Aggregation and Masking | |
Simple Nucleotide Variation | Masked Somatic Mutation | MuTect2 Variant Aggregation and Masking | |
Simple Nucleotide Variation | Masked Somatic Mutation | SomaticSniper Variant Aggregation and Masking | |
Simple Nucleotide Variation | Masked Somatic Mutation | VarScan2 Variant Aggregation and Masking | |
Raw Sequencing Data | - | ||
Biospecimen | Slide Image | ||
Biospecimen | Biospecimen Supplement | ||
Clinical | - | ||
DNA Methylation | Methylation Beta Value | Illumina Human Methylation 450 | |
DNA Methylation | Methylation Beta Value | Illumina Human Methylation 27 |
GDC Legacy Archive
Data.category | Data.type | Platform | file.type |
---|---|---|---|
Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg18.seg |
Copy number variation | - | Affymetrix SNP Array 6.0 | hg18.seg |
Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg19.seg |
Copy number variation | - | Affymetrix SNP Array 6.0 | hg19.seg |
Copy number variation | - | Illumina HiSeq | - |
Simple nucleotide variation | Simple somatic mutation | ||
Raw sequencing data | |||
Biospecimen | |||
Clinical | |||
Protein expression | MDA RPPA Core | - | |
Gene expression | Gene expression quantification | Illumina HiSeq | normalized_results |
Gene expression | Gene expression quantification | Illumina HiSeq | results |
Gene expression | Gene expression quantification | HT_HG-U133A | - |
Gene expression | Gene expression quantification | AgilentG4502A_07_2 | - |
Gene expression | Gene expression quantification | AgilentG4502A_07_1 | - |
Gene expression | Gene expression quantification | HuEx-1_0-st-v2 | FIRMA.txt |
Gene expression | Gene expression quantification | gene.txt | |
Gene expression | Isoform expression quantification | - | - |
Gene expression | miRNA gene quantification | - | hg19.mirna |
Gene expression | miRNA gene quantification | hg19.mirbase20 | |
Gene expression | miRNA gene quantification | mirna | |
Gene expression | Exon junction quantification | - | - |
Gene expression | Exon quantification | - | - |
Gene expression | miRNA isoform quantification | - | hg19.isoform |
Gene expression | miRNA isoform quantification | - | isoform |
DNA methylation | Illumina Human Methylation 450 | Not used | |
DNA methylation | Illumina Human Methylation 27 | Not used | |
DNA methylation | Illumina DNA Methylation OMA003 CPI | Not used | |
DNA methylation | Illumina DNA Methylation OMA002 CPI | Not used | |
DNA methylation | Illumina Hi Seq | ||
DNA methylation | Bisulfite sequence alignment | ||
DNA methylation | Methylation percentage | ||
DNA methylation | Aligned reads | ||
Raw microarray data | Raw intensities | Illumina Human Methylation 450 | idat |
Raw Microarray Data | Raw intensities | Illumina Human Methylation 27 | idat |
Structural Rearrangement | |||
Other |
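1.4 Sample barcodes
A full barcode looks like TCGA-G4-6317-02A-11D-2064-05. It encodes everything about a sample, from the patient it came from through to how it was processed and analyzed. The most important parts are Participant, Sample, and Portion, which give the patient ID, the sample type, and (via the analyte letter that follows the portion number, e.g. D for DNA or R for RNA) the kind of molecule that was assayed.
Patient ID: e.g. TCGA-G4-6317
Sample ID: e.g. TCGA-G4-6317-02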
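- The two-digit code in the Sample field identifies the sample type and is what later differential analysis groups on. For example, 01 denotes a primary tumor from the patient, and 11 denotes normal tissue from the patient.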
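The barcode fields described above can be pulled apart with a few lines of base R. This is only a sketch: the helper name parse_barcode and the small lookup table are illustrative assumptions and not part of TCGAbiolinks, and only three of the sample type codes are listed.

```r
# Hypothetical helper: split a full TCGA barcode into its main fields.
parse_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  list(
    patient     = paste(parts[1:3], collapse = "-"),  # project-TSS-participant
    sample      = paste(c(parts[1:3], substr(parts[4], 1, 2)), collapse = "-"),
    sample_code = substr(parts[4], 1, 2),             # two-digit sample type code
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3)              # e.g. D = DNA, R = RNA
  )
}

# Partial lookup of sample type codes (only a few common ones are listed).
sample_type_codes <- c("01" = "Primary Solid Tumor",
                       "02" = "Recurrent Solid Tumor",
                       "11" = "Solid Tissue Normal")

info <- parse_barcode("TCGA-G4-6317-02A-11D-2064-05")
info$patient                           # "TCGA-G4-6317"
info$sample                            # "TCGA-G4-6317-02"
sample_type_codes[[info$sample_code]]  # "Recurrent Solid Tumor"
```

Working from the sample code rather than from free-text labels keeps the grouping logic independent of how a particular table spells the sample type.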
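With this in mind, the sample.type = parameter can restrict the download to the sample types of interest, e.g. sample.type = "Primary Tumor".
Given a set of TCGA barcodes, TCGAquery_SampleTypes() extracts the samples that belong to a target group, and TCGAquery_MatchedCoupledSampleTypes() extracts paired samples that come from the same patient.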
query <- GDCquery(project = c("TCGA-BRCA"),
legacy = FALSE, #default(GDC harmonized database)
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 1222 29
query_info = getResults(query)
TP = TCGAquery_SampleTypes(query_info$sample.submitter_id,"TP")
NT = TCGAquery_SampleTypes(query_info$sample.submitter_id,"NT")
query <- GDCquery(project = c("TCGA-BRCA"),
legacy = FALSE, #default(GDC harmonized database)
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(TP, NT))
dim(getResults(query))
#[1] 1215 29
Pair_sample = TCGAquery_MatchedCoupledSampleTypes(query_info$sample.submitter_id,c("NT","TP"))
query <- GDCquery(project = c("TCGA-BRCA"),
legacy = FALSE, #default(GDC harmonized database)
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = Pair_sample)
dim(getResults(query))
#[1] 229 29
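To make the idea of matched samples concrete, the pairing can be sketched in base R on a handful of sample IDs. This only illustrates the concept (keep patients that have both an 01 and an 11 sample); it is not how TCGAquery_MatchedCoupledSampleTypes() is implemented, and the barcodes below are invented.

```r
# Invented sample-level IDs for illustration (patient + two-digit code + vial).
samples <- c("TCGA-AA-0001-01A", "TCGA-AA-0001-11A",
             "TCGA-AA-0002-01A",
             "TCGA-AA-0003-01A", "TCGA-AA-0003-11B",
             "TCGA-AA-0004-06A")

patient <- substr(samples, 1, 12)   # first 12 characters = patient ID
code    <- substr(samples, 14, 15)  # two-digit sample type code

tumor_patients  <- patient[code == "01"]
normal_patients <- patient[code == "11"]
paired_patients <- intersect(tumor_patients, normal_patients)

# Keep only tumor/normal samples from patients that have both.
paired_samples <- samples[patient %in% paired_patients & code %in% c("01", "11")]
paired_samples
```

Patients with only a tumor sample, or with a sample type outside the requested pair, drop out, which is why the paired query returns far fewer rows than the full one.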
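Those are the most common criteria for querying TCGA data. A few parameters were not covered here; see the function's help page. Combine the parameters above according to your own goal.
2. Query examples
2.1 Cholangiocarcinoma transcriptome data | hg19 | all samples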
TCGAbiolinks:::getProjectSummary("TCGA-CHOL",legacy = TRUE)$data_categories
# file_count case_count data_category
# 1 30 30 Protein expression
# 2 680 36 Copy number variation
# 3 51 51 Biospecimen
# 4 444 36 Simple nucleotide variation
# 5 450 36 Gene expression
# 6 686 36 Raw microarray data
# 7 45 36 DNA methylation
# 8 193 51 Clinical
# 9 365 51 Raw sequencing data
query <- GDCquery(project = "TCGA-CHOL",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results")
dim(getResults(query))
#[1] 45 32
t(getResults(query)[1,])
# 1
# id "34216957-50e3-434c-8c38-72f0f2ddcf16"
# data_format "TXT"
# access "open"
# cases "TCGA-3X-AAV9-01A-72R-A41I-07"
# file_name "unc.edu.59012a78-0e8f-4b99-af97-0dbb1d3d0513.2538862.rsem.genes.normalized_results"
# submitter_id NA
# data_category "Gene expression"
# type "file"
# file_size 437196
# platform "Illumina HiSeq"
# state_comment NA
# tags character,3
# updated_datetime "2017-03-05T10:11:44.298823-06:00"
# md5sum "23836c9f9bdb053c567d91a67b62159d"
# file_id "34216957-50e3-434c-8c38-72f0f2ddcf16"
# data_type "Gene expression quantification"
# state "live"
# experimental_strategy "RNA-Seq"
# file_state "submitted"
# version "1"
# data_release "0.0 - 29.0"
# project "TCGA-CHOL"
# center_id "ee7a85b3-8177-5d60-a10c-51180eb9009c"
# center_center_type "CGCC"
# center_code "07"
# center_name "University of North Carolina"
# center_namespace "unc.edu"
# center_short_name "UNC"
# sample_type "Primary Tumor"
# is_ffpe FALSE
# cases.submitter_id "TCGA-3X-AAV9"
# sample.submitter_id "TCGA-3X-AAV9-01A"
2.2 Lung adenocarcinoma transcriptome data | hg38 | primary tumor + normal tissue
TCGAbiolinks:::getProjectSummary("TCGA-LUAD",legacy = FALSE)$data_categories
# 4 2916 519 Transcriptome Profiling
query <- GDCquery(project = "TCGA-LUAD",
legacy = FALSE,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
dim(getResults(query))
#[1] 594 29
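The query above does not itself filter by sample type. One way to restrict it to primary tumor and normal tissue is the sample.type argument described in section 1.4. The wrapper below is a sketch: the function name query_tumor_normal is hypothetical, the workflow name is copied from the query above and may differ between GDC releases, and calling it needs TCGAbiolinks plus network access.

```r
# Hypothetical wrapper around GDCquery() that requests only primary tumor
# and solid normal samples for one project.
query_tumor_normal <- function(project, workflow = "HTSeq - Counts") {
  TCGAbiolinks::GDCquery(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = workflow,
    sample.type   = c("Primary Tumor", "Solid Tissue Normal")
  )
}

# Example call (not run here; requires TCGAbiolinks and an internet connection):
# query <- query_tumor_normal("TCGA-LUAD")
# table(getResults(query)$sample_type)
```

Filtering at query time avoids downloading files for sample types that would be discarded later anyway.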
2.3 Breast cancer methylation data | hg19 | Illumina Human Methylation 450 platform
TCGAbiolinks:::getProjectSummary("TCGA-BRCA",legacy = TRUE)$data_categories
#7 1250 1097 DNA methylation
query <- GDCquery(project = "TCGA-BRCA",
legacy = TRUE,
data.category = "DNA methylation",
platform = "Illumina Human Methylation 450")
dim(getResults(query))
#[1] 895 32
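II. Downloading the data for the chosen query
- GDCdownload() is simple to use: pass it the query object from the previous step.
- Two download methods are available, api and client. The former is faster but sometimes unstable; the latter is slower. The api method (the default) is recommended. For large downloads, set files.per.chunk = n to download in batches of n files, so that an error midway doesn't waste the whole download.
- directory: the folder to download into; by default a GDCdata folder is created and used.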
GDCdownload(
query,
token.file,
method = "api",
directory = "GDCdata",
files.per.chunk = NULL
)
- Example
query <- GDCquery(project = "TCGA-CHOL",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
# Downloading data for project TCGA-CHOL
# GDCdownload will download 45 files. A total of 19.580796 MB
# Downloading chunk 1 of 5 (10 files, size = 4.351703 MB) as Wed_Aug_18_21_52_08_2021_0.tar.gz
# Downloading: 1.9 MB Downloading chunk 2 of 5 (10 files, size = 4.350318 MB) as Wed_Aug_18_21_52_08_2021_1.tar.gz
# Downloading: 1.8 MB Downloading chunk 3 of 5 (10 files, size = 4.351067 MB) as Wed_Aug_18_21_52_08_2021_2.tar.gz
# Downloading: 1.8 MB Downloading chunk 4 of 5 (10 files, size = 4.353528 MB) as Wed_Aug_18_21_52_08_2021_3.tar.gz
# Downloading: 1.9 MB Downloading chunk 5 of 5 (5 files, size = 2.17418 MB) as Wed_Aug_18_21_52_08_2021_4.tar.gz
# Downloading: 900 kB
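The log above shows 45 files split into 5 chunks when files.per.chunk = 10. The arithmetic behind that split can be sketched in base R; this reproduces the chunk counts only and makes no claim about how GDCdownload() assigns files internally.

```r
n_files         <- 45  # number of files reported by GDCdownload() above
files_per_chunk <- 10  # value passed as files.per.chunk

# Assign each file index to a chunk, then count the files in each chunk.
chunk_id    <- ceiling(seq_len(n_files) / files_per_chunk)
chunk_sizes <- as.integer(table(chunk_id))

length(chunk_sizes)  # 5 chunks
chunk_sizes          # 10 10 10 10 5
```

A smaller chunk size means more requests but less to repeat if one of them fails.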
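III. Loading the downloaded files into the current session
- GDCprepare() takes the query object and the directory the data was downloaded to (GDCdata by default), reads the data, and returns it as a SummarizedExperiment.
- Setting save = TRUE together with save.filename = ... additionally saves the SummarizedExperiment object as an Rdata file once the data has been read, so it can be reloaded later (save defaults to FALSE).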
query <- GDCquery(project = "TCGA-CHOL",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results")
GDCdownload(query, files.per.chunk = 10)
data <- GDCprepare(query, save = T, save.filename = "CHOL_RNAseq.rda")
# -------------------
# oo Reading 45 files
# -------------------
# |=================================================|100% Completed after 0 s
# -------------------
# oo Merging 45 files
# -------------------
# Starting to add information to samples
# => Add clinical information to samples
# => Adding TCGA molecular information from marker papers
# => Information will have prefix 'paper_'
# chol subtype information from:doi:10.1016/j.celrep.2017.02.033
# => Saving file: CHOL_RNAseq.rda
# => File saved
- While reading the data, GDCprepare() automatically annotates sample information and gene information. This isn't yet supported for every data type.
library(SummarizedExperiment)
# Expression matrix
dim(assay(data))
#[1] 19947 45
assays(data)
# List of length 1
# names(1): normalized_count
assay(data, "normalized_count")[1:4,1:4]
# TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R-A41I-07
# A1BG 70.9581 29.9768 108409.2249 1485.0630
# A2M 23986.2548 8129.6961 98095.2358 7119.1570
# NAT1 72.4007 52.8682 160.2275 76.5504
# NAT2 8.7099 0.0000 1472.3868 23.2558
# Sample (clinical) information
dim(colData(data))
#[1] 45 205
colData(data)[1:4,1:4]
# DataFrame with 4 rows and 4 columns
# barcode patient sample shortLetterCode
# <character> <character> <character> <character>
# TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAV9-01A-72R.. TCGA-3X-AAV9 TCGA-3X-AAV9-01A TP
# TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-3X-AAVC-01A-21R.. TCGA-3X-AAVC TCGA-3X-AAVC-01A TP
# TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-W5-AA2R-11A-11R.. TCGA-W5-AA2R TCGA-W5-AA2R-11A NT
# TCGA-ZH-A8Y4-01A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R.. TCGA-ZH-A8Y4 TCGA-ZH-A8Y4-01A TP
# Gene ID types
dim(rowData(data))
#[1] 19947 3
rowData(data)[1:6,1:3]
# DataFrame with 6 rows and 3 columns
# gene_id entrezgene ensembl_gene_id
# <character> <integer> <character>
# A1BG A1BG 1 ENSG00000121410
# A2M A2M 2 ENSG00000175899
# NAT1 NAT1 9 ENSG00000171428
# NAT2 NAT2 10 ENSG00000156006
# RP11-986E7.7 RP11-986E7.7 12 ENSG00000273259
# AADAC AADAC 13 ENSG00000114771
# Genomic coordinates of the genes
rowRanges(data)
# GRanges object with 19947 ranges and 3 metadata columns:
# seqnames ranges strand | gene_id entrezgene ensembl_gene_id
# <Rle> <IRanges> <Rle> | <character> <integer> <character>
# A1BG chr19 58856544-58864865 - | A1BG 1 ENSG00000121410
# A2M chr12 9220260-9268825 - | A2M 2 ENSG00000175899
# NAT1 chr8 18027986-18081198 + | NAT1 9 ENSG00000171428
# NAT2 chr8 18248755-18258728 + | NAT2 10 ENSG00000156006
# RP11-986E7.7 chr14 95058395-95090983 + | RP11-986E7.7 12 ENSG00000273259
# ... ... ... ... . ... ... ...
# RASAL2-AS1 chr1 178060643-178063119 - | RASAL2-AS1 100302401 ENSG00000224687
# LINC00882 chr3 106555658-106959488 - | LINC00882 100302640 ENSG00000242759
# FTX chrX 73183790-73513409 - | FTX 100302692 ENSG00000230590
# TICAM2 chr5 114914339-114961876 - | TICAM2 100302736 ENSG00000243414
# SLC25A5-AS1 chrX 118599997-118603061 - | SLC25A5-AS1 100303728 ENSG00000224281
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
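As a first step toward analysis, samples are usually split by the shortLetterCode column shown above. The sketch below uses a plain matrix and vector as stand-ins, filled with the values printed above, so it runs without downloading anything. On the real object the equivalent subset would be along the lines of data[, data$shortLetterCode == "TP"].

```r
# Stand-in for assay(data, "normalized_count")[1:4, 1:4] printed above.
expr <- matrix(
  c(70.9581,    29.9768,   108409.2249, 1485.0630,
    23986.2548, 8129.6961, 98095.2358,  7119.1570,
    72.4007,    52.8682,   160.2275,    76.5504,
    8.7099,     0.0000,    1472.3868,   23.2558),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("A1BG", "A2M", "NAT1", "NAT2"),
    c("TCGA-3X-AAV9-01A-72R-A41I-07", "TCGA-3X-AAVC-01A-21R-A41I-07",
      "TCGA-W5-AA2R-11A-11R-A41I-07", "TCGA-ZH-A8Y4-01A-11R-A41I-07")))

# Stand-in for colData(data)$shortLetterCode for the same four samples.
short_code <- c("TP", "TP", "NT", "TP")

tumor  <- expr[, short_code == "TP", drop = FALSE]
normal <- expr[, short_code == "NT", drop = FALSE]
dim(tumor)   # 4 genes x 3 tumor samples
dim(normal)  # 4 genes x 1 normal sample

# log2 transform with a pseudocount, a common step before plotting.
log_expr <- log2(expr + 1)
```

Using drop = FALSE keeps the result a matrix even when only one sample falls into a group, as with the single normal sample here.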
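That is the full workflow of finding, downloading, and reading the data; the analysis itself can start from here.
Supplement: patient clinical data and tumor subtypes
1. Getting patient clinical data
- As shown above, GDCprepare() automatically annotates the samples with clinical information.
- The clinical data for each patient can also be downloaded separately beforehand, for reference.
Method 1: GDCquery() pipeline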
query <- GDCquery(project = "TCGA-BRCA",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query, files.per.chunk = 20)
clinical.BCRtab.all <- GDCprepare(query)
grep("clinical_", names(clinical.BCRtab.all), value = T)
# [1] "clinical_drug_brca" "clinical_omf_v4.0_brca"
# [3] "clinical_follow_up_v4.0_brca" "clinical_follow_up_v1.5_brca"
# [5] "clinical_follow_up_v4.0_nte_brca" "clinical_patient_brca"
# [7] "clinical_radiation_brca" "clinical_nte_brca"
# [9] "clinical_follow_up_v2.1_brca"
clinical_patient_brca = as.data.frame(clinical.BCRtab.all$clinical_patient_brca)
clinical_patient_brca[1:4,1:4]
# bcr_patient_uuid bcr_patient_barcode form_completion_date prospective_collection
# 1 bcr_patient_uuid bcr_patient_barcode form_completion_date tissue_prospective_collection_indicator
# 2 CDE_ID: CDE_ID:2003301 CDE_ID: CDE_ID:3088492
# 3 6E7D5EC6-A469-467C-B748-237353C23416 TCGA-3C-AAAU 2014-1-13 NO
# 4 55262FCB-1B01-4480-B322-36570430C917 TCGA-3C-AALI 2014-7-28 NO
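In the output above, the first two rows are not patients: row 1 repeats the column names and row 2 holds CDE identifiers. A small sketch of dropping them, using a stand-in data frame built from the values printed above (the object names clin and clin_clean are illustrative):

```r
# Stand-in for the first rows and columns of clinical_patient_brca printed above.
clin <- data.frame(
  bcr_patient_uuid = c("bcr_patient_uuid", "CDE_ID:",
                       "6E7D5EC6-A469-467C-B748-237353C23416",
                       "55262FCB-1B01-4480-B322-36570430C917"),
  bcr_patient_barcode = c("bcr_patient_barcode", "CDE_ID:2003301",
                          "TCGA-3C-AAAU", "TCGA-3C-AALI"),
  form_completion_date = c("form_completion_date", "CDE_ID:",
                           "2014-1-13", "2014-7-28"),
  stringsAsFactors = FALSE
)

# Keep only rows whose barcode column holds an actual TCGA patient barcode.
is_patient <- grepl("^TCGA-", clin$bcr_patient_barcode)
clin_clean <- clin[is_patient, ]
rownames(clin_clean) <- NULL
clin_clean
```

Filtering on the barcode pattern rather than on fixed row positions still works if a table happens to carry a different number of header rows.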
Method 2: GDCquery_clinic()
- According to the official description, this function downloads the indexed clinical data: "a refined clinical data that is created using the XML files" (compare Method 1).
- This method downloads faster and is the recommended first choice. If it lacks the information you need, fall back to Method 1.
clinical <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
clinical[1:4,1:4]
# submitter_id synchronous_malignancy ajcc_pathologic_stage tumor_stage
# 1 TCGA-E2-A14U No Stage I stage i
# 2 TCGA-E9-A1RC No Stage IIIC stage iiic
# 3 TCGA-D8-A1J9 No Stage IA stage ia
# 4 TCGA-E2-A14P No Stage IIIC stage iiic
2. Getting patient tumor subtypes
- PanCancerAtlas_subtypes(): the column "Subtype_Selected" was selected as the most prominent subtype classification (from the other columns).
subtypes <- PanCancerAtlas_subtypes()
dim(subtypes)
#[1] 7734 10
table(subtypes$cancer.type)
# ACC AML BLCA BRCA COAD ESCA GBM HNSC KICH KIRC KIRP LGG LIHC LUAD LUSC OVCA PCPG
# 91 187 129 1218 341 169 606 279 66 442 161 516 196 230 178 489 178
# PRAD READ SKCM STAD THCA UCEC UCS
# 333 118 333 383 496 538 57
head(as.data.frame(subtypes))
# pan.samplesID cancer.type Subtype_mRNA Subtype_DNAmeth Subtype_protein Subtype_miRNA Subtype_CNA Subtype_Integrative Subtype_other Subtype_Selected
# 1 TCGA-OR-A5J1 ACC steroid-phenotype-high+proliferation CIMP-high NA miRNA_1 Quiet COC3 C1A ACC.CIMP-high
# 2 TCGA-OR-A5J2 ACC steroid-phenotype-high+proliferation CIMP-low 1 miRNA_1 Noisy COC3 C1A ACC.CIMP-low
# 3 TCGA-OR-A5J3 ACC steroid-phenotype-high CIMP-intermediate 3 miRNA_6 Chromosomal COC2 C1A ACC.CIMP-intermediate
# 4 TCGA-OR-A5J4 ACC <NA> CIMP-high NA miRNA_6 Chromosomal <NA> <NA> ACC.CIMP-high
# 5 TCGA-OR-A5J5 ACC steroid-phenotype-high CIMP-intermediate NA miRNA_2 Chromosomal COC2 C1A ACC.CIMP-intermediate
# 6 TCGA-OR-A5J6 ACC steroid-phenotype-low CIMP-low 2 miRNA_1 Noisy COC1 C1B ACC.CIMP-low
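To attach these subtypes to expression samples, the usual approach is to match on the patient portion of the barcode. The sketch below assumes pan.samplesID holds 12-character patient IDs, as in the ACC rows printed above; other cancer types may use longer IDs, so check before matching. The toy table copies three rows from the output above, and the full barcodes are invented.

```r
# Stand-in for two columns of the PanCancerAtlas_subtypes() output above.
subtypes_toy <- data.frame(
  pan.samplesID    = c("TCGA-OR-A5J1", "TCGA-OR-A5J2", "TCGA-OR-A5J3"),
  Subtype_Selected = c("ACC.CIMP-high", "ACC.CIMP-low", "ACC.CIMP-intermediate"),
  stringsAsFactors = FALSE
)

# Invented full barcodes; only the first 12 characters (patient ID) matter here.
barcodes <- c("TCGA-OR-A5J2-01A-11R-A29S-07",
              "TCGA-OR-A5J1-01A-11R-A29S-07",
              "TCGA-OR-A5ZZ-01A-11R-A29S-07")

patient_id <- substr(barcodes, 1, 12)
idx <- match(patient_id, subtypes_toy$pan.samplesID)

annot <- data.frame(barcode = barcodes,
                    subtype = subtypes_toy$Subtype_Selected[idx],
                    stringsAsFactors = FALSE)
annot  # the third barcode has no match, so its subtype is NA
```

match() keeps the result in the order of the expression samples, which makes it safe to bind the subtype column straight onto the sample annotation.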
- TCGAquery_subtype(): these subtypes will be automatically added to the SummarizedExperiment object through GDCprepare, but you can also use the TCGAquery_subtype function to retrieve this information.
brca.subtype <- TCGAquery_subtype(tumor = "brca")
t(brca.subtype[1,])
# [,1]
# patient "TCGA-3C-AAAU"
# Tumor.Type "BRCA"
# Included_in_previous_marker_papers "NO"
# vital_status "Alive"
# days_to_birth "-20211"
# days_to_death "NA"
# days_to_last_followup "4047"
# age_at_initial_pathologic_diagnosis "55"
# pathologic_stage "NA"
# Tumor_Grade "NA"
# BRCA_Pathology "NA"
# BRCA_Subtype_PAM50 "LumA"
# MSI_status "NA"
# HPV_Status "NA"
# tobacco_smoking_history "NA"
# CNV Clusters "C6"
# Mutation Clusters "C7"
# DNA.Methylation Clusters "C1"
# mRNA Clusters "C1"
# miRNA Clusters "C3"
# lncRNA Clusters "NA"
# Protein Clusters "NA"
# PARADIGM Clusters "C5"
# Pan-Gyn Clusters "NA"
GDCquery_Maf() can download mutation data. It isn't covered here and may be revisited in a later note.