资料来源该包手册。R社区已经有不少下载和处理TCGA数据的包,但目前能获取最新GRch38的应该只有TCGAbiolinks和本包,本包也是TCGA开发的官方R包,值得信赖。 比较麻烦的就是文件,样本ID的转换。
GDC是什么?
来自官网Genomic Data Commons (GDC) website的信息:
The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs.
The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared.
As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.
The data model for the GDC is complex, but it worth a quick overview. The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details. The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.
快速开始
这个软件尚在开发之中,希望收到用户的反馈。如果想要报告bug或者问题,要么submit a new issue或者在R中使用bug.report(package='GenomicDataCommons')
。
安装
从bioconductor:
source('https://bioconductor.org/biocLite.R')
biocLite('GenomicDataCommons')
导入:
library(GenomicDataCommons)
查看基本的特性
GenomicDataCommons::status()
## $commit
## [1] "e9e20d6f97f2bf6dd3b3261e36ead57c56a4c7cc"
##
## $data_release
## [1] "Data Release 12.0 - June 13, 2018"
##
## $status
## [1] "OK"
##
## $tag
## [1] "1.14.1"
##
## $version
## [1] 1
If this statement results in an error such as
SSL connect error
, see the troubleshooting section below.
寻找数据
下面的代码构建了一个manifest
用来引导原始数据的下载,使用HTSeq
查找和过滤Ovarian Cancer的基因表达的原始计数。
library(magrittr)
ge_manifest = files() %>%
filter( ~ cases.project.project_id == 'TCGA-OV' &
type == 'gene_expression' &
analysis.workflow_type == 'HTSeq - Counts') %>%
manifest()
下载数据
下面的代码块下载 379 个基因表达数据文件。使用多进程进行下载可以极快地提高下载速度。
destdir = tempdir()
fnames = lapply(ge_manifest$id[1:20],gdcdata)
如果下载的数据包含控制data,下载时需要包括一个token
(有下载权限的标志)。具体请看the authentication section below.
元数据获取
expands = c("diagnoses","annotations",
"demographic","exposures")
clinResults = cases() %>%
GenomicDataCommons::select(NULL) %>%
GenomicDataCommons::expand(expands) %>%
results(size=50)
str(clinResults,list.len=10)
## List of 4
## $ diagnoses :List of 50
## ..$ 3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "metastasis"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Lung, NOS"
## .. ..$ created_datetime : chr "2017-06-19T09:21:58.285024-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Adenocarcinoma, NOS"
## .. ..$ submitter_id : chr "AD7376_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Liver"
## .. .. [list output truncated]
## ..$ 36a29f50-5081-4a45-ab83-b11164e6781a:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "metastasis"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Lung, NOS"
## .. ..$ created_datetime : chr "2017-06-16T16:15:26.111532-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Small cell carcinoma, NOS"
## .. ..$ submitter_id : chr "AD15003_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Lymph node, NOS"
## .. .. [list output truncated]
## ..$ b8ed18f9-2582-4454-a5d8-529370d9da2b:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "Unknown"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Head, face or neck, NOS"
## .. ..$ created_datetime : chr "2017-06-16T16:04:08.462038-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Squamous cell carcinoma, NOS"
## .. ..$ submitter_id : chr "AD4846_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Lung, NOS"
## .. .. [list output truncated]
## ..$ fd9ffb79-7d96-4a14-955e-236011d283b6:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "primary"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Colon, NOS"
## .. ..$ created_datetime : chr "2017-06-19T09:25:18.484242-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Adenocarcinoma, NOS"
## .. ..$ submitter_id : chr "AD8471_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Colon, NOS"
## .. .. [list output truncated]
## ..$ 672fdc0b-0844-4bbf-8b3a-1bff693d17fc:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "primary"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Stomach, NOS"
## .. ..$ created_datetime : chr "2017-06-16T16:04:28.356056-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Adenocarcinoma, NOS"
## .. ..$ submitter_id : chr "AD880_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Stomach, NOS"
## .. .. [list output truncated]
## ..$ 3468b376-b05e-40e7-851b-25bb9251378f:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "Unknown"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Unknown"
## .. ..$ created_datetime : chr "2017-06-16T16:05:07.197842-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Adenocarcinoma, NOS"
## .. ..$ submitter_id : chr "AD103_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Not Reported"
## .. .. [list output truncated]
## ..$ 2d29df1c-a548-4e95-8206-3c473e9ffdca:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "metastasis"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Lung, NOS"
## .. ..$ created_datetime : chr "2017-06-16T15:43:06.585011-05:00"
## .. ..$ updated_datetime : chr "2018-01-29T14:25:29.405142-06:00"
## .. ..$ primary_diagnosis : chr "Adenocarcinoma, NOS"
## .. ..$ submitter_id : chr "AD2816_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Connective, subcutaneous and other soft tissues, NOS"
## .. .. [list output truncated]
## ..$ 5bd74091-bef3-49a1-b617-45167f02baa6:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "Unknown"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Unknown"
## .. ..$ created_datetime : chr "2017-06-16T15:39:59.533998-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Adenocarcinoma, NOS"
## .. ..$ submitter_id : chr "AD10160_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Bone, NOS"
## .. .. [list output truncated]
## ..$ 48758eeb-6d56-41ff-b0f5-e2e5bfa3e63a:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "metastasis"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Anus, NOS"
## .. ..$ created_datetime : chr "2017-06-19T09:33:40.540869-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Squamous cell carcinoma, NOS"
## .. ..$ submitter_id : chr "AD9751_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Brain, NOS"
## .. .. [list output truncated]
## ..$ 2bac2a54-5843-498d-bfeb-6d795ba54665:'data.frame': 1 obs. of 19 variables:
## .. ..$ progression_or_recurrence : chr "not reported"
## .. ..$ classification_of_tumor : chr "primary"
## .. ..$ last_known_disease_status : chr "not reported"
## .. ..$ tumor_grade : chr "not reported"
## .. ..$ tissue_or_organ_of_origin : chr "Kidney, NOS"
## .. ..$ created_datetime : chr "2017-06-16T15:43:06.585011-05:00"
## .. ..$ updated_datetime : chr "2017-10-13T13:38:40.036812-05:00"
## .. ..$ primary_diagnosis : chr "Papillary renal cell carcinoma"
## .. ..$ submitter_id : chr "AD2814_diagnosis"
## .. ..$ site_of_resection_or_biopsy : chr "Kidney, NOS"
## .. .. [list output truncated]
## .. [list output truncated]
## $ case_id : chr [1:50] "3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6" "36a29f50-5081-4a45-ab83-b11164e6781a" "b8ed18f9-2582-4454-a5d8-529370d9da2b" "fd9ffb79-7d96-4a14-955e-236011d283b6" ...
## $ demographic:'data.frame': 50 obs. of 8 variables:
## ..$ updated_datetime: chr [1:50] "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" ...
## ..$ created_datetime: chr [1:50] "2017-06-19T11:44:09.223033-05:00" "2017-06-19T11:42:12.871583-05:00" "2017-06-19T11:33:07.960036-05:00" "2017-06-19T11:45:29.124685-05:00" ...
## ..$ gender : chr [1:50] "male" "male" "female" "female" ...
## ..$ submitter_id : chr [1:50] "AD7376_demographic" "AD15003_demographic" "AD4846_demographic" "AD8471_demographic" ...
## ..$ state : chr [1:50] "submitted" "submitted" "submitted" "submitted" ...
## ..$ race : chr [1:50] "not reported" "not reported" "not reported" "not reported" ...
## ..$ demographic_id : chr [1:50] "39e8ea37-69ca-4541-b26c-4dbaa5cd8b77" "3a456586-6198-44d7-9dfa-fb510796d3ad" "ec055724-878d-4c2f-881c-aae7fcf8e5aa" "6f286488-6b1d-432c-8014-bacf68423fd6" ...
## ..$ ethnicity : chr [1:50] "not reported" "not reported" "not reported" "not reported" ...
## $ id : chr [1:50] "3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6" "36a29f50-5081-4a45-ab83-b11164e6781a" "b8ed18f9-2582-4454-a5d8-529370d9da2b" "fd9ffb79-7d96-4a14-955e-236011d283b6" ...
## - attr(*, "row.names")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"
基本设计
这个包的设计跟dplyr
的"hadleyverse"方法有点类似。大致上,寻找和获取文件、元数据的函数可以分为:
- 基于GDC API终点的简单查询构建器
- 应用,修正过滤,字段选择和分面的动词集合并生成一个新的查询对象
- 执行查询和从GDC返回结果的动词集合
另外还有一些询问GDC API信息与可获取的字段,索引BAM文件,下载实际数据文件的函数。
下面是一个概览[1]:
- Creating a query
projects()
cases()
files()
annotations()
- Manipulating a query
filter()
facet()
select()
- Introspection on the GDC API fields
mapping()
available_fields()
default_fields()
grep_fields()
field_picker()
available_values()
available_expand()
- Executing an API call to retrieve query results
results()
count()
response()
- Raw data file downloads
gdcdata()
transfer()
gdc_client()
- Summarizing and aggregating field values (faceting)
aggregations()
- Authentication
gdc_token()
- BAM file slicing
slicing()
使用
处理NCI GDC是存在两类操作。
- 查询元数据和查找数据文件 (例如,为某类癌症病人查找所有的基因表达定量数据文件)
- 从GDC传输原始或处理过的数据到另一台电脑
这两类操作在下面进行详述。
查询元数据
大量关于病人、文件、项目和所谓注释的元数据都可以通过NCI GDC API获取。通常,我们想要查询元数据,然后进行下载或者执行所谓的聚合操作(和table()
类似的功能)
首先创建一个空查询获取元数据。我们然后经常想要filter
查询,对retrieving results提前进行一些限制。GenomicDataCommons
包有列出条目的帮助函数用来帮助过滤。
创建查询
下面4个函数可以创建GDCQuery
对象用来查询元数据:
projects()
cases()
files()
annotations()
pquery = projects()
pquery
对象现在是一个S3类,GDCQuery
。对象包含下面一些元素:
- 字段:这是一个关于字段的字符串向量,在检索数据时会返回。如果没有指定字段,会使用默认字段 (查看
default_fields()
)。 - filters: 在使用
filter()
方法后会返回结果用于对后续的结果检索进行过滤。 - facets: 一个字段的字符串向量,在聚合数据(
aggreations()
)时使用。 - archive: 要么“default”要么“legacy”。
- token: 来自GDC的token字符串。查看the authentication section获取详情,注意,通常检索原始数据是不需要权限的,有些数据的下载需要。
查看实际的对象(习惯使用str()
!),注意查询不会包含结果。
str(pquery)
## List of 5
## $ fields : chr [1:16] "awg_review" "dbgap_accession_number" "disease_type" "in_review" ...
## $ filters: NULL
## $ facets : NULL
## $ legacy : logi FALSE
## $ expand : NULL
## - attr(*, "class")= chr [1:3] "gdc_projects" "GDCQuery" "list"
检索结果
[ GDC分页文档]
如果构建好了查询对象,下一步是从GDC检索结果。检索结果的最基本类型是满足条件的、可获取记录的简单counts()
。注意我们刚才并未设置如何过滤,所以count()
会返回所有满足标准的项目记录。
pcount = count(pquery)
# 或者
pcount = pquery %>% count()
pcount
## [1] 40
results()
方法会取回实际的结果:
presults = pquery %>% results()
这些从GDC返回的结果都以JSON格式存储,函数自动将它转换为R里面的嵌套列表。
str(presults)
## List of 8
## $ dbgap_accession_number: chr [1:10] "phs001179" "phs000470" NA NA ...
## $ disease_type :List of 10
## ..$ FM-AD : chr [1:23] "Germ Cell Neoplasms" "Acinar Cell Neoplasms" "Miscellaneous Tumors" "Thymic Epithelial Neoplasms" ...
## ..$ TARGET-RT: chr "Rhabdoid Tumor"
## ..$ TCGA-UCS : chr "Uterine Carcinosarcoma"
## ..$ TCGA-LUSC: chr "Lung Squamous Cell Carcinoma"
## ..$ TCGA-BRCA: chr "Breast Invasive Carcinoma"
## ..$ TCGA-SKCM: chr "Skin Cutaneous Melanoma"
## ..$ TARGET-OS: chr "Osteosarcoma"
## ..$ TCGA-THYM: chr "Thymoma"
## ..$ TARGET-WT: chr "High-Risk Wilms Tumor"
## ..$ TCGA-ESCA: chr "Esophageal Carcinoma"
## $ released : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ state : chr [1:10] "open" "open" "open" "open" ...
## $ primary_site :List of 10
## ..$ FM-AD : chr [1:42] "Kidney" "Testis" "Unknown" "Other and unspecified parts of biliary tract" ...
## ..$ TARGET-RT: chr "Kidney"
## ..$ TCGA-UCS : chr "Uterus"
## ..$ TCGA-LUSC: chr "Lung"
## ..$ TCGA-BRCA: chr "Breast"
## ..$ TCGA-SKCM: chr "Skin"
## ..$ TARGET-OS: chr "Bone"
## ..$ TCGA-THYM: chr "Thymus"
## ..$ TARGET-WT: chr "Kidney"
## ..$ TCGA-ESCA: chr "Esophagus"
## $ project_id : chr [1:10] "FM-AD" "TARGET-RT" "TCGA-UCS" "TCGA-LUSC" ...
## $ id : chr [1:10] "FM-AD" "TARGET-RT" "TCGA-UCS" "TCGA-LUSC" ...
## $ name : chr [1:10] "Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)" "Rhabdoid Tumor" "Uterine Carcinosarcoma" "Lung Squamous Cell Carcinoma" ...
## - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
## - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"
默认只返回10条记录,我们可以对results()
添加size
和from
参数改变数目。这里存在一个简单的方法,results_all()
会返回所有可获得的结果。小心使用这个函数,它可以消耗非常长的时间,数据巨大。
length(ids(presults))
## [1] 10
presults = pquery %>% results_all()
length(ids(presults))
## [1] 40
# 包含所有的记录
length(ids(presults)) == count(pquery)
## [1] TRUE
抽取结果的子集或者将结果变成更通用的R数据结构不是很简单,然后可以借助 purrr、 rlist、 和data.tree。
想要交互式浏览结果,使用listviewer包。
字段和值
查询和检索GDC数据的中心是指定想要返回的字段、根据字段与值进行过滤、分面或者聚合。本包包含两个简单地函数,available_fields()
与default_fields()
。每个都可以操作“cases”, “files”, “annotations”, “projects”或者GDCQuery
对象。
default_fields('files')
## [1] "access" "acl"
## [3] "batch_id" "created_datetime"
## [5] "data_category" "data_format"
## [7] "data_type" "error_type"
## [9] "experimental_strategy" "file_autocomplete"
## [11] "file_id" "file_name"
## [13] "file_size" "file_state"
## [15] "imaging_date" "magnification"
## [17] "md5sum" "origin"
## [19] "platform" "read_pair_number"
## [21] "revision" "state"
## [23] "state_comment" "submitter_id"
## [25] "tags" "type"
## [27] "updated_datetime"
# The number of fields available for files endpoint
length(available_fields('files'))
## [1] 703
# The first few fields available for files endpoint
head(available_fields('files'))
## [1] "access" "acl"
## [3] "analysis.analysis_id" "analysis.analysis_type"
## [5] "analysis.batch_id" "analysis.created_datetime"
字段可以通过类似dplyr
包流程的方式指定,select()
函数是重设GDCQuery
对象字段槽的动词;注意这与dplyr限制已有字段不同。
# 这里是默认字段
qcases = cases()
qcases$fields
## [1] "aliquot_ids" "analyte_ids"
## [3] "batch_id" "case_autocomplete"
## [5] "case_id" "created_datetime"
## [7] "days_to_index" "days_to_lost_to_followup"
## [9] "disease_type" "index_date"
## [11] "lost_to_followup" "portion_ids"
## [13] "primary_site" "sample_ids"
## [15] "slide_ids" "state"
## [17] "submitter_aliquot_ids" "submitter_analyte_ids"
## [19] "submitter_id" "submitter_portion_ids"
## [21] "submitter_sample_ids" "submitter_slide_ids"
## [23] "updated_datetime"
# 使用所有的字段
# Note that checking of fields is done by select()
qcases = cases() %>% GenomicDataCommons::select(available_fields('cases'))
head(qcases$fields)
## [1] "case_id" "aliquot_ids"
## [3] "analyte_ids" "annotations.annotation_id"
## [5] "annotations.batch_id" "annotations.case_id"
grep_fields()
与field_picker()
可以操作寻找感兴趣的字段。
分面与聚合
有点类似R的table
方法,GDC API提供了称为聚合或分面的操作,通过指定一个或多个字段,GDC会返回所有可能值的计数。
# 指定文件类型的综述
res = files() %>% facet(c('type','data_type')) %>% aggregations()
res$type
## key doc_count
## 1 simple_somatic_mutation 64015
## 2 annotated_somatic_mutation 63580
## 3 aligned_reads 45985
## 4 copy_number_segment 44752
## 5 gene_expression 34713
## 6 slide_image 30036
## 7 biospecimen_supplement 25151
## 8 mirna_expression 22976
## 9 clinical_supplement 12496
## 10 methylation_beta_value 12359
## 11 aggregated_somatic_mutation 186
## 12 masked_somatic_mutation 132
Filtering
The GenomicDataCommons package 使用了一种非标准的评估形式来指定类似于R的查询,然后翻译为R列表。这个R表达式使用了公式接口,如Hadley Wickham 在vignette on non-standard evaluation中所建议。
It’s best to use a formula because a formula captures both the expression to evaluate and the environment where the evaluation occurs. This is important if the expression is a mixture of variables in a data frame and objects in the local environment [for example].
对于用户来说不需要关注很多底层细节,除了注意过滤表达式必须以~
开始。
qfiles = files()
qfiles %>% count() # all files
## [1] 356381
过滤文件类型为基因表达:
qfiles = files() %>% filter(~ type == 'gene_expression')
# here is what the filter looks like after translation
str(get_filter(qfiles))
## List of 2
## $ op : 'scalar' chr "="
## $ content:List of 2
## ..$ field: chr "type"
## ..$ value: chr "gene_expression"
要是我们想创建一个基于项目(比如“TCGA-OVCA”)的过滤该怎么办?我们有一些方法可以发现可获取的字段。
第一种是基于一些基本的R函数和直觉。
grep('pro',available_fields('files'),value=TRUE)
## [1] "cases.diagnoses.progression_free_survival"
## [2] "cases.diagnoses.progression_free_survival_event"
## [3] "cases.diagnoses.progression_or_recurrence"
## [4] "cases.project.awg_review"
## [5] "cases.project.dbgap_accession_number"
## [6] "cases.project.disease_type"
## [7] "cases.project.in_review"
## [8] "cases.project.intended_release_date"
## [9] "cases.project.is_legacy"
## [10] "cases.project.name"
## [11] "cases.project.primary_site"
## [12] "cases.project.program.dbgap_accession_number"
## [13] "cases.project.program.name"
## [14] "cases.project.program.program_id"
## [15] "cases.project.project_id"
## [16] "cases.project.releasable"
## [17] "cases.project.release_requested"
## [18] "cases.project.released"
## [19] "cases.project.request_submission"
## [20] "cases.project.state"
## [21] "cases.project.submission_enabled"
## [22] "cases.samples.days_to_sample_procurement"
## [23] "cases.samples.method_of_sample_procurement"
## [24] "cases.samples.portions.slides.number_proliferating_cells"
## [25] "cases.tissue_source_site.project"
有意思的是,项目信息嵌套在case里面。我们不需要知道细节除了一些信息在某些文件记录中的猜测,另外,我们需要知道在哪里(project_id
)。
files() %>% facet('cases.project.project_id') %>% aggregations()
## $cases.project.project_id
## key doc_count
## 1 FM-AD 36134
## 2 TCGA-BRCA 31511
## 3 TCGA-LUAD 17051
## 4 TCGA-UCEC 16130
## 5 TCGA-HNSC 15266
## 6 TCGA-OV 15057
## 7 TCGA-THCA 14420
## 8 TCGA-LUSC 15323
## 9 TCGA-LGG 14723
## 10 TCGA-KIRC 15082
## 11 TCGA-PRAD 14287
## 12 TCGA-COAD 14270
## 13 TCGA-GBM 11973
## 14 TCGA-SKCM 12724
## 15 TCGA-STAD 12845
## 16 TCGA-BLCA 11710
## 17 TCGA-LIHC 10814
## 18 TCGA-CESC 8593
## 19 TCGA-KIRP 8506
## 20 TCGA-SARC 7493
## 21 TCGA-PAAD 5306
## 22 TCGA-ESCA 5270
## 23 TCGA-PCPG 5032
## 24 TCGA-READ 4918
## 25 TCGA-TGCT 4217
## 26 TCGA-THYM 3444
## 27 TCGA-LAML 3960
## 28 TARGET-NBL 2795
## 29 TCGA-ACC 2546
## 30 TCGA-KICH 2324
## 31 TCGA-MESO 2330
## 32 TARGET-AML 2170
## 33 TCGA-UVM 2179
## 34 TCGA-UCS 1658
## 35 TARGET-WT 1406
## 36 TCGA-DLBC 1330
## 37 TCGA-CHOL 1348
## 38 TARGET-OS 47
## 39 TARGET-RT 174
## 40 TARGET-CCSK 15
我们注意到这正是我们需要的,TCGA-OV
也是正确的项目id。同时注意这里使用的filter
跟dplyr
包中的不同。
qfiles = files() %>%
filter( ~ cases.project.project_id == 'TCGA-OV' & type == 'gene_expression')
str(get_filter(qfiles))
## List of 2
## $ op : 'scalar' chr "and"
## $ content:List of 2
## ..$ :List of 2
## .. ..$ op : 'scalar' chr "="
## .. ..$ content:List of 2
## .. .. ..$ field: chr "cases.project.project_id"
## .. .. ..$ value: chr "TCGA-OV"
## ..$ :List of 2
## .. ..$ op : 'scalar' chr "="
## .. ..$ content:List of 2
## .. .. ..$ field: chr "type"
## .. .. ..$ value: chr "gene_expression"
qfiles %>% count()
## [1] 1137
然后生成manifest进行下载仅需要简单的一步:
manifest_df = qfiles %>% manifest()
head(manifest_df)
## # A tibble: 6 x 5
## id filename md5 size state
## <chr> <chr> <chr> <int> <chr>
## 1 567ced20-00cf-46f4-8bb8-a8553eba7f4b b2552f6f-dd1~ 9af0d99~ 258324 live
## 2 05692746-1770-47bd-8faa-f864a37e0084 701b8c71-6c0~ 8e9816f~ 526537 live
## 3 e2d47640-8565-4383-b2dd-0d3e2762e3da b2552f6f-dd1~ e05190e~ 543367 live
## 4 bc6dab72-dc5a-4ca6-aef3-f7242fa8e329 a1c4f19e-079~ 110d8cd~ 253059 live
## 5 0a176c20-f3f3-4bc9-bfe2-b2469e745025 01eac123-1e2~ b40921f~ 540592 live
## 6 2ae73487-7acf-4282-85d8-927f2ab8f18b 12c8b289-b9d~ 4d3c2b9~ 549437 live
注意我们可能处理的有些问题。查下文件名,存在很多文件包含“FPKM”, “FPKM-UQ”或 “counts”之类的。所以还需要进行过滤:
qfiles = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' &
type == 'gene_expression' &
analysis.workflow_type == 'HTSeq - Counts')
manifest_df = qfiles %>% manifest()
nrow(manifest_df)
## [1] 379
现在可以使用GDC 数据传输工具(在R中使用transfer()
或者使用命令行)进行所有文件的下载。查看the bulk downloads section。
认证
GDC提供控制和开放数据。控制的数据需要获得授权才能下载,请查看going through the process of obtaining access.
获取控制数据的授权后就可以下载了,首先要拿到一个种子文件(access a GDC authentication token),然后使用该包进行下载。
本包使用认证种子下载数据(查看transfer
和 gdcdata
文档),包含一个帮助函数gdc_token
,它会用下面三种方式查找(按顺序)种子文件:
- 字符串,存储为环境变量
GDC_TOKEN
- 文件路径,存为环境变量
GDC_TOKEN_FILE
- 用户家目录下的文件
.gdc_token
下面是一个例子:
token = gdc_token()
transfer(...,token=token)
# 或者
transfer(...,token=get_token())
数据文件获取和下载
通过GDC API进行数据下载
gdcdata
函数以一个或多个文件id的字符串向量作为参数。生成该向量的简单方式就是生成一个manifest
数据框,然后传入包含文件id的第一列。
fnames = gdcdata(manifest_df$id[1:2],progress=FALSE)
注意对于控制数据,需要提供种子文件。使用BiocParallel
包对与平行下载大量的小文件是非常有用的。
大量下载
大量下载文件功能仅对下载相对比较大的文件起作用,因此可以用这种方法下载BAM文件或者VCF文件。不然最好用上面说的方法。
fnames = gdcdata(manifest_df$id[3:10], access_method = 'client')
使用案例
案例
每个project_id有多少案例?
res = cases() %>% facet("project.project_id") %>% aggregations()
head(res)
## $project.project_id
## key doc_count
## 1 FM-AD 18004
## 2 TARGET-NBL 1127
## 3 TCGA-BRCA 1098
## 4 TARGET-AML 988
## 5 TARGET-WT 652
## 6 TCGA-GBM 617
## 7 TCGA-OV 608
## 8 TCGA-LUAD 585
## 9 TCGA-UCEC 560
## 10 TCGA-KIRC 537
## 11 TCGA-HNSC 528
## 12 TCGA-LGG 516
## 13 TCGA-THCA 507
## 14 TCGA-LUSC 504
## 15 TCGA-PRAD 500
## 16 TCGA-SKCM 470
## 17 TCGA-COAD 461
## 18 TCGA-STAD 443
## 19 TCGA-BLCA 412
## 20 TARGET-OS 381
## 21 TCGA-LIHC 377
## 22 TCGA-CESC 307
## 23 TCGA-KIRP 291
## 24 TCGA-SARC 261
## 25 TCGA-LAML 200
## 26 TCGA-ESCA 185
## 27 TCGA-PAAD 185
## 28 TCGA-PCPG 179
## 29 TCGA-READ 172
## 30 TCGA-TGCT 150
## 31 TCGA-THYM 124
## 32 TCGA-KICH 113
## 33 TCGA-ACC 92
## 34 TCGA-MESO 87
## 35 TCGA-UVM 80
## 36 TARGET-RT 75
## 37 TCGA-DLBC 58
## 38 TCGA-UCS 57
## 39 TCGA-CHOL 51
## 40 TARGET-CCSK 13
library(ggplot2)
ggplot(res$project.project_id,aes(x = key, y = doc_count)) +
geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
有多少案例包含在TARGET项目中?
cases() %>% filter(~ project.program.name=='TARGET') %>% count()
## [1] 3236
有多少个案例包含在所有项目中? How many cases are included in all TCGA projects?
cases() %>% filter(~ project.program.name=='TCGA') %>% count()
## [1] 11315
TCGA-BRCA样本类型?
# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
project.project_id=='TCGA-BRCA' ) %>%
facet('samples.sample_type') %>% aggregations()
resp$samples.sample_type
## key doc_count
## 1 Primary Tumor 1098
## 2 Blood Derived Normal 1011
## 3 Solid Tissue Normal 162
## 4 Metastatic 7
获取TCGA-BRCA所有正常样本
# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
samples.sample_type=='Solid Tissue Normal') %>%
GenomicDataCommons::select(c(default_fields(cases()),'samples.sample_type')) %>%
response_all()
count(resp)
## [1] 162
res = resp %>% results()
str(res[1],list.len=6)
## List of 1
## $ updated_datetime: chr [1:162] "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" ...
head(ids(resp))
## [1] "6fa2a667-9c36-4526-8a58-1975e863a806"
## [2] "ef4cbd38-bc79-4d60-a715-647edd2ebe9e"
## [3] "dd3bfb26-b534-4917-9c4d-9fe7b6477762"
## [4] "9ddc3e7b-8b54-4a83-8335-8053940f56c1"
## [5] "f130f376-5801-40f9-975d-a7e2f7b5670d"
## [6] "af577366-0258-49e7-b6af-e70056c081a4"
文件
有多少种可获取信息的文件?
res = files() %>% facet('type') %>% aggregations()
res$type
## key doc_count
## 1 simple_somatic_mutation 64015
## 2 annotated_somatic_mutation 63580
## 3 aligned_reads 45985
## 4 copy_number_segment 44752
## 5 gene_expression 34713
## 6 slide_image 30036
## 7 biospecimen_supplement 25151
## 8 mirna_expression 22976
## 9 clinical_supplement 12496
## 10 methylation_beta_value 12359
## 11 aggregated_somatic_mutation 186
## 12 masked_somatic_mutation 132
ggplot(res$type,aes(x = key,y = doc_count)) + geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
为GBM查找基因水平的RNA-seq定量文件
q = files() %>%
GenomicDataCommons::select(available_fields('files')) %>%
filter(~ cases.project.project_id=='TCGA-GBM' &
data_type=='Gene Expression Quantification')
q %>% facet('analysis.workflow_type') %>% aggregations()
## $analysis.workflow_type
## key doc_count
## 1 HTSeq - Counts 174
## 2 HTSeq - FPKM 174
## 3 HTSeq - FPKM-UQ 174
# so need to add another filter
file_ids = q %>% filter(~ cases.project.project_id=='TCGA-GBM' &
data_type=='Gene Expression Quantification' &
analysis.workflow_type == 'HTSeq - Counts') %>%
GenomicDataCommons::select('file_id') %>%
response_all() %>%
ids()
切片
从TCGA-BAM获取所有的BAM文件
q = files() %>%
GenomicDataCommons::select(available_fields('files')) %>%
filter(~ cases.project.project_id == 'TCGA-GBM' &
data_type == 'Aligned Reads' &
experimental_strategy == 'RNA-Seq' &
data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()
bamfile = slicing(file_ids[1],regions="chr12:6534405-6538375",token=gdc_token())
library(GenomicAlignments)
aligns = readGAlignments(bamfile)
最后
从我学习后的感受来看,这个包有自己的生态,有部分动作函数与tidyverse
包同名函数存在不同用法。
如果仅是下载数据,更推荐我最近学习了一下的另一个包——TCGAbiolinks
,这个官方包可以作为补充做一些分析和数据转换操作。
-
根据使用查看单个函数详情 ↩