【r<-包】使用GenomicDataCommons包下载和处理TCGA数据

资料来源该包手册。R社区已经有不少下载和处理TCGA数据的包,但目前能获取最新GRch38的应该只有TCGAbiolinks和本包,本包也是TCGA开发的官方R包,值得信赖。 比较麻烦的就是文件,样本ID的转换。

GDC是什么?

来自官网Genomic Data Commons (GDC) website的信息:

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs.

The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared.

As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The data model for the GDC is complex, but it worth a quick overview. The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details. The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.

快速开始

这个软件尚在开发之中,希望收到用户的反馈。如果想要报告bug或者问题,要么submit a new issue或者在R中使用bug.report(package='GenomicDataCommons')

安装

从bioconductor:

source('https://bioconductor.org/biocLite.R')
biocLite('GenomicDataCommons')

导入:

library(GenomicDataCommons)

查看基本的特性

GenomicDataCommons::status()
## $commit
## [1] "e9e20d6f97f2bf6dd3b3261e36ead57c56a4c7cc"
## 
## $data_release
## [1] "Data Release 12.0 - June 13, 2018"
## 
## $status
## [1] "OK"
## 
## $tag
## [1] "1.14.1"
## 
## $version
## [1] 1

If this statement results in an error such as SSL connect error, see the troubleshooting section below.

寻找数据

下面的代码构建了一个manifest用来引导原始数据的下载,使用HTSeq查找和过滤Ovarian Cancer的基因表达的原始计数。

library(magrittr)
ge_manifest = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' &
                type == 'gene_expression' &
                analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()

下载数据

下面的代码块下载 379 个基因表达数据文件。使用多进程进行下载可以极快地提高下载速度。

destdir = tempdir()
fnames = lapply(ge_manifest$id[1:20],gdcdata)

如果下载的数据包含控制data,下载时需要包括一个token(有下载权限的标志)。具体请看the authentication section below.

元数据获取

expands = c("diagnoses","annotations",
             "demographic","exposures")
clinResults = cases() %>%
    GenomicDataCommons::select(NULL) %>%
    GenomicDataCommons::expand(expands) %>%
    results(size=50)
str(clinResults,list.len=10)
## List of 4
##  $ diagnoses  :List of 50
##   ..$ 3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Lung, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-19T09:21:58.285024-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD7376_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Liver"
##   .. .. [list output truncated]
##   ..$ 36a29f50-5081-4a45-ab83-b11164e6781a:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Lung, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:15:26.111532-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Small cell carcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD15003_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Lymph node, NOS"
##   .. .. [list output truncated]
##   ..$ b8ed18f9-2582-4454-a5d8-529370d9da2b:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "Unknown"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Head, face or neck, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:04:08.462038-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Squamous cell carcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD4846_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Lung, NOS"
##   .. .. [list output truncated]
##   ..$ fd9ffb79-7d96-4a14-955e-236011d283b6:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "primary"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Colon, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-19T09:25:18.484242-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD8471_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Colon, NOS"
##   .. .. [list output truncated]
##   ..$ 672fdc0b-0844-4bbf-8b3a-1bff693d17fc:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "primary"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Stomach, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:04:28.356056-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD880_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Stomach, NOS"
##   .. .. [list output truncated]
##   ..$ 3468b376-b05e-40e7-851b-25bb9251378f:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "Unknown"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Unknown"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:05:07.197842-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD103_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Not Reported"
##   .. .. [list output truncated]
##   ..$ 2d29df1c-a548-4e95-8206-3c473e9ffdca:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Lung, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T15:43:06.585011-05:00"
##   .. ..$ updated_datetime                 : chr "2018-01-29T14:25:29.405142-06:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD2816_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Connective, subcutaneous and other soft tissues, NOS"
##   .. .. [list output truncated]
##   ..$ 5bd74091-bef3-49a1-b617-45167f02baa6:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "Unknown"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Unknown"
##   .. ..$ created_datetime                 : chr "2017-06-16T15:39:59.533998-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD10160_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Bone, NOS"
##   .. .. [list output truncated]
##   ..$ 48758eeb-6d56-41ff-b0f5-e2e5bfa3e63a:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Anus, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-19T09:33:40.540869-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Squamous cell carcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD9751_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Brain, NOS"
##   .. .. [list output truncated]
##   ..$ 2bac2a54-5843-498d-bfeb-6d795ba54665:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "primary"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Kidney, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T15:43:06.585011-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Papillary renal cell carcinoma"
##   .. ..$ submitter_id                     : chr "AD2814_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Kidney, NOS"
##   .. .. [list output truncated]
##   .. [list output truncated]
##  $ case_id    : chr [1:50] "3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6" "36a29f50-5081-4a45-ab83-b11164e6781a" "b8ed18f9-2582-4454-a5d8-529370d9da2b" "fd9ffb79-7d96-4a14-955e-236011d283b6" ...
##  $ demographic:'data.frame': 50 obs. of  8 variables:
##   ..$ updated_datetime: chr [1:50] "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" ...
##   ..$ created_datetime: chr [1:50] "2017-06-19T11:44:09.223033-05:00" "2017-06-19T11:42:12.871583-05:00" "2017-06-19T11:33:07.960036-05:00" "2017-06-19T11:45:29.124685-05:00" ...
##   ..$ gender          : chr [1:50] "male" "male" "female" "female" ...
##   ..$ submitter_id    : chr [1:50] "AD7376_demographic" "AD15003_demographic" "AD4846_demographic" "AD8471_demographic" ...
##   ..$ state           : chr [1:50] "submitted" "submitted" "submitted" "submitted" ...
##   ..$ race            : chr [1:50] "not reported" "not reported" "not reported" "not reported" ...
##   ..$ demographic_id  : chr [1:50] "39e8ea37-69ca-4541-b26c-4dbaa5cd8b77" "3a456586-6198-44d7-9dfa-fb510796d3ad" "ec055724-878d-4c2f-881c-aae7fcf8e5aa" "6f286488-6b1d-432c-8014-bacf68423fd6" ...
##   ..$ ethnicity       : chr [1:50] "not reported" "not reported" "not reported" "not reported" ...
##  $ id         : chr [1:50] "3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6" "36a29f50-5081-4a45-ab83-b11164e6781a" "b8ed18f9-2582-4454-a5d8-529370d9da2b" "fd9ffb79-7d96-4a14-955e-236011d283b6" ...
##  - attr(*, "row.names")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"

基本设计

这个包的设计跟dplyr的"hadleyverse"方法有点类似。大致上,寻找和获取文件、元数据的函数可以分为:

  1. 基于GDC API终点的简单查询构建器
  2. 应用,修正过滤,字段选择和分面的动词集合并生成一个新的查询对象
  3. 执行查询和从GDC返回结果的动词集合

另外还有一些询问GDC API信息与可获取的字段,索引BAM文件,下载实际数据文件的函数。

下面是一个概览[1]

  • Creating a query
    • projects()
    • cases()
    • files()
    • annotations()
  • Manipulating a query
    • filter()
    • facet()
    • select()
  • Introspection on the GDC API fields
    • mapping()
    • available_fields()
    • default_fields()
    • grep_fields()
    • field_picker()
    • available_values()
    • available_expand()
  • Executing an API call to retrieve query results
    • results()
    • count()
    • response()
  • Raw data file downloads
    • gdcdata()
    • transfer()
    • gdc_client()
  • Summarizing and aggregating field values (faceting)
    • aggregations()
  • Authentication
    • gdc_token()
  • BAM file slicing
    • slicing()

使用

处理NCI GDC是存在两类操作。

  1. 查询元数据和查找数据文件 (例如,为某类癌症病人查找所有的基因表达定量数据文件)
  2. 从GDC传输原始或处理过的数据到另一台电脑

这两类操作在下面进行详述。

查询元数据

大量关于病人、文件、项目和所谓注释的元数据都可以通过NCI GDC API获取。通常,我们想要查询元数据,然后进行下载或者执行所谓的聚合操作(和table()类似的功能)

首先创建一个空查询获取元数据。我们然后经常想要filter查询,对retrieving results提前进行一些限制。GenomicDataCommons包有列出条目的帮助函数用来帮助过滤。

创建查询

下面4个函数可以创建GDCQuery对象用来查询元数据:

  • projects()
  • cases()
  • files()
  • annotations()
pquery = projects()

pquery对象现在是一个S3类,GDCQuery。对象包含下面一些元素:

  • 字段:这是一个关于字段的字符串向量,在检索数据时会返回。如果没有指定字段,会使用默认字段 (查看 default_fields())。
  • filters: 在使用filter()方法后会返回结果用于对后续的结果检索进行过滤。
  • facets: 一个字段的字符串向量,在聚合数据(aggreations())时使用。
  • archive: 要么“default”要么“legacy”
  • token: 来自GDC的token字符串。查看the authentication section获取详情,注意,通常检索原始数据是不需要权限的,有些数据的下载需要。

查看实际的对象(习惯使用str()!),注意查询不会包含结果。

str(pquery)
## List of 5
##  $ fields : chr [1:16] "awg_review" "dbgap_accession_number" "disease_type" "in_review" ...
##  $ filters: NULL
##  $ facets : NULL
##  $ legacy : logi FALSE
##  $ expand : NULL
##  - attr(*, "class")= chr [1:3] "gdc_projects" "GDCQuery" "list"

检索结果

[ GDC分页文档]

[ GDC排序文档 ]

如果构建好了查询对象,下一步是从GDC检索结果。检索结果的最基本类型是满足条件的、可获取记录的简单counts()。注意我们刚才并未设置如何过滤,所以count()会返回所有满足标准的项目记录。

pcount = count(pquery)
# 或者
pcount = pquery %>% count()
pcount
## [1] 40

results()方法会取回实际的结果:

presults = pquery %>% results()

这些从GDC返回的结果都以JSON格式存储,函数自动将它转换为R里面的嵌套列表。

str(presults)
## List of 8
##  $ dbgap_accession_number: chr [1:10] "phs001179" "phs000470" NA NA ...
##  $ disease_type          :List of 10
##   ..$ FM-AD    : chr [1:23] "Germ Cell Neoplasms" "Acinar Cell Neoplasms" "Miscellaneous Tumors" "Thymic Epithelial Neoplasms" ...
##   ..$ TARGET-RT: chr "Rhabdoid Tumor"
##   ..$ TCGA-UCS : chr "Uterine Carcinosarcoma"
##   ..$ TCGA-LUSC: chr "Lung Squamous Cell Carcinoma"
##   ..$ TCGA-BRCA: chr "Breast Invasive Carcinoma"
##   ..$ TCGA-SKCM: chr "Skin Cutaneous Melanoma"
##   ..$ TARGET-OS: chr "Osteosarcoma"
##   ..$ TCGA-THYM: chr "Thymoma"
##   ..$ TARGET-WT: chr "High-Risk Wilms Tumor"
##   ..$ TCGA-ESCA: chr "Esophageal Carcinoma"
##  $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ state                 : chr [1:10] "open" "open" "open" "open" ...
##  $ primary_site          :List of 10
##   ..$ FM-AD    : chr [1:42] "Kidney" "Testis" "Unknown" "Other and unspecified parts of biliary tract" ...
##   ..$ TARGET-RT: chr "Kidney"
##   ..$ TCGA-UCS : chr "Uterus"
##   ..$ TCGA-LUSC: chr "Lung"
##   ..$ TCGA-BRCA: chr "Breast"
##   ..$ TCGA-SKCM: chr "Skin"
##   ..$ TARGET-OS: chr "Bone"
##   ..$ TCGA-THYM: chr "Thymus"
##   ..$ TARGET-WT: chr "Kidney"
##   ..$ TCGA-ESCA: chr "Esophagus"
##  $ project_id            : chr [1:10] "FM-AD" "TARGET-RT" "TCGA-UCS" "TCGA-LUSC" ...
##  $ id                    : chr [1:10] "FM-AD" "TARGET-RT" "TCGA-UCS" "TCGA-LUSC" ...
##  $ name                  : chr [1:10] "Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)" "Rhabdoid Tumor" "Uterine Carcinosarcoma" "Lung Squamous Cell Carcinoma" ...
##  - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

默认只返回10条记录,我们可以对results()添加sizefrom参数改变数目。这里存在一个简单的方法,results_all()会返回所有可获得的结果。小心使用这个函数,它可以消耗非常长的时间,数据巨大。

length(ids(presults))
## [1] 10
presults = pquery %>% results_all()
length(ids(presults))
## [1] 40
# 包含所有的记录
length(ids(presults)) == count(pquery)
## [1] TRUE

抽取结果的子集或者将结果变成更通用的R数据结构不是很简单,然后可以借助 purrrrlist、 和data.tree

想要交互式浏览结果,使用listviewer包。

字段和值

[ GDC fields 文档 ]

查询和检索GDC数据的中心是指定想要返回的字段、根据字段与值进行过滤、分面或者聚合。本包包含两个简单地函数,available_fields()default_fields()。每个都可以操作“cases”, “files”, “annotations”, “projects”或者GDCQuery对象。

default_fields('files')
##  [1] "access"                "acl"                  
##  [3] "batch_id"              "created_datetime"     
##  [5] "data_category"         "data_format"          
##  [7] "data_type"             "error_type"           
##  [9] "experimental_strategy" "file_autocomplete"    
## [11] "file_id"               "file_name"            
## [13] "file_size"             "file_state"           
## [15] "imaging_date"          "magnification"        
## [17] "md5sum"                "origin"               
## [19] "platform"              "read_pair_number"     
## [21] "revision"              "state"                
## [23] "state_comment"         "submitter_id"         
## [25] "tags"                  "type"                 
## [27] "updated_datetime"
# The number of fields available for files endpoint
length(available_fields('files'))
## [1] 703
# The first few fields available for files endpoint
head(available_fields('files'))
## [1] "access"                    "acl"                      
## [3] "analysis.analysis_id"      "analysis.analysis_type"   
## [5] "analysis.batch_id"         "analysis.created_datetime"

字段可以通过类似dplyr包流程的方式指定,select()函数是重设GDCQuery对象字段槽的动词;注意这与dplyr限制已有字段不同。

# 这里是默认字段
qcases = cases()
qcases$fields
##  [1] "aliquot_ids"              "analyte_ids"             
##  [3] "batch_id"                 "case_autocomplete"       
##  [5] "case_id"                  "created_datetime"        
##  [7] "days_to_index"            "days_to_lost_to_followup"
##  [9] "disease_type"             "index_date"              
## [11] "lost_to_followup"         "portion_ids"             
## [13] "primary_site"             "sample_ids"              
## [15] "slide_ids"                "state"                   
## [17] "submitter_aliquot_ids"    "submitter_analyte_ids"   
## [19] "submitter_id"             "submitter_portion_ids"   
## [21] "submitter_sample_ids"     "submitter_slide_ids"     
## [23] "updated_datetime"
# 使用所有的字段
# Note that checking of fields is done by select()
qcases = cases() %>% GenomicDataCommons::select(available_fields('cases'))
head(qcases$fields)
## [1] "case_id"                   "aliquot_ids"              
## [3] "analyte_ids"               "annotations.annotation_id"
## [5] "annotations.batch_id"      "annotations.case_id"

grep_fields()field_picker()可以操作寻找感兴趣的字段。

分面与聚合

[ GDC facet 文档 ]

有点类似R的table方法,GDC API提供了称为聚合或分面的操作,通过指定一个或多个字段,GDC会返回所有可能值的计数。

# 指定文件类型的综述
res = files() %>% facet(c('type','data_type')) %>% aggregations()
res$type
##                            key doc_count
## 1      simple_somatic_mutation     64015
## 2   annotated_somatic_mutation     63580
## 3                aligned_reads     45985
## 4          copy_number_segment     44752
## 5              gene_expression     34713
## 6                  slide_image     30036
## 7       biospecimen_supplement     25151
## 8             mirna_expression     22976
## 9          clinical_supplement     12496
## 10      methylation_beta_value     12359
## 11 aggregated_somatic_mutation       186
## 12     masked_somatic_mutation       132

Filtering

[ GDC filtering 文档 ]

The GenomicDataCommons package 使用了一种非标准的评估形式来指定类似于R的查询,然后翻译为R列表。这个R表达式使用了公式接口,如Hadley Wickham 在vignette on non-standard evaluation中所建议。

It’s best to use a formula because a formula captures both the expression to evaluate and the environment where the evaluation occurs. This is important if the expression is a mixture of variables in a data frame and objects in the local environment [for example].

对于用户来说不需要关注很多底层细节,除了注意过滤表达式必须以~开始。

qfiles = files()
qfiles %>% count() # all files
## [1] 356381

过滤文件类型为基因表达:

qfiles = files() %>% filter(~ type == 'gene_expression')
# here is what the filter looks like after translation
str(get_filter(qfiles))
## List of 2
##  $ op     : 'scalar' chr "="
##  $ content:List of 2
##   ..$ field: chr "type"
##   ..$ value: chr "gene_expression"

要是我们想创建一个基于项目(比如“TCGA-OVCA”)的过滤该怎么办?我们有一些方法可以发现可获取的字段。

第一种是基于一些基本的R函数和直觉。

grep('pro',available_fields('files'),value=TRUE)
##  [1] "cases.diagnoses.progression_free_survival"               
##  [2] "cases.diagnoses.progression_free_survival_event"         
##  [3] "cases.diagnoses.progression_or_recurrence"               
##  [4] "cases.project.awg_review"                                
##  [5] "cases.project.dbgap_accession_number"                    
##  [6] "cases.project.disease_type"                              
##  [7] "cases.project.in_review"                                 
##  [8] "cases.project.intended_release_date"                     
##  [9] "cases.project.is_legacy"                                 
## [10] "cases.project.name"                                      
## [11] "cases.project.primary_site"                              
## [12] "cases.project.program.dbgap_accession_number"            
## [13] "cases.project.program.name"                              
## [14] "cases.project.program.program_id"                        
## [15] "cases.project.project_id"                                
## [16] "cases.project.releasable"                                
## [17] "cases.project.release_requested"                         
## [18] "cases.project.released"                                  
## [19] "cases.project.request_submission"                        
## [20] "cases.project.state"                                     
## [21] "cases.project.submission_enabled"                        
## [22] "cases.samples.days_to_sample_procurement"                
## [23] "cases.samples.method_of_sample_procurement"              
## [24] "cases.samples.portions.slides.number_proliferating_cells"
## [25] "cases.tissue_source_site.project"

有意思的是,项目信息嵌套在case里面。我们不需要知道细节除了一些信息在某些文件记录中的猜测,另外,我们需要知道在哪里(project_id)。

files() %>% facet('cases.project.project_id') %>% aggregations()
## $cases.project.project_id
##            key doc_count
## 1        FM-AD     36134
## 2    TCGA-BRCA     31511
## 3    TCGA-LUAD     17051
## 4    TCGA-UCEC     16130
## 5    TCGA-HNSC     15266
## 6      TCGA-OV     15057
## 7    TCGA-THCA     14420
## 8    TCGA-LUSC     15323
## 9     TCGA-LGG     14723
## 10   TCGA-KIRC     15082
## 11   TCGA-PRAD     14287
## 12   TCGA-COAD     14270
## 13    TCGA-GBM     11973
## 14   TCGA-SKCM     12724
## 15   TCGA-STAD     12845
## 16   TCGA-BLCA     11710
## 17   TCGA-LIHC     10814
## 18   TCGA-CESC      8593
## 19   TCGA-KIRP      8506
## 20   TCGA-SARC      7493
## 21   TCGA-PAAD      5306
## 22   TCGA-ESCA      5270
## 23   TCGA-PCPG      5032
## 24   TCGA-READ      4918
## 25   TCGA-TGCT      4217
## 26   TCGA-THYM      3444
## 27   TCGA-LAML      3960
## 28  TARGET-NBL      2795
## 29    TCGA-ACC      2546
## 30   TCGA-KICH      2324
## 31   TCGA-MESO      2330
## 32  TARGET-AML      2170
## 33    TCGA-UVM      2179
## 34    TCGA-UCS      1658
## 35   TARGET-WT      1406
## 36   TCGA-DLBC      1330
## 37   TCGA-CHOL      1348
## 38   TARGET-OS        47
## 39   TARGET-RT       174
## 40 TARGET-CCSK        15

我们注意到这正是我们需要的,TCGA-OV也是正确的项目id。同时注意这里使用的filterdplyr包中的不同。

qfiles = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' & type == 'gene_expression')
str(get_filter(qfiles))
## List of 2
##  $ op     : 'scalar' chr "and"
##  $ content:List of 2
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "cases.project.project_id"
##   .. .. ..$ value: chr "TCGA-OV"
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "type"
##   .. .. ..$ value: chr "gene_expression"
qfiles %>% count()
## [1] 1137

然后生成manifest进行下载仅需要简单的一步:

manifest_df = qfiles %>% manifest()
head(manifest_df)
## # A tibble: 6 x 5
##   id                                   filename      md5        size state
##   <chr>                                <chr>         <chr>     <int> <chr>
## 1 567ced20-00cf-46f4-8bb8-a8553eba7f4b b2552f6f-dd1~ 9af0d99~ 258324 live 
## 2 05692746-1770-47bd-8faa-f864a37e0084 701b8c71-6c0~ 8e9816f~ 526537 live 
## 3 e2d47640-8565-4383-b2dd-0d3e2762e3da b2552f6f-dd1~ e05190e~ 543367 live 
## 4 bc6dab72-dc5a-4ca6-aef3-f7242fa8e329 a1c4f19e-079~ 110d8cd~ 253059 live 
## 5 0a176c20-f3f3-4bc9-bfe2-b2469e745025 01eac123-1e2~ b40921f~ 540592 live 
## 6 2ae73487-7acf-4282-85d8-927f2ab8f18b 12c8b289-b9d~ 4d3c2b9~ 549437 live

注意我们可能处理的有些问题。查下文件名,存在很多文件包含“FPKM”, “FPKM-UQ”或 “counts”之类的。所以还需要进行过滤:

qfiles = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' &
                            type == 'gene_expression' &
                            analysis.workflow_type == 'HTSeq - Counts')
manifest_df = qfiles %>% manifest()
nrow(manifest_df)
## [1] 379

现在可以使用GDC 数据传输工具(在R中使用transfer()或者使用命令行)进行所有文件的下载。查看the bulk downloads section

认证

[ GDC 认证 documentation ]

GDC提供控制和开放数据。控制的数据需要获得授权才能下载,请查看going through the process of obtaining access.

获取控制数据的授权后就可以下载了,首先要拿到一个种子文件(access a GDC authentication token),然后使用该包进行下载。

本包使用认证种子下载数据(查看transfergdcdata 文档),包含一个帮助函数gdc_token,它会用下面三种方式查找(按顺序)种子文件:

  1. 字符串,存储为环境变量GDC_TOKEN
  2. 文件路径,存为环境变量GDC_TOKEN_FILE
  3. 用户家目录下的文件.gdc_token

下面是一个例子:

token = gdc_token()
transfer(...,token=token)
# 或者
transfer(...,token=get_token())

数据文件获取和下载

通过GDC API进行数据下载

gdcdata函数以一个或多个文件id的字符串向量作为参数。生成该向量的简单方式就是生成一个manifest数据框,然后传入包含文件id的第一列。

fnames = gdcdata(manifest_df$id[1:2],progress=FALSE)

注意对于控制数据,需要提供种子文件。使用BiocParallel包对与平行下载大量的小文件是非常有用的。

大量下载

大量下载文件功能仅对下载相对比较大的文件起作用,因此可以用这种方法下载BAM文件或者VCF文件。不然最好用上面说的方法。

fnames = gdcdata(manifest_df$id[3:10], access_method = 'client')

使用案例

案例

每个project_id有多少案例?

res = cases() %>% facet("project.project_id") %>% aggregations()
head(res)
## $project.project_id
##            key doc_count
## 1        FM-AD     18004
## 2   TARGET-NBL      1127
## 3    TCGA-BRCA      1098
## 4   TARGET-AML       988
## 5    TARGET-WT       652
## 6     TCGA-GBM       617
## 7      TCGA-OV       608
## 8    TCGA-LUAD       585
## 9    TCGA-UCEC       560
## 10   TCGA-KIRC       537
## 11   TCGA-HNSC       528
## 12    TCGA-LGG       516
## 13   TCGA-THCA       507
## 14   TCGA-LUSC       504
## 15   TCGA-PRAD       500
## 16   TCGA-SKCM       470
## 17   TCGA-COAD       461
## 18   TCGA-STAD       443
## 19   TCGA-BLCA       412
## 20   TARGET-OS       381
## 21   TCGA-LIHC       377
## 22   TCGA-CESC       307
## 23   TCGA-KIRP       291
## 24   TCGA-SARC       261
## 25   TCGA-LAML       200
## 26   TCGA-ESCA       185
## 27   TCGA-PAAD       185
## 28   TCGA-PCPG       179
## 29   TCGA-READ       172
## 30   TCGA-TGCT       150
## 31   TCGA-THYM       124
## 32   TCGA-KICH       113
## 33    TCGA-ACC        92
## 34   TCGA-MESO        87
## 35    TCGA-UVM        80
## 36   TARGET-RT        75
## 37   TCGA-DLBC        58
## 38    TCGA-UCS        57
## 39   TCGA-CHOL        51
## 40 TARGET-CCSK        13
library(ggplot2)
ggplot(res$project.project_id,aes(x = key, y = doc_count)) +
    geom_bar(stat='identity') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

img

有多少案例包含在TARGET项目中?

cases() %>% filter(~ project.program.name=='TARGET') %>% count()
## [1] 3236

有多少个案例包含在所有项目中? How many cases are included in all TCGA projects?

cases() %>% filter(~ project.program.name=='TCGA') %>% count()
## [1] 11315

TCGA-BRCA样本类型?

# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
                              project.project_id=='TCGA-BRCA' ) %>%
    facet('samples.sample_type') %>% aggregations()
resp$samples.sample_type
##                    key doc_count
## 1        Primary Tumor      1098
## 2 Blood Derived Normal      1011
## 3  Solid Tissue Normal       162
## 4           Metastatic         7

获取TCGA-BRCA所有正常样本

# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
                              samples.sample_type=='Solid Tissue Normal') %>%
    GenomicDataCommons::select(c(default_fields(cases()),'samples.sample_type')) %>%
    response_all()
count(resp)
## [1] 162
res = resp %>% results()
str(res[1],list.len=6)
## List of 1
##  $ updated_datetime: chr [1:162] "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" ...
head(ids(resp))
## [1] "6fa2a667-9c36-4526-8a58-1975e863a806"
## [2] "ef4cbd38-bc79-4d60-a715-647edd2ebe9e"
## [3] "dd3bfb26-b534-4917-9c4d-9fe7b6477762"
## [4] "9ddc3e7b-8b54-4a83-8335-8053940f56c1"
## [5] "f130f376-5801-40f9-975d-a7e2f7b5670d"
## [6] "af577366-0258-49e7-b6af-e70056c081a4"

文件

有多少种可获取信息的文件?

res = files() %>% facet('type') %>% aggregations()
res$type
##                            key doc_count
## 1      simple_somatic_mutation     64015
## 2   annotated_somatic_mutation     63580
## 3                aligned_reads     45985
## 4          copy_number_segment     44752
## 5              gene_expression     34713
## 6                  slide_image     30036
## 7       biospecimen_supplement     25151
## 8             mirna_expression     22976
## 9          clinical_supplement     12496
## 10      methylation_beta_value     12359
## 11 aggregated_somatic_mutation       186
## 12     masked_somatic_mutation       132
ggplot(res$type,aes(x = key,y = doc_count)) + geom_bar(stat='identity') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

img

为GBM查找基因水平的RNA-seq定量文件

q = files() %>%
    GenomicDataCommons::select(available_fields('files')) %>%
    filter(~ cases.project.project_id=='TCGA-GBM' &
               data_type=='Gene Expression Quantification')
q %>% facet('analysis.workflow_type') %>% aggregations()
## $analysis.workflow_type
##               key doc_count
## 1  HTSeq - Counts       174
## 2    HTSeq - FPKM       174
## 3 HTSeq - FPKM-UQ       174
# so need to add another filter
file_ids = q %>% filter(~ cases.project.project_id=='TCGA-GBM' &
                            data_type=='Gene Expression Quantification' &
                            analysis.workflow_type == 'HTSeq - Counts') %>%
    GenomicDataCommons::select('file_id') %>%
    response_all() %>%
    ids()

切片

从TCGA-BAM获取所有的BAM文件

q = files() %>%
    GenomicDataCommons::select(available_fields('files')) %>%
    filter(~ cases.project.project_id == 'TCGA-GBM' &
               data_type == 'Aligned Reads' &
               experimental_strategy == 'RNA-Seq' &
               data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()
bamfile = slicing(file_ids[1],regions="chr12:6534405-6538375",token=gdc_token())
library(GenomicAlignments)
aligns = readGAlignments(bamfile)

最后

从我学习后的感受来看,这个包有自己的生态,有部分动作函数与tidyverse包同名函数存在不同用法。

如果仅是下载数据,更推荐我最近学习了一下的另一个包——TCGAbiolinks,这个官方包可以作为补充做一些分析和数据转换操作。


  1. 根据使用查看单个函数详情

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 215,384评论 6 497
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,845评论 3 391
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 161,148评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,640评论 1 290
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,731评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,712评论 1 294
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,703评论 3 415
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,473评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,915评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,227评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,384评论 1 345
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,063评论 5 340
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,706评论 3 324
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,302评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,531评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,321评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,248评论 2 352

推荐阅读更多精彩内容