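Comparing any gene across any grouping is a key point from the TCGA data-mining part of the 生信技能树 [生信爆款入门课程] course. To extend what was covered in class, this post works through it as practice and as a summary.
1. Load and inspect the input data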
> rm(list=ls())
> load("for_boxplot.Rdata")
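Contents of the loaded data, and how to read them:
exp is the expression matrix containing both tumor and normal samples; exprSet is the expression matrix containing tumor samples only. meta is the clinical information table, and Group holds the tumor/normal label for each sample. mut is the mutation information, read from a maf file and then subset.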
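The post does not show how Group was built. One common way, assuming the sample names are standard TCGA barcodes, is to read the sample-type code at characters 14 to 15 (01 to 09 for tumor, 10 to 19 for normal). The barcodes and the name `group_toy` below are made up for illustration and are not taken from for_boxplot.Rdata.

```r
# Toy TCGA-style barcodes (hypothetical, for illustration only)
barcodes <- c("TCGA-AA-0001-01A-11R-0000-13",
              "TCGA-AA-0001-11A-11R-0000-13",
              "TCGA-AA-0002-01A-11R-0000-13",
              "TCGA-AA-0003-11A-11R-0000-13")

# Characters 14-15 hold the sample-type code: 01-09 tumor, 10-19 normal
sample_type <- as.integer(substr(barcodes, 14, 15))

# Put "normal" first so it acts as the reference level in plots and tests
group_toy <- factor(ifelse(sample_type < 10, "tumor", "normal"),
                    levels = c("normal", "tumor"))
table(group_toy)
```

Setting the factor levels explicitly keeps normal on the left of the plot regardless of the order in which samples appear.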
2. Compare the expression of any miRNA between tumor and normal samples
> # Take hsa-mir-143 as an example
> table(Group)
Group
normal tumor
71 522
> library(ggstatsplot)
> dat = data.frame(gene = exp["hsa-mir-143",],
+ group = Group)
> ggbetweenstats(data = dat, x = group, y = gene,title = "hsa-mir-143")
>
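Since the same two lines are repeated for every miRNA of interest, they can be wrapped in a small helper. The sketch below is only an illustration: `make_plot_data` is a hypothetical name, the toy matrix stands in for `exp`, and the checks assume the expression object is a matrix with miRNA names as row names.

```r
# Hypothetical helper: build the plotting data frame for one miRNA
make_plot_data <- function(expr, mirna, group) {
  stopifnot(mirna %in% rownames(expr),     # the miRNA must be a row of the matrix
            ncol(expr) == length(group))   # one group label per sample column
  data.frame(gene = as.numeric(expr[mirna, ]), group = group)
}

# Toy inputs (made-up values): 2 miRNAs x 4 samples
expr_toy <- matrix(1:8, nrow = 2,
                   dimnames = list(c("mir-a", "mir-b"), paste0("s", 1:4)))
group_toy <- factor(c("normal", "tumor", "tumor", "normal"))

dat_toy <- make_plot_data(expr_toy, "mir-b", group_toy)
dat_toy
```

The resulting data frame has the same `gene` and `group` columns used in the plotting call above, so it could be passed to ggbetweenstats in the same way.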
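3. Compare the expression of any miRNA across any grouping
Any grouping that can be looked up in, or derived from, the clinical information can be used, for example vital status, race, or stage. Note that the sample order must be adjusted first so that the rows of the clinical table and the columns of the expression matrix correspond one to one.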
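The reordering step can be sketched with `match()`. This is a minimal example on toy data, assuming the clinical table has a patient ID column and that the first 12 characters of each expression column name are the patient ID; `patient_id`, `expr_toy`, and `meta_toy` are hypothetical names, not objects from for_boxplot.Rdata.

```r
# Toy expression matrix: 2 miRNAs x 3 tumor samples (values are made up)
expr_toy <- matrix(c(5.1, 6.2, 7.3,
                     1.0, 2.0, 3.0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("mir-a", "mir-b"),
                                   c("TCGA-AA-0003-01A",
                                     "TCGA-AA-0001-01A",
                                     "TCGA-AA-0002-01A")))

# Toy clinical table in a different order; `patient_id` is a hypothetical column
meta_toy <- data.frame(patient_id = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
                       vital_status = c("alive", "dead", "alive"),
                       stringsAsFactors = FALSE)

# Patient ID = first 12 characters of the sample barcode
expr_patient <- substr(colnames(expr_toy), 1, 12)

# Reorder clinical rows so that row i describes expression column i
meta_toy <- meta_toy[match(expr_patient, meta_toy$patient_id), ]

# Fail early if a sample has no clinical row or the order is still off
stopifnot(!anyNA(meta_toy$patient_id),
          identical(meta_toy$patient_id, expr_patient))
```

Without this check, `data.frame()` would still combine the columns silently, and each expression value could end up paired with another patient's clinical label.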
Look at the groupings by vital status, stage, and race:
> table(meta$patient.vital_status)
alive dead
358 158
> table(meta$patient.stage_event.pathologic_stage)
i ii iii iv
254 55 124 83
> table(meta$patient.race)
asian black or african american white
8 56 445
> dat = data.frame(gene = exprSet["hsa-mir-143",],
+ vital_status = meta$patient.vital_status,
+ stage = meta$patient.stage_event.pathologic_stage,
+ race = meta$patient.race)
> p1 = ggbetweenstats(data = dat, x = vital_status, y = gene,title = "hsa-mir-143")
> p1
> p2 = ggbetweenstats(data = dat, x = stage, y = gene,title = "hsa-mir-143")
> p2
> p3 = ggbetweenstats(data = dat, x = race, y = gene,title = "hsa-mir-143")
> p3
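For a grouping with more than two levels, such as stage, it can be handy to pull an overall p-value out as a number rather than read it off the plot. The sketch below swaps in a Kruskal-Wallis test followed by pairwise Wilcoxon tests from base R; this is an alternative chosen for illustration, not necessarily the test that ggbetweenstats reports, and the data are made up.

```r
# Toy data: one expression value per sample, three stage groups (made-up values)
dat_toy <- data.frame(
  gene  = c(1.0, 1.2, 0.9, 1.1,
            2.0, 2.1, 1.9, 2.2,
            3.0, 3.1, 2.9, 3.2),
  stage = factor(rep(c("i", "ii", "iii"), each = 4))
)

# Overall test: does expression differ in at least one stage?
kw <- kruskal.test(gene ~ stage, data = dat_toy)
kw$p.value

# Follow-up: which pairs of stages differ, with BH correction across pairs
pw <- pairwise.wilcox.test(dat_toy$gene, dat_toy$stage, p.adjust.method = "BH")
pw$p.value
```

The overall test answers whether stage matters at all; the pairwise table is only worth reading when the overall test suggests a difference.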
4. Compare the expression of a miRNA grouped by whether a given gene is mutated
> dim(exprSet)
[1] 552 516
> head(mut)
Hugo_Symbol Chromosome Start_Position Tumor_Sample_Barcode t_vaf
1: HNRNPCL2 chr1 13115853 TCGA-G6-A8L7-01A-11D-A36X-10 0.2148148
2: ERMAP chr1 42842993 TCGA-G6-A8L7-01A-11D-A36X-10 0.1650165
3: FAAH chr1 46394349 TCGA-G6-A8L7-01A-11D-A36X-10 0.3114754
4: EPS15 chr1 51448116 TCGA-G6-A8L7-01A-11D-A36X-10 0.1677852
5: HMGCS2 chr1 119764248 TCGA-G6-A8L7-01A-11D-A36X-10 0.2539683
6: NOS1AP chr1 162367063 TCGA-G6-A8L7-01A-11D-A36X-10 0.2098765
pos
1: chr1:13115853
2: chr1:42842993
3: chr1:46394349
4: chr1:51448116
5: chr1:119764248
6: chr1:162367063
> library(stringr)
> length(unique(str_sub(mut$Tumor_Sample_Barcode,1,12)))
[1] 336
> k = str_sub(colnames(exprSet),1,12) %in% unique(str_sub(mut$Tumor_Sample_Barcode,1,12));table(k)
k
FALSE TRUE
185 331
>
> # Of the 516 samples, 331 have mutation records; extract the expression matrix for those samples.
> expm = exprSet[,k]
>
> VHL_mut=str_sub(as.character(
+ as.data.frame( mut[mut$Hugo_Symbol=='VHL','Tumor_Sample_Barcode'])[,1] ),
+ 1,12)
> # The dplyr pipeline that follows is an equivalent way to build VHL_mut
> library(dplyr)
> VHL_mut = mut %>%
+ filter(Hugo_Symbol=='VHL') %>%
+ as.data.frame() %>%
+ pull(Tumor_Sample_Barcode) %>%
+ as.character() %>%
+ str_sub(1,12)
>
> # FALSE marks samples without a VHL mutation; TRUE marks samples with one
>
> tail(rownames(expm))
[1] "hsa-mir-944" "hsa-mir-95" "hsa-mir-96" "hsa-mir-98" "hsa-mir-99a"
[6] "hsa-mir-99b"
> dat=data.frame(gene=expm['hsa-mir-98',],
+ mut= str_sub(colnames(expm),1,12) %in% VHL_mut)
>
> ggbetweenstats(data = dat, x = mut, y = gene)
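The reason for subsetting to samples with mutation records before labelling them is that a sample absent from the mutation table is unknown, not wild type. The toy sketch below makes that distinction explicit with base R; all barcodes and the names `mut_toy`, `expr_cols`, and `status` are hypothetical.

```r
# Toy mutation table (made-up rows) with the two columns used in this step
mut_toy <- data.frame(
  Hugo_Symbol = c("VHL", "TP53", "VHL", "PBRM1"),
  Tumor_Sample_Barcode = c("TCGA-AA-0001-01A-11D-0000-10",
                           "TCGA-AA-0002-01A-11D-0000-10",
                           "TCGA-AA-0003-01A-11D-0000-10",
                           "TCGA-AA-0003-01A-11D-0000-10"),
  stringsAsFactors = FALSE
)

# Toy expression column names; patient 0004 has no mutation record at all
expr_cols <- c("TCGA-AA-0001-01A-11R-0000-13", "TCGA-AA-0002-01A-11R-0000-13",
               "TCGA-AA-0003-01A-11R-0000-13", "TCGA-AA-0004-01A-11R-0000-13")

# Match on the 12-character patient ID, since the full barcodes differ by assay
patient      <- substr(expr_cols, 1, 12)
profiled     <- unique(substr(mut_toy$Tumor_Sample_Barcode, 1, 12))
vhl_patients <- unique(substr(
  mut_toy$Tumor_Sample_Barcode[mut_toy$Hugo_Symbol == "VHL"], 1, 12))

# TRUE = VHL mutated, FALSE = profiled without a VHL record, NA = not profiled
status <- ifelse(!patient %in% profiled, NA, patient %in% vhl_patients)
data.frame(patient, status)
```

Dropping the NA rows here corresponds to the `expm = exprSet[,k]` step in the transcript.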
5. Compute the p-value for each gene
Check whether the difference is significant, here for hsa-mir-98.
> # Two-sample t-test (Welch by default) of expression by VHL mutation status
> res.t <- t.test(gene ~ as.factor(mut), data = dat)
> res.t$p.value
[1] 0.9086252
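To get a p-value for every miRNA rather than one, the same t-test can be applied row by row. The following is a minimal sketch on simulated data, not TCGA results; `expr_toy` and `is_mut` are hypothetical stand-ins for `expm` and the mutation flag, and the Benjamini-Hochberg adjustment is an addition that the original single-gene test does not need.

```r
set.seed(1)  # simulated toy data only

n_mut <- 10
n_wt  <- 20
expr_toy <- matrix(rnorm(5 * (n_mut + n_wt)), nrow = 5,
                   dimnames = list(paste0("mir-", 1:5), NULL))
is_mut <- rep(c(TRUE, FALSE), times = c(n_mut, n_wt))

# Plant a clear shift in the first miRNA so that it should stand out
expr_toy["mir-1", is_mut] <- expr_toy["mir-1", is_mut] + 3

# One Welch t-test per row (miRNA), mutated vs non-mutated samples
p_raw <- apply(expr_toy, 1, function(x) t.test(x ~ is_mut)$p.value)

# Adjust for testing many miRNAs at once
p_adj <- p.adjust(p_raw, method = "BH")

res <- data.frame(p.value = p_raw, p.adj = p_adj)
res[order(res$p.value), ]
```

With hundreds of miRNAs tested, some raw p-values fall below 0.05 by chance alone, which is why the adjusted column is the safer one to filter on.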