Transcriptome Analysis in Practice, Part 4: Checking Technical and Biological Replicates

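The gene expression matrix built in the previous step gives the expression level of every gene. Replicates of the same condition (technical or biological) should, in theory, show consistent expression levels, so comparing them lets us check whether the replicate data are sound.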

The Trinity toolkit ships scripts for checking consistency between replicates. This post runs those checks with them.

Two inputs are needed before starting:

1. The gene expression counts.matrix file
2. A table file describing the biological replicates
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/downstr/RSEMout/RSEMout$ l *counts.matrix
RSEM.gene.counts.matrix  RSEM.isoform.counts.matrix
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/downstr/RSEMout/RSEMout$ cat samples.txt 
B25 B251
B25 B252
R25 R251
R25 R252
W25 W251
W25 W252
Note: the replicate names in samples.txt must match the column names in the matrix exactly, otherwise they cannot be recognized.
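The name-matching requirement can be checked before running anything. The following is a minimal sketch, not part of Trinity: the function name `missing_replicates` is hypothetical, and it assumes that the first line of the matrix is a tab-separated header of replicate names and that the second column of samples.txt holds the replicate name.

```shell
# missing_replicates MATRIX SAMPLES  (hypothetical helper, not a Trinity script)
# Prints every replicate name from SAMPLES that is absent from the MATRIX header.
missing_replicates() {
    matrix=$1
    samples=$2
    # Assumption: line 1 of the matrix is a tab-separated list of column names.
    header=$(head -n 1 "$matrix" | tr '\t' '\n')
    # Assumption: column 2 of the samples file is the replicate name.
    awk '{ print $2 }' "$samples" | while read -r rep; do
        [ -z "$rep" ] && continue
        # -x: whole-line match, -F: fixed string, so B25 does not match B251.
        printf '%s\n' "$header" | grep -qxF -- "$rep" || printf '%s\n' "$rep"
    done
}
```

Calling `missing_replicates RSEM.isoform.counts.matrix samples.txt` prints nothing when every replicate name is found in the header.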
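# Call the PtR script. Options used:
#   --matrix              the input matrix
#   --samples             the replicate table
#   --log2                log2-transform the values
#   --min_rowSums 10      row filter: minimum sum of counts per row
#   --compare_replicates  which plots to produce
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout$ $TRINITY_HOME/Analysis/DifferentialExpression/PtR \
--matrix RSEM.isoform.counts.matrix \
--samples samples.txt \
--log2 \
--min_rowSums 10 \
--compare_replicates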
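To make the effect of `--min_rowSums` and `--log2` concrete, here is a small awk sketch of the same two steps on a tab-separated count matrix. It is an illustration only: the function name `filter_and_log2` is hypothetical, and the `+ 1` pseudocount before the logarithm is an assumption made here, not something taken from the PtR source.

```shell
# filter_and_log2 MATRIX MIN_SUM  (hypothetical helper, not a Trinity script)
# Keeps the header, drops rows whose counts sum to less than MIN_SUM,
# and prints log2(count + 1) for the remaining rows.
filter_and_log2() {
    awk -F'\t' -v OFS='\t' -v min="$2" '
        NR == 1 { print; next }          # header line passes through unchanged
        {
            sum = 0
            for (i = 2; i <= NF; i++) sum += $i
            if (sum < min) next          # row filter, in the spirit of --min_rowSums
            # Assumption: a pseudocount of 1 keeps zero counts finite.
            for (i = 2; i <= NF; i++) $i = sprintf("%.3f", log($i + 1) / log(2))
            print
        }
    ' "$1"
}
```

For example, `filter_and_log2 RSEM.isoform.counts.matrix 10 | head` previews the transformed values without touching the original file.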
For reference, here is the script's help text:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~$ $TRINITY_HOME/Analysis/DifferentialExpression/PtR --help

#################################################################################### 
#
#######################
# Inputs and Outputs: #
#######################
#
#  --matrix <string>        matrix.RAW.normalized.FPKM
#
#  Optional:
#
#  Sample groupings:
#
#  --samples <string>      tab-delimited text file indicating biological replicate relationships.
#                                   ex.
#                                        cond_A    cond_A_rep1
#                                        cond_A    cond_A_rep2
#                                        cond_B    cond_B_rep1
#                                        cond_B    cond_B_rep2
#
#  --gene_factors <string>   tab-delimited file containing gene-to-factor relationships.
#                               ex.
#                                    liver_enriched <tab> gene1
#                                    heart_enriched <tab> gene2
#                                    ...
#                            (use of this data in plotting is noted for corresponding plotting options)
#
#
#  --output <string>        prefix for output file (default: "${matrix_file}.heatmap")
#
#  --save                   save R session (as .RData file)
#  --no_reuse               do not reuse any existing .RData file on initial loading
#
#####################
#  Plotting Actions #
#####################
#
#  --compare_replicates        provide scatter, MA, QQ, and correlation plots to compare replicates.
#
#   
#
#  --barplot_sum_counts        generate a barplot that sums frag counts per replicate across all samples.
#
#  --boxplot_log2_dist <float>        generate a boxplot showing the log2 dist of counts where counts >= min fpkm
#
#  --sample_cor_matrix         generate a sample correlation matrix plot
#    --sample_cor_scale_limits <string>    ex. "-0.2,0.6"
#    --sample_cor_sum_gene_factor_expr <factor=string>    instead of plotting the correlation value, plot the sum of expr according to gene factor
#                                                         requires --gene_factors 
#
#  --sample_cor_subset_matrix <string>  plot the sample correlation matrix, but create a disjoint set for rows,cols.
#                                       The subset of the samples to provide as the columns is provided as parameter.
#
#  --gene_cor_matrix           generate a gene-level correlation matrix plot
#
#  --indiv_gene_cor <string>   generate a correlation matrix and heatmaps for '--top_cor_gene_count' to specified genes (comma-delimited list)
#      --top_cor_gene_count <int>   (requires '--indiv_gene_cor with gene identifier specified')
#      --min_gene_cor_val <float>   (requires '--indiv_gene_cor with gene identifier specified')
#
#  --heatmap                   genes vs. samples heatmap plot
#      --heatmap_scale_limits "<int,int>"  cap scale intensity to low,high  (ie.  "-5,5")
#      --heatmap_colorscheme <string>  default is 'purple,black,yellow'
#                                      a popular alternative is 'green,black,red'
#                                      Specify a two-color gradient like so: "black,yellow".
#
#     # sample (column) labeling order
#      --lexical_column_ordering        order samples by column name lexical order.
#      --specified_column_ordering <string>  comma-delimited list of column names (must match matrix exactly!)
#      --order_columns_by_samples_file  order the columns in the heatmap according to replicate name ordering in the samples file.
#
#     # gene (row) labeling order
#      --order_by_gene_factor           order the genes by their factor (given --gene_factors)
#
#  --gene_heatmaps <string>    generate heatmaps for just one or more specified genes
#                              Requires a comma-delimited list of gene identifiers.
#                              Plots one heatmap containing all specified genes, then separate heatmaps for each gene.
#                                 if --gene_factors set, will include factor annotations as color panel.
#                                 else if --prin_comp set, will include include principal component color panel.
#
#  --prin_comp <int>           generate principal components, include <int> top components in heatmap  
#      --add_prin_comp_heatmaps <int>  draw heatmaps for the top <int> features at each end of the prin. comp. axis.
#                                      (requires '--prin_comp') 
#      --add_top_loadings_pc_heatmap <int>  draw a heatmap containing the <int> top feature loadings across all PCs.
#      --R_prin_comp_method <string>        options: princomp, prcomp (default: prcomp)
#
#  --mean_vs_sd               expression variability plot. (highlight specific genes by category via --gene_factors )
#
#  --var_vs_count_hist <vartype=string>        create histogram of counts of samples having feature expressed within a given expression bin.
#                                              vartype can be any of 'sd|var|cv|fano'
#      --count_hist_num_bins <int>  number of bins to distribute counts in the histogram (default: 10)
#      --count_hist_max_expr <float>  maximum value for the expression histogram (default: max(data))
#      --count_hist_convert_percentages       convert the histogram counts to percentage values.
#
#
#  --per_gene_plots                   plot each gene as a separate expression plot (barplot or lineplot)
#    --per_gene_plot_width <float>     default: 2.5
#    --per_gene_plot_height <float>    default: 2.5
#    --per_gene_plots_per_row <int>   default: 1
#    --per_gene_plots_per_col <int>   default: 2
#    --per_gene_plots_incl_vioplot    include violin plots to show distribution of rep vals
#
########################################################
#  Data Filtering, in order of operation below:  #########################################################
#
#
## Column filters:
#
#  --restrict_samples <string>   comma-delimited list of samples to restrict to (comma-delim list)
#
#  --top_rows <int>         only include the top number of rows in the matrix, as ordered.
#
#  --min_colSums <float>      min number of fragments, default: 0
#
#  --min_expressed_genes <int>           minimum number of genes (rows) for a column (replicate) having at least '--min_gene_expr_val'
#       --min_gene_expr_val <float>   a gene must be at least this value expressed across all samples.  (default: 1)
#
#
## Row Filters:
#
#  --min_rowSums <float>      min number of fragments, default: 0
#
#  --gene_grep <string>     grep on string to restrict to genes
#
#  --min_across_ALL_samples_gene_expr_val <int>   a gene must have this minimum expression value across ALL samples to be retained.
#
#  --min_across_ANY_samples_gene_expr_val <int>   a gene must have at least this expression value across ANY single sample to be retained.
#
#  --min_gene_prevalence <int>   gene must be found expressed in at least this number of columns
#       --min_gene_expr_val <float>   a gene must be at least this value expressed across all samples.  (default: 1)
#
#  --minValAltNA <float>    minimum cell value after above transformations, otherwise convert to NA
#
#  --top_genes <int>        use only the top number of most highly expressed transcripts
#
#  --top_variable_genes <int>      Restrict to the those genes with highest coeff. of variability across samples (use median of replicates)
#
#      --var_gene_method <string>   method for ranking top variable genes ( 'coeffvar|anova', default: 'anova' )
#           --anova_maxFDR <float>    if anova chose, require FDR value <= anova_maxFDR  (default: 0.05)
#            or
#           --anova_maxP <float>    if set, over-rides anova_maxQ  (default, off, uses --anova_maxQ)
#
#  --top_variable_via_stdev_and_mean_expr    perform filtering based on the stdev vs. mean expression plot.
#      Requires both:               (note, if you used --log2 and/or --Zscale, settings below should use those transformed values)
#         --min_stdev_expr <float>       minimum standard deviation in expression
#         --min_mean_expr  <float>       minimum mean expression value 
#
######################################
#  Data transformations:             #
######################################
#
#  --CPM                    convert to counts per million (uses sum of totals before filtering)
#  --CPK                    convert to counts per thousand
#
#  --binary                 all values > 0 are set to 1.  All values < 0 are set to zero.
#
#  --log2
#
#  --center_rows            subtract row mean from each data point. (only used under '--heatmap' )
#
#  --Zscale_rows            Z-scale the values across the rows (genes)  
#
#########################
#  Clustering methods:  #
#########################
#
#  --gene_dist <string>        Setting used for --heatmap (samples vs. genes)
#                                  Options: euclidean, gene_cor
#                                           maximum, manhattan, canberra, binary, minkowski
#                                  (default: 'euclidean')  Note: if using 'gene_cor', set method using '--gene_cor' below.
#
#
#  --sample_dist <string>      Setting used for --heatmap (samples vs. genes)
#                                  Options: euclidean, sample_cor
#                                           maximum, manhattan, canberra, binary, minkowski
#                                  (default: 'euclidean')  Note: if using 'sample_cor', set method using '--sample_cor' below.
#
#
#  --gene_clust <string>       ward, single, complete, average, mcquitty, median, centroid, none (default: complete)
#  --sample_clust <string>     ward, single, complete, average, mcquitty, median, centroid, none (default: complete)
#
#  --gene_cor <string>             Options: pearson, spearman  (default: pearson)
#  --sample_cor <string>           Options: pearson, spearman  (default: pearson)
#
####################
#  Image settings: #
####################
#
#
#  --imgfmt <string>           image type (pdf,svg) with default: pdf
#
#  --img_width <int>           image width
#  --img_height <int>          image height
#
################
# Misc. params #
################
#
#  --write_intermediate_data_tables         writes out the data table after each transformation.
#
#  --show_pipeline_flowchart                describe order of events and exit.
#
####################################################################################
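Running this may fail with an error, though, because the required R packages are not installed locally. Go back and install them; some are on Bioconductor and others on CRAN. The R commands are: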
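# biocLite() has been retired; BiocManager is the current Bioconductor installer
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biobase")   # from Bioconductor
installed.packages()              # list what is already installed
BiocManager::install("qvalue")    # from Bioconductor
help(package='qvalue')
install.packages('fastcluster')   # from CRAN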
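The result is a set of plots, collected in one PDF, showing the correlation between the biological replicates within a treatment.
I will explain the plots when I get the chance (I do not yet know how to interpret them myself). Here they are: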
image 1 .png

image 2 .png

image 3 .png

image 4 .png
Any explanation of these plots is very welcome. I may also come back and improve the interpretation as I learn more.

=======================

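Next, check and plot the correlation across samples.
# Options used:
#   --samples            the replicate table
#   --log2, --CPM        data transformations
#   --sample_cor_matrix  draw the sample correlation matrix plot
$TRINITY_HOME/Analysis/DifferentialExpression/PtR \
          --matrix RSEM.isoform.counts.matrix \
          --min_rowSums 10 \
          --samples samples.txt \
          --log2 \
          --CPM \
          --sample_cor_matrix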
This command produces a heatmap of how consistent the data are between samples.
image 5 .png
The heatmap reflects the consistency between treatments and among the replicates within each treatment.
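One way to picture what a single cell of such a heatmap represents is to compute the correlation between two replicate columns by hand. The sketch below is not how PtR is implemented: the function name `replicate_cor` is hypothetical, the log2(CPM + 1) transform with its pseudocount is an assumption, and Pearson is used because the help text lists it as the default for `--sample_cor`.

```shell
# replicate_cor MATRIX COL_A COL_B  (hypothetical helper, not a Trinity script)
# Prints the Pearson correlation of log2(CPM + 1) values between two columns.
# Column numbers are 1-based and count the gene-name column as column 1.
replicate_cor() {
    awk -F'\t' -v a="$2" -v b="$3" '
        NR == 1 { next }                                  # skip the header line
        { x[NR] = $a; y[NR] = $b; ta += $a; tb += $b }    # keep counts, sum library sizes
        END {
            if (ta == 0 || tb == 0) { print "NA"; exit }
            for (r in x) {
                # Assumption: counts per million, then log2 with a pseudocount of 1.
                u = log(x[r] / ta * 1e6 + 1) / log(2)
                v = log(y[r] / tb * 1e6 + 1) / log(2)
                n++; su += u; sv += v; suu += u * u; svv += v * v; suv += u * v
            }
            den = sqrt((n * suu - su * su) * (n * svv - sv * sv))
            if (den > 0) printf "%.4f\n", (n * suv - su * sv) / den
            else print "NA"
        }
    ' "$1"
}
```

With the six-replicate matrix used here, `replicate_cor RSEM.isoform.counts.matrix 2 3` would compare the first two count columns. Check the header first to see which replicates those are.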

======================

The last check uses PCA to examine the replicate relationships among the samples.
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout$ $TRINITY_HOME/Analysis/DifferentialExpression/PtR \
--matrix RSEM.isoform.counts.matrix \
--samples samples.txt \
--log2 \
--min_rowSums 10 \
--CPM \
--center_rows \
--prin_comp 3
The output is a PCA plot (which I cannot interpret either).
PCA Plot
I will come back and interpret it when I get the chance.
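The `--center_rows` option in the PCA command is described in the help text as subtracting the row mean from each data point. A minimal awk sketch of that one operation is shown below, under the assumption of a tab-separated matrix with a header line. The function name `center_rows` is hypothetical and this is not the PtR code itself.

```shell
# center_rows MATRIX  (hypothetical helper, not a Trinity script)
# Subtracts each row's mean from every value in that row.
center_rows() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { print; next }          # header line passes through unchanged
        NF < 2  { next }                 # skip rows that have no value columns
        {
            sum = 0
            for (i = 2; i <= NF; i++) sum += $i
            mean = sum / (NF - 1)
            for (i = 2; i <= NF; i++) $i = sprintf("%.3f", $i - mean)
            print
        }
    ' "$1"
}
```

After centering, each row sums to roughly zero, so a gene's overall expression level no longer dominates and only its differences between samples remain.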

The main point is that I cannot interpret these plots, so any guidance is welcome!
