count normalization的代码

在前面的count normalization的方法部分,我初步介绍了一些常见的count normalization的方法,这部分我会写出对应的代码。用的数据集是我自己的一个数据。

欢迎大家用自己数据集检查代码是否有错误

# 总共是37336个基因
> head(data)
          WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2
AT1G01010       15       15      152      106
AT1G01020      353      461      255      279
AT1G03987        1        0        3        0
AT1G01030       23       39       74       56
AT1G01040      695      746     1831     1611
AT1G03993        0        0        0        0

> str(data)
'data.frame':   37336 obs. of  4 variables:
 $ WT_0h_R1: int  15 353 1 23 695 0 412 0 64 13 ...
 $ WT_0h_R2: int  15 461 0 39 746 0 541 0 118 7 ...
 $ WT_4h_R1: int  152 255 3 74 1831 0 517 0 349 0 ...
 $ WT_4h_R2: int  106 279 0 56 1611 0 512 0 629 0 ...

CPM

公式
CPM_i=\frac{X_i}{N}10^6

CPM其实是最简单的,就是你对应基因的count除以你样本的总count,从而得到你对应基因在对应样本里面的百分比。

data_CPM <- (sweep(data, 2, colSums(data), FUN = "/"))*1e6

> head(data_CPM)
             WT_0h_R1   WT_0h_R2   WT_4h_R1  WT_4h_R2
AT1G01010  0.36330819  0.3272056  7.0490344  5.413802
AT1G01020  8.54985270 10.0561188 11.8256828 14.249536
AT1G03987  0.02422055  0.0000000  0.1391257  0.000000
AT1G01030  0.55707256  0.8507346  3.4317668  2.860122
AT1G01040 16.83327940 16.2730252 84.9130397 82.279578
AT1G03993  0.00000000  0.0000000  0.0000000  0.000000

我们可以来检查下

> colSums(data_CPM)
WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2 
   1e+06    1e+06    1e+06    1e+06 

> data[2,2] / sum(data[, 2]) * 1e6
[1] 10.05612

FPKM

公式:
FPKM_i=\frac{\frac{X_i}{l_i}}{N}10^9FPKM_i=\frac{\frac{X_i}{l_i}}{N}10^9
这里就要涉及到基因长度的问题了,我在之前的方法部分也提到过,不同的算法对于长度的界定是不一样的,我这里用的基因长度是直接把外显子overlap,然后去掉冗余部分,得到累加之后的长度。因为我count是利用featureCount得到的,用的GTF是Araport11的GTF,所以我直接用GenomicFeatures包来得到外显子去冗余累加之后的长度。

# 先得到外显子长度
library(GenomicFeatures)
Tx_At <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf",
                         chrominfo = Seqinfo(seqnames = paste0("Chr",c(1:5,"C","M")),
                                             seqlengths = c(30427671,19698289,23459830,18585056,26975502,154478,366924)))
# 会报一些warning
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of
  type stop_codon, exon. This information was ignored.
2: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have stop
  codons that cannot be mapped to an exon: AT1G07320.4, AT1G18180.2,
  AT1G30230.1, AT1G36380.1, AT1G52415.1, AT1G52940.1, AT2G18110.1,
  AT2G35130.1, AT2G35130.3, AT2G39050.1, AT3G13445.1, AT3G17660.1,
  AT3G17660.3, AT3G52070.2, AT3G54680.1, AT3G59450.2, AT4G13730.2,
  AT4G17730.1, AT4G39620.2, AT5G01520.2, AT5G22794.1, AT5G27710.2,
  AT5G45240.2

# 然后发现少6个基因
> genes(Tx_At)
GRanges object with 37330 ranges and 1 metadata column:
            seqnames        ranges strand |     gene_id
               <Rle>     <IRanges>  <Rle> | <character>
  AT1G01010     Chr1     3631-5899      + |   AT1G01010
  AT1G01020     Chr1     6788-9130      - |   AT1G01020
  AT1G01030     Chr1   11649-13714      - |   AT1G01030
  AT1G01040     Chr1   23121-31227      + |   AT1G01040
  AT1G01050     Chr1   31170-33171      - |   AT1G01050
        ...      ...           ...    ... .         ...
  ATMG09730     ChrM 181268-181347      + |   ATMG09730
  ATMG09740     ChrM 191954-192025      - |   ATMG09740
  ATMG09950     ChrM 333651-333725      - |   ATMG09950
  ATMG09960     ChrM 347266-347348      + |   ATMG09960
  ATMG09980     ChrM 359666-359739      - |   ATMG09980
  -------
  seqinfo: 7 sequences from an unspecified genome

# 我们可以看下缺失的6个基因,刚好就是报warning的那几个基因
> rownames(data)[!rownames(data) %in% names(genes(Tx_At)) ]
[1] "AT1G36380" "AT1G52415" "AT1G52940" "AT2G18110" "AT2G39050" "AT3G54680"
> gene_notin <- rownames(data)[!rownames(data) %in% names(genes(Tx_At)) ]
> data[gene_notin, ]
          WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2
AT1G36380       75       82      140      179
AT1G52415        0        0        0        0
AT1G52940        0        0        0        0
AT2G18110      522      488      622      686
AT2G39050       26       31      133      182
AT3G54680     1236     1777      727      728

之所以会不包括这6个基因,应该是这6个基因的stop condon部分并不在exon里面的缘故……

后来我发现gff并不会报这个错

Tx_At <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gff",
                         chrominfo = Seqinfo(seqnames = paste0("Chr",c(1:5,"C","M")),
                                             seqlengths = c(30427671,19698289,23459830,18585056,26975502,154478,366924)))

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type exon. This
  information was ignored.
> genes(Tx_At)
GRanges object with 38194 ranges and 1 metadata column:
            seqnames            ranges strand |     gene_id
               <Rle>         <IRanges>  <Rle> | <character>
  AT1G01010     Chr1         3631-5899      + |   AT1G01010
  AT1G01020     Chr1         6788-9130      - |   AT1G01020
  AT1G01030     Chr1       11649-13714      - |   AT1G01030
  AT1G01040     Chr1       23121-31227      + |   AT1G01040
  AT1G01046     Chr1       28500-28706      + |   AT1G01046
        ...      ...               ...    ... .         ...
    MIR854c     Chr5 11838096-11838316      + |     MIR854c
    MIR854d     Chr5 11689861-11690081      - |     MIR854d
    MIR854e     Chr5 11942467-11942687      + |     MIR854e
     MIR855     Chr2   4674427-4674698      + |      MIR855
    MIR858b     Chr1 26772633-26772935      + |     MIR858b
  -------
  seqinfo: 7 sequences from an unspecified genome

> rownames(data)[!rownames(data) %in% names(genes(Tx_At)) ]
character(0)

然后我们就可以得到对应基因的外显子信息

> exons_gene <- exonsBy(Tx_At, by = "gene")
> exons_gene
GRangesList object of length 38194:
$AT1G01010
GRanges object with 6 ranges and 2 metadata columns:
      seqnames    ranges strand |   exon_id     exon_name
         <Rle> <IRanges>  <Rle> | <integer>   <character>
  [1]     Chr1 3631-3913      + |         1 NAC001:exon:1
  [2]     Chr1 3996-4276      + |         2 NAC001:exon:2
  [3]     Chr1 4486-4605      + |         3 NAC001:exon:3
  [4]     Chr1 4706-5095      + |         4 NAC001:exon:4
  [5]     Chr1 5174-5326      + |         5 NAC001:exon:5
  [6]     Chr1 5439-5899      + |         6 NAC001:exon:6
  -------
  seqinfo: 7 sequences from an unspecified genome

$AT1G01020
GRanges object with 14 ranges and 2 metadata columns:
       seqnames    ranges strand |   exon_id    exon_name
          <Rle> <IRanges>  <Rle> | <integer>  <character>
   [1]     Chr1 6788-7069      - |     26363 ARV1:exon:14
   [2]     Chr1 7157-7232      - |     26364 ARV1:exon:13
   [3]     Chr1 7157-7450      - |     26365 ARV1:exon:12
   [4]     Chr1 7384-7450      - |     26366 ARV1:exon:11
   [5]     Chr1 7564-7649      - |     26367 ARV1:exon:10
   ...      ...       ...    ... .       ...          ...
  [10]     Chr1 8417-8464      - |     26372  ARV1:exon:5
  [11]     Chr1 8571-8737      - |     26373  ARV1:exon:4
  [12]     Chr1 8571-9130      - |     26374  ARV1:exon:3
  [13]     Chr1 8594-8737      - |     26375  ARV1:exon:2
  [14]     Chr1 8594-9130      - |     26376  ARV1:exon:1
  -------
  seqinfo: 7 sequences from an unspecified genome

...
<38192 more elements>

# 我们只用到37336个基因的外显子信息
> exons_gen_expr <- exons_gene[rownames(data)]
> exons_gen_expr
GRangesList object of length 37336:
$AT1G01010
GRanges object with 6 ranges and 2 metadata columns:
      seqnames    ranges strand |   exon_id     exon_name
         <Rle> <IRanges>  <Rle> | <integer>   <character>
  [1]     Chr1 3631-3913      + |         1 NAC001:exon:1
  [2]     Chr1 3996-4276      + |         2 NAC001:exon:2
  [3]     Chr1 4486-4605      + |         3 NAC001:exon:3
  [4]     Chr1 4706-5095      + |         4 NAC001:exon:4
  [5]     Chr1 5174-5326      + |         5 NAC001:exon:5
  [6]     Chr1 5439-5899      + |         6 NAC001:exon:6
  -------
  seqinfo: 7 sequences from an unspecified genome

$AT1G01020
GRanges object with 14 ranges and 2 metadata columns:
       seqnames    ranges strand |   exon_id    exon_name
          <Rle> <IRanges>  <Rle> | <integer>  <character>
   [1]     Chr1 6788-7069      - |     26363 ARV1:exon:14
   [2]     Chr1 7157-7232      - |     26364 ARV1:exon:13
   [3]     Chr1 7157-7450      - |     26365 ARV1:exon:12
   [4]     Chr1 7384-7450      - |     26366 ARV1:exon:11
   [5]     Chr1 7564-7649      - |     26367 ARV1:exon:10
   ...      ...       ...    ... .       ...          ...
  [10]     Chr1 8417-8464      - |     26372  ARV1:exon:5
  [11]     Chr1 8571-8737      - |     26373  ARV1:exon:4
  [12]     Chr1 8571-9130      - |     26374  ARV1:exon:3
  [13]     Chr1 8594-8737      - |     26375  ARV1:exon:2
  [14]     Chr1 8594-9130      - |     26376  ARV1:exon:1
  -------
  seqinfo: 7 sequences from an unspecified genome

...
<37334 more elements>

然后我们做去冗余,统计长度等操作

# reduce去冗余
> reduce(exons_gen_expr)
GRangesList object of length 37336:
$AT1G01010
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     Chr1 3631-3913      +
  [2]     Chr1 3996-4276      +
  [3]     Chr1 4486-4605      +
  [4]     Chr1 4706-5095      +
  [5]     Chr1 5174-5326      +
  [6]     Chr1 5439-5899      +
  -------
  seqinfo: 7 sequences from an unspecified genome

$AT1G01020
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     Chr1 6788-7069      -
  [2]     Chr1 7157-7450      -
  [3]     Chr1 7564-7649      -
  [4]     Chr1 7762-7835      -
  [5]     Chr1 7942-7987      -
  [6]     Chr1 8236-8464      -
  [7]     Chr1 8571-9130      -
  -------
  seqinfo: 7 sequences from an unspecified genome

$AT1G03987
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     Chr1 11101-11372      +
  -------
  seqinfo: 7 sequences from an unspecified genome

...
<37333 more elements>

# width统计长度
> width(reduce(exons_gen_expr))
IntegerList of length 37336
[["AT1G01010"]] 283 281 120 390 153 461
[["AT1G01020"]] 282 294 86 74 46 229 560
[["AT1G03987"]] 272
[["AT1G01030"]] 1525 380
[["AT1G01040"]] 1331 114 211 395 220 173 123 161 234 151 183 165 96 629 98 191 906 165 407 326
[["AT1G03993"]] 788
[["AT1G01050"]] 255 82 121 66 108 66 29 124 171 143
[["AT1G03997"]] 283
[["AT1G01060"]] 225 666 1074 81 270 82 62 112 849
[["AT1G01070"]] 611 152 406 117 545
...
<37326 more elements>

# sum下
> gene_exon_length <- sum(width(reduce(exons_gen_expr)))
> head(gene_exon_length)
AT1G01010 AT1G01020 AT1G03987 AT1G01030 AT1G01040 AT1G03993 
     1688      1571       272      1905      6279       788 

# 然后再看看基因的顺序和我们data上的基因顺序是不是一致的
# 是一样的
> mean(names(gene_exon_length) == rownames(data))
[1] 1

参考文章

这样我们就可以计算了

# 先除以长度
> data_count_length <- sweep(data, 2, gene_exon_length , FUN = "/")
> head(data_count_length)
             WT_0h_R1    WT_0h_R2   WT_4h_R1   WT_4h_R2
AT1G01010 0.008886256 0.008886256 0.09004739 0.06279621
AT1G01020 0.224697645 0.293443666 0.16231700 0.17759389
AT1G03987 0.003676471 0.000000000 0.01102941 0.00000000
AT1G01030 0.012073491 0.020472441 0.03884514 0.02939633
AT1G01040 0.110686415 0.118808728 0.29160694 0.25656952
AT1G03993 0.000000000 0.000000000 0.00000000 0.00000000

# 再除以文库的总count
> data_FRPKM <- sweep(data_count_length, 2, colSums(data), FUN = "/")*1e9
> head(data_FRPKM)
            WT_0h_R1  WT_0h_R2   WT_4h_R1  WT_4h_R2
AT1G01010 0.21522997 0.1938422  4.1759683  3.207229
AT1G01020 5.44229962 6.4010941  7.5274874  9.070360
AT1G03987 0.08904612 0.0000000  0.5114915  0.000000
AT1G01030 0.29242654 0.4465798  1.8014524  1.501376
AT1G01040 2.68088540 2.5916587 13.5233381 13.103930
AT1G03993 0.00000000 0.0000000  0.0000000  0.000000

检查下

# 我们会发现各个文库之间的FPKM总和是不一样的,这也是FPKM受到人质疑的原因
> colSums(data_FRPKM)
WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2 
752371.2 716130.4 616267.3 608064.9

# 我们检查一个基因
> rownames(data_FRPKM)[10]
[1] "AT1G01070"
> data_FRPKM[rownames(data_FRPKM)[10],]
  WT_0h_R1   WT_0h_R2   WT_4h_R1   WT_4h_R2 
0.17196455 0.08339484 0.00000000 0.00000000 

# 手动算一个基因的FPKM值
> gene_exon_length["AT1G01070"]
AT1G01070 
     1831 
> (data["AT1G01070",1] / 1831) / sum(data[, 1]) * 1e9
[1] 0.1719646

> (data["AT1G01070", ] / 1831) / colSums(data) * 1e9
           WT_0h_R1   WT_0h_R2 WT_4h_R1 WT_4h_R2
AT1G01070 0.1719646 0.08339484        0        0

TPM

公式:
TPM_i=\frac{\frac{X_i}{l_i}}{\sum_j\frac{X_j}{l_j}}10^6

> data_TPM <- sweep(data_count_length, 2, colSums(data_count_length), FUN = "/") * 1e6
> head(data_TPM)
           WT_0h_R1  WT_0h_R2   WT_4h_R1  WT_4h_R2
AT1G01010 0.2860689 0.2706800  6.7762290  5.274484
AT1G01020 7.2335303 8.9384480 12.2146472 14.916763
AT1G03987 0.1183540 0.0000000  0.8299832  0.000000
AT1G01030 0.3886732 0.6236013  2.9231673  2.469105
AT1G01040 3.5632485 3.6189762 21.9439494 21.550216
AT1G03993 0.0000000 0.0000000  0.0000000  0.000000
> colSums(data_TPM)
WT_0h_R1 WT_0h_R2 WT_4h_R1 WT_4h_R2 
   1e+06    1e+06    1e+06    1e+06 

DESeq2的norm count

library(DESeq2)

coldata <- data.frame(row.names = colnames(data), 
                      type = gsub("_R\\d+", "", colnames(data)))

> coldata
          type
WT_0h_R1 WT_0h
WT_0h_R2 WT_0h
WT_4h_R1 WT_4h
WT_4h_R2 WT_4h


dds <- DESeqDataSetFromMatrix(countData = data,
                              colData = coldata,
                              design= ~ type)

> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing


> resultsNames(dds)
[1] "Intercept"           "type_WT_4h_vs_WT_0h"

> res <- lfcShrink(dds, coef=2, type="apeglm")
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895


> data_DESeq2_norm <- DESeq2::counts(dds, normalized = TRUE)
> head(data_DESeq2_norm)
            WT_0h_R1  WT_0h_R2    WT_4h_R1   WT_4h_R2
AT1G01010  16.519753  14.86971  143.445091  100.86482
AT1G01020 388.764862 456.99580  240.648014  265.48381
AT1G03987   1.101317   0.00000    2.831153    0.00000
AT1G01030  25.330288  38.66125   69.835110   53.28707
AT1G01040 765.415238 739.52032 1727.947115 1532.95488
AT1G03993   0.000000   0.00000    0.000000    0.00000

还有个我没用到的链接
https://gist.github.com/slowkow/c6ab0348747f86e2748b

edgeR的TMM

不太会……不敢说……懒得写

下一部分我会探究不同的normaliztion的差异

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 210,914评论 6 490
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 89,935评论 2 383
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 156,531评论 0 345
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,309评论 1 282
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,381评论 5 384
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,730评论 1 289
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,882评论 3 404
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,643评论 0 266
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,095评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,448评论 2 325
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,566评论 1 339
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,253评论 4 328
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,829评论 3 312
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,715评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,945评论 1 264
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,248评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,440评论 2 348