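Typical length distribution of small RNA (sRNA) sequencing data:
1. Ranked by abundance: 24 nt > 21 nt > 22 nt > 23 nt
2. More than 60% of the reads are 20-24 nt long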
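Both rules are easy to check once per-length read counts are available. The following is only a minimal sketch: the helper name `check_length_rules` is hypothetical, and the counts in the usage example are made up for illustration, not real data.

```python
def check_length_rules(counts):
    """Check the two rules of thumb against a {length: read_count} mapping.

    Hypothetical helper, not part of any library.
    """
    total = sum(counts.values())
    # Rule 2: share of reads whose length is 20-24 nt (inclusive).
    in_range = sum(n for length, n in counts.items() if 20 <= length <= 24)
    fraction = in_range / total if total else 0.0
    # Rule 1: abundance order 24 nt > 21 nt > 22 nt > 23 nt.
    order_ok = (counts.get(24, 0) > counts.get(21, 0)
                > counts.get(22, 0) > counts.get(23, 0))
    return order_ok, fraction


# Made-up counts, for illustration only.
example = {19: 50, 20: 80, 21: 300, 22: 200, 23: 150, 24: 400, 25: 60}
order_ok, fraction = check_length_rules(example)
print(order_ok, round(fraction, 3))
```

A sample that returns `False` for the order, or a fraction well below 0.6, would be worth a closer look before further analysis.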
I. The data are in FASTA format. First, write a script that counts the number of sRNAs of each length
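#!/usr/bin/env python3
# Count how many sRNA sequences of each length a FASTA file contains.
import sys
import collections

lencount = collections.Counter()
# Assumes every sequence sits on a single line of the FASTA file.
with open(sys.argv[1], 'r') as inFile:
    for line in inFile:
        line = line.rstrip()
        # Skip header lines (starting with ">") and blank lines.
        if not line or line.startswith(">"):
            continue
        lencount[len(line)] += 1

# Write one "length<TAB>count" row per length, in ascending order of length.
with open('sRNA_count.csv', 'w') as outFile:
    for length in sorted(lencount):
        outFile.write(str(length) + "\t" + str(lencount[length]) + "\n")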
# Run the script
python sRNA_count.py sample.sRNA.data.fa
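The script above assumes that each sequence occupies a single line. If a FASTA file wraps sequences over several lines, the line lengths have to be summed per record instead. The following is a minimal sketch under that assumption; `count_lengths` is a hypothetical helper name and the two records are invented for the demonstration.

```python
import collections
import io


def count_lengths(handle):
    """Count sequence lengths in a FASTA stream whose records may be line-wrapped."""
    counts = collections.Counter()
    current = None  # length of the record being read; None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                counts[current] += 1
            current = 0
        elif line and current is not None:
            current += len(line)
    # Close the last record.
    if current is not None:
        counts[current] += 1
    return counts


# Tiny in-memory example; the second record is wrapped over two lines.
demo = io.StringIO(">a\nACGTACGTACGTACGTACGTA\n>b\nACGTACGTACGT\nACGTACGTACGT\n")
result = count_lengths(demo)
print(dict(result))
```

For a real file, the same function can be called on an open file object, for example `count_lengths(open(sys.argv[1]))`.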
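II. Manually split sRNA_count.csv into two columns and add a header row (the R code in step III expects the column names sRNA and NUM)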
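This manual step could also be scripted. The sketch below assumes the tab-separated "length, count" rows written by the script in step I and the column names sRNA and NUM used by the R code in step III. It rewrites the table as comma-separated text with a header. The function name `add_header` is hypothetical, and the rows in the demonstration are made up.

```python
import csv
import io


def add_header(src, dst):
    """Copy tab-separated length/count rows to comma-separated rows with a header."""
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["sRNA", "NUM"])
    for row in csv.reader(src, delimiter="\t"):
        if row:  # skip blank lines
            writer.writerow(row)


# In-memory demonstration with made-up rows.
src = io.StringIO("21\t300\n24\t400\n")
dst = io.StringIO()
add_header(src, dst)
print(dst.getvalue())
```

With real files, open both the input and a separate output file with `newline=""`, as the `csv` module documentation recommends, and pass the two file objects to `add_header`.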
III. Plot the length distribution with ggplot2 in R (a bar chart of the precomputed counts)
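# gcookbook only supplies example datasets; the plots below need ggplot2 only
install.packages('gcookbook')
library(ggplot2)
library(gcookbook)
sRNACount <- read.csv("J:/myProject/sRNA_count/sRNA_count.csv", header = TRUE)
sRNACount
# First attempt: xlim(15,30) may drop the bars at 15 and 30, because their edges extend past the limits
ggplot(sRNACount, aes(x=sRNA, y=NUM)) + scale_y_log10() + xlim(15,30) + geom_bar(stat="identity", fill="lightblue", colour="black")
# Wider limits (14 to 31) leave room for every bar from 15 to 30, with one tick per length
ggplot(sRNACount, aes(x=sRNA, y=NUM)) + scale_y_log10() + scale_x_continuous(limits=c(14,31),breaks=seq(15,30,1)) + geom_bar(stat="identity", fill="lightblue", colour="black")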