改善mosdepth输出结果的python脚本

mosdepth是一个用于计算测序深度的软件
使用方法可参考下面这篇文章
//www.greatytc.com/p/6bd56cd8d59e

mosdepth在计算一个区间的测序深度时,会用这个区间所有碱基比对到的reads数目的和除以这个区间的碱基数目(包括深度为0的位点),在对一些多倍体进行计算深度时可能因此导致总体测序深度偏低
比如测序深度50X的数据可能会因此算出来的平均深度只有5到20X左右

因此为了真实反映测序胜读,我们需要对mosdepth的输出结果进行改善,即把深度为0的碱基位点去掉,再去计算区间的平均深度或中间值深度

首先使用mosdepth输出每个碱基的测序深度
mosdepth -Q 30 -t 8 output input

然后再使用我写的脚本将深度为0的碱基位点去掉,计算区间的深度(平均深度或中间值深度)
脚本如下

 import argparse
import sys
import gzip
import numpy as np
pars=argparse.ArgumentParser()
pars.add_argument("-perbase_file",required=True,help="The output file of mosdepth which contains per base depth")
pars.add_argument("-win_size",required=True,help="The window size for calcuating depth")
pars.add_argument("-out",required=True,help="The output dir")
pars.add_argument("-type",required=True,help="The way of calculating depth,it should be mean or median ")
args=pars.parse_args()

if args.perbase_file:
    file_index=sys.argv.index("-perbase_file")
    perbase_file=sys.argv[file_index+1]
else:
    raise Exception("Please input the per base depth file")

if args.win_size:
    size_index=sys.argv.index("-win_size")
    win_size=int(sys.argv[size_index+1])
else:
    raise Exception("Please input the window size")

if args.out:
    out_index=sys.argv.index("-out")
    out_dir=sys.argv[out_index+1]
else:
    raise Exception("Please input the output file")

if args.type:
    type_index=sys.argv.index("-type")
    cal_type=sys.argv[type_index+1]
else:
    raise Exception("Please choose the way of calculating depth,it should be mean or median ")
###读取物种名
out_str=perbase_file.replace("\n","").split("/")[-1]
spe_name=out_str.split(".")[0]

def list_append(add_list,add_num,add_value):
    if(add_num==0):
        return add_list
    else:
        for i in range(0,add_num):
            add_list.append(add_value)
        return add_list

###构建输出函数
def mean_out_put(chr,region_list,depth_list):
    start_pos=0
    depth_sum=0
    depth_num=0
    size_num=0
    for i in range(0,len(region_list)):
        region=region_list[i]
        depth=depth_list[i]
        if(region[-1]-start_pos<win_size):
            if(i==len(region_list)-1):
                size_num+=1
                if (depth == 0):
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        if (depth_num == 0):
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, 0.0), file=f)
                        else:
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, (depth_sum * 1.0 / depth_num)), file=f)
                else:
                    depth_num += (region[-1] - region[0])
                    depth_sum += ((region[-1] - region[0]) * depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, (depth_sum * 1.0 / depth_num)), file=f)
            else:
                if(depth==0):
                    continue
                else:
                    depth_num+=(region[-1]-region[0])
                    depth_sum+=((region[-1]-region[0])*depth)

        if(region[-1]-start_pos>win_size):
            if (i == len(region_list)-1):
                if (depth == 0):
                    size_num += 1
                    start_pos = win_size * size_num
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        if (depth_num == 0):
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, 0.0), file=f)
                        else:
                            print("%s\t%d\t%d\t%f" % (
                            chr, win_size * (size_num - 1), win_size * size_num, (depth_sum * 1.0 / depth_num)), file=f)
                    depth_sum = 0
                    depth_num = 0
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                            print("%s\t%d\t%d\t%f" % (chr, win_size * size_num , win_size * (size_num+1), 0.0), file=f)
                else:
                    size_num += 1
                    start_pos = win_size * size_num
                    depth_num += (start_pos - region[0])
                    depth_sum += ((start_pos - region[0]) * depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (
                        chr, win_size * (size_num - 1), win_size * size_num, (depth_sum * 1.0 / depth_num)), file=f)
                    depth_sum = 0
                    depth_num = 0
                    depth_num += region[-1] - start_pos
                    depth_sum += ((region[-1] - start_pos) * depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (
                        chr, win_size * (size_num ), win_size * (size_num+1), (depth_sum * 1.0 / depth_num)), file=f)
            else:
                if (depth == 0):
                    size_num += 1
                    start_pos = win_size * size_num
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        if (depth_num == 0):
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, 0.0), file=f)
                        else:
                            print("%s\t%d\t%d\t%f" % (
                            chr, win_size * (size_num - 1), win_size * size_num, (depth_sum * 1.0 / depth_num)), file=f)
                    depth_sum = 0
                    depth_num = 0
                else:
                    size_num += 1
                    start_pos = win_size * size_num
                    depth_num += (start_pos - region[0])
                    depth_sum += ((start_pos - region[0]) * depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (
                        chr, win_size * (size_num - 1), win_size * size_num, (depth_sum * 1.0 / depth_num)), file=f)
                    depth_sum = 0
                    depth_num = 0
                    depth_num += region[-1] - start_pos
                    depth_sum += ((region[-1] - start_pos) * depth)


        elif(region[-1]-start_pos==win_size):
            if(depth==0):
                size_num+=1
                start_pos=win_size*size_num
                with open(r"%s/%s.region.txt" % (out_dir,spe_name),'a+') as f:
                    if(depth_num==0):
                        print("%s\t%d\t%d\t%f" % (chr,win_size*(size_num-1),win_size*size_num,0.0),file=f)
                    else:
                        print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, (depth_sum*1.0/depth_num)), file=f)
                depth_sum=0
                depth_num=0
            else:
                size_num += 1
                start_pos = win_size * size_num
                depth_num+=(start_pos-region[0])
                depth_sum+=((start_pos-region[0])*depth)
                with open(r"%s/%s.region.txt" % (out_dir,spe_name),'a+') as f:
                    print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, (depth_sum * 1.0 / depth_num)), file=f)
                depth_sum=0
                depth_num=0
                depth_num+=region[-1]-start_pos
                depth_sum+=((region[-1]-start_pos)*depth)

def mid_out_put(chr,region_list,depth_list):
    start_pos = 0
    size_num = 0
    depth_num=0
    depth_value=[]
    for i in range(0,len(region_list)):
        region=region_list[i]
        depth=depth_list[i]
        if (region[-1] - start_pos < win_size):
            if (i == len(region_list) - 1):
                size_num += 1
                if (depth == 0):
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        if (depth_num == 0):
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, 0.0), file=f)
                        else:
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)), file=f)
                else:
                    depth_num = (region[-1] - region[0])
                    depth_value=list_append(add_list=depth_value,add_num=depth_num,add_value=depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (
                        chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)), file=f)
            else:
                if (depth == 0):
                    continue
                else:
                    depth_num=(region[-1] - region[0])
                    depth_value=list_append(add_list=depth_value,add_num=depth_num,add_value=depth)


        if (region[-1] - start_pos > win_size):
            if (i == len(region_list) - 1):
                if (depth == 0):
                    size_num += 1
                    start_pos = win_size * size_num
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        if (depth_num == 0):
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, 0.0), file=f)
                        else:
                            print("%s\t%d\t%d\t%f" % (
                                chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)),
                                  file=f)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (chr, win_size * size_num, win_size * (size_num + 1), 0.0), file=f)
                else:
                    size_num += 1
                    start_pos = win_size * size_num
                    depth_num =(start_pos - region[0])
                    depth_value=list_append(add_list=depth_value,add_num=depth_num,add_value=depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (
                            chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)), file=f)
                    depth_value=[]
                    depth_num = 0
                    depth_num = region[-1] - start_pos
                    depth_value=list_append(add_list=depth_value,add_num=depth_num,add_value=depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (
                            chr, win_size * (size_num), win_size * (size_num + 1), np.median(depth_value)),
                              file=f)
            else:
                if (depth == 0):
                    size_num += 1
                    start_pos = win_size * size_num
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        if (depth_num == 0):
                            print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, 0.0), file=f)
                        else:
                            print("%s\t%d\t%d\t%f" % (
                                chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)),
                                  file=f)
                    depth_value=[]
                    depth_num = 0
                else:
                    size_num += 1
                    start_pos = win_size * size_num
                    depth_num = (start_pos - region[0])
                    depth_value=list_append(add_list=depth_value,add_num=depth_num,add_value=depth)
                    with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                        print("%s\t%d\t%d\t%f" % (
                            chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)), file=f)
                    depth_value=[]
                    depth_num = 0
                    depth_num = region[-1] - start_pos
                    depth_value=list_append(add_list=depth_value,add_num=depth_num,add_value=depth)


        elif (region[-1] - start_pos == win_size):
            if (depth == 0):
                size_num += 1
                start_pos = win_size * size_num
                with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                    if (depth_num == 0):
                        print("%s\t%d\t%d\t%f" % (chr, win_size * (size_num - 1), win_size * size_num, 0.0), file=f)
                    else:
                        print("%s\t%d\t%d\t%f" % (
                        chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)), file=f)
                depth_value=[]
                depth_num = 0
            else:
                size_num += 1
                start_pos = win_size * size_num
                depth_num = (start_pos - region[0])
                depth_value=list_append(add_list=depth_value,add_num=depth_num,add_value=depth)
                with open(r"%s/%s.region.txt" % (out_dir, spe_name), 'a+') as f:
                    print("%s\t%d\t%d\t%f" % (
                    chr, win_size * (size_num - 1), win_size * size_num, np.median(depth_value)), file=f)
                depth_value=[]
                depth_num = 0









###读取文件,将每个depth值的区段和depth值按照染色体来存储
chr_list=[]
###先将第一条染色体来添加
chr_list.append("LG01")
region_list=[]
depth_list=[]
chr_index=0
with gzip.open(perbase_file,'rt') as f:
    for line in f:
        out_str=line.replace("\n","").split("\t")
        if(out_str[0] not in chr_list):
            chr_list.append(out_str[0])
            chr=chr_list[chr_index]
            chr_index+=1
            if(cal_type=="mean"):
                mean_out_put(chr,region_list=region_list,depth_list=depth_list)
            if(cal_type=="median"):
                mid_out_put(chr,region_list=region_list,depth_list=depth_list)
            region_list=[]
            depth_list=[]
        else:
            region=[]
            region.append(int(out_str[1]))
            region.append(int(out_str[2]))
            depth_list.append(int(out_str[3]))
            region_list.append(region)

脚本中第一条染色体名称或者物种名的获取可能有些差异,但大体思路是这样的
最后运行脚本

python mosdepth_change.py -perbase_file $i -win_size 1000000 -out /projects01/DS20070100001/05.Subgenomes/01.split/reseq_depth/02.depths/mosdepth_new -type median

可以设置窗口大小,以及计算深度的方式(mean、median)

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 219,270评论 6 508
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,489评论 3 395
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 165,630评论 0 356
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,906评论 1 295
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,928评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,718评论 1 305
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,442评论 3 420
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,345评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,802评论 1 317
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,984评论 3 337
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,117评论 1 351
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,810评论 5 346
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,462评论 3 331
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,011评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,139评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,377评论 3 373
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,060评论 2 355