pybedtools文档-平行在外显子和内含子中计算reads数

本例子展示的是如何在内含子和外显子中统计reads数量。不但包括了这个内容，还包括了其他的命名展示。例如

1、BAM file support (for more, see Working with BAM files)

2、indexing into Interval objects (for more, see Intervals)

3、filtering (for more, see Filtering)

4、streaming (for more, see Using BedTool objects as iterators/generators)

5、ability to use parallel processing

下面是含有注释信息的命令行展示

import sys

import multiprocessing

import pybedtools

# get example GFF and BAM filenames 获得例子中的GFF和BAM的内容

gff = pybedtools.example_filename('gdc.gff')

bam = pybedtools.example_filename('gdc.bam')

# Some GFF files have invalid entries -- like chromosomes with negative coords GFF文件中含有不实的内容，例如染色为负数坐标的信息

# orfeatures of length = 0. This lineremoves them and saves the result in a tempfile 或者特征值的长度为0，可以使用命令行去除这些无意义的内容，并且存储为另外一个文件。

g = pybedtools.BedTool(gff).remove_invalid().saveas()

# Next, we create a function to pass only features for a particular featuretype. This is similar to a"grep" operation when applied to every#feature in a BedTool

~~随后，我们构建了一个函数用来只显示符合要求的特征，类似于grep命令~~

def featuretype_filter(feature, featuretype):

    if feature[2] == featuretype:

        return True

    return False

# This function will eventually be run in parallel, applying the filter above to several different BedTools simultaneously 多种不同的BedTools同时选择

def subset_featuretypes(featuretype):

    result = g.filter(featuretype_filter, featuretype).saveas()

    return pybedtools.BedTool(result.fn)

~~# Thisfunction performs the intersection of a BAM file with a GFF file and returns the total number of hits. Itwill eventually be run in parallel.~~

def count_reads_in_features(features_fn):

    """

    Callback function to count reads infeatures

    """

  # BAM files are auto-detected; no need foran `abam` argument. Here we # construct a new BedTool out of the BAM file and intersect it with the features filename. We use stream=True so that no intermediate tempfile is created, and bed=True so that the.count() method can iterate through the resulting streamed BedTool.

    return pybedtools.BedTool(bam).intersect(b=features_fn, stream=True).count()

# Set up a pool of workers for parallel processing pool = multiprocessing.Pool()

# Create separate files for introns and exons, using the function we defined above

featuretypes = ('intron', 'exon')

introns, exons = pool.map(subset_featuretypes, featuretypes)

# Perform some genome algebra to get unique and shared intron/exon regions. Here we keep only the filename of the results, which is safer than an entire BedTool for passing around in parallel computations.

exon_only = exons.subtract(introns).merge().remove_invalid().saveas().fn

intron_only = introns.subtract(exons).merge().remove_invalid().saveas().fn

intron_and_exon = exons.intersect(introns).merge().remove_invalid().saveas().fn

# Do intersections with BAM file in parallel, using the other function we defined above

features = (exon_only, intron_only, intron_and_exon)

results = pool.map(count_reads_in_features, features)

# Print the results

labels = (' exon only:', ' intron only:', 'intron and exon:')

for label, reads in zip(labels, results):

    sys.stdout.write('%s %s\n' % (label, reads))

pybedtools文档-平行在外显子和内含子中计算reads数

1、BAM file support (for more, see Working with BAM files)

2、indexing into Interval objects (for more, see Intervals)

3、filtering (for more, see Filtering)

4、streaming (for more, see Using BedTool objects as iterators/generators)

5、ability to use parallel processing

推荐阅读更多精彩内容