Three brief examples¶
Here are three examples to show typical usage of pybedtools
. More
info can be found in the docstrings of pybedtools
methods and in the
Tutorial Contents.
You can also check out Shell script comparison for a simple
example of how pybedtools
can improve readability of your code with no
loss of speed compared to bash scripting.
Note
Please take the time to read and understand the conventions
pybedtools
uses to handle files with different coordinate systems
(e.g., 0-based BED files vs 1-based GFF files) which are described
here.
In summary,
- Integer values representing start/stop are always in 0-based
coordinates, regardless of file format. This means that all
Interval
objects can be treated identically, and greatly simplifies underlying code. - String values representing start/stop will use coordinates appropriate for the format (1-based for GFF; 0-based for BED).
Example 1: Save a BED file of intersections, with track line¶
This example saves a new BED file of intersections between your files mydata/snps.bed
and
mydata/exons.bed
, adding a track line to the output:
>>> import pybedtools
>>> a = pybedtools.BedTool('mydata/snps.bed')
>>> a.intersect('mydata/exons.bed').saveas('snps-in-exons.bed', trackline="track name='SNPs in exons' color=128,0,0")
Example 2: Intersections for a 3-way Venn diagram¶
This example gets values for a 3-way Venn diagram of overlaps. This
demonstrates operator overloading of BedTool
objects. It assumes that
you have the files a.bed
, b.bed
, and c.bed
in your current working
directory. If you'd like to use example files that come with
pybedtools
, then replace strings like 'a.bed'
with
pybedtools.example_filename('a.bed')
, which will retrieve the absolute path
to the example data file.:
>>> import pybedtools
>>> # set up 3 different bedtools
>>> a = pybedtools.BedTool('a.bed')
>>> b = pybedtools.BedTool('b.bed')
>>> c = pybedtools.BedTool('c.bed')
>>> (a-b-c).count() # unique to a
>>> (a+b-c).count() # in a and b, not c
>>> (a+b+c).count() # common to all
>>> # ... and so on, for all the combinations.
For more, see the pybedtools.scripts.venn_mpl
and
pybedtools.scripts.venn_gchart
scripts, which wrap this functionality in
command-line scripts to create Venn diagrams using either matplotlib or Google
Charts API respectively. Also see the pybedtools.contrib.venn_maker
module for a flexible interface to the VennDiagram R
package.
Example 3: Count reads in introns and exons, in parallel¶
This example shows how to count the number of reads in introns and exons in
parallel. It is somewhat more involved, but illustrates several additional
features of pybedtools
such as:
- BAM file support (for more, see Working with BAM files)
- indexing into Interval objects (for more, see Intervals)
- filtering (for more, see Filtering)
- streaming (for more, see Using BedTool objects as iterators/generators)
- ability to use parallel processing
The first listing has many explanatory comments, and the second listing shows
the same code with no comments to give more of a feel for pybedtools
.
import sys
import multiprocessing
import pybedtools
# get example GFF and BAM filenames
gff = pybedtools.example_filename('gdc.gff')
bam = pybedtools.example_filename('gdc.bam')
# Some GFF files have invalid entries -- like chromosomes with negative coords
# or features of length = 0. This line removes them and saves the result in a
# tempfile
g = pybedtools.BedTool(gff).remove_invalid().saveas()
# Next, we create a function to pass only features for a particular
# featuretype. This is similar to a "grep" operation when applied to every
# feature in a BedTool
def featuretype_filter(feature, featuretype):
if feature[2] == featuretype:
return True
return False
# This function will eventually be run in parallel, applying the filter above
# to several different BedTools simultaneously
def subset_featuretypes(featuretype):
result = g.filter(featuretype_filter, featuretype).saveas()
return pybedtools.BedTool(result.fn)
# This function performs the intersection of a BAM file with a GFF file and
# returns the total number of hits. It will eventually be run in parallel.
def count_reads_in_features(features_fn):
"""
Callback function to count reads in features
"""
# BAM files are auto-detected; no need for an `abam` argument. Here we
# construct a new BedTool out of the BAM file and intersect it with the
# features filename.
# We use stream=True so that no intermediate tempfile is
# created, and bed=True so that the .count() method can iterate through the
# resulting streamed BedTool.
return pybedtools.BedTool(bam).intersect(
b=features_fn,
stream=True).count()
# Set up a pool of workers for parallel processing
pool = multiprocessing.Pool()
# Create separate files for introns and exons, using the function we defined
# above
featuretypes = ('intron', 'exon')
introns, exons = pool.map(subset_featuretypes, featuretypes)
# Perform some genome algebra to get unique and shared intron/exon regions.
# Here we keep only the filename of the results, which is safer than an entire
# BedTool for passing around in parallel computations.
exon_only = exons.subtract(introns).merge().remove_invalid().saveas().fn
intron_only = introns.subtract(exons).merge().remove_invalid().saveas().fn
intron_and_exon = exons.intersect(introns).merge().remove_invalid().saveas().fn
# Do intersections with BAM file in parallel, using the other function we
# defined above
features = (exon_only, intron_only, intron_and_exon)
results = pool.map(count_reads_in_features, features)
# Print the results
labels = (' exon only:',
' intron only:',
'intron and exon:')
for label, reads in zip(labels, results):
sys.stdout.write('%s %s\n' % (label, reads))
Here's the same code but with no comments:
import sys
import multiprocessing
import pybedtools
gff = pybedtools.example_filename('gdc.gff')
bam = pybedtools.example_filename('gdc.bam')
g = pybedtools.BedTool(gff).remove_invalid().saveas()
def featuretype_filter(feature, featuretype):
if feature[2] == featuretype:
return True
return False
def subset_featuretypes(featuretype):
result = g.filter(featuretype_filter, featuretype).saveas()
return pybedtools.BedTool(result.fn)
def count_reads_in_features(features_fn):
"""
Callback function to count reads in features
"""
return pybedtools.BedTool(bam).intersect(
b=features_fn,
stream=True).count()
pool = multiprocessing.Pool()
featuretypes = ('intron', 'exon')
introns, exons = pool.map(subset_featuretypes, featuretypes)
exon_only = exons.subtract(introns).merge().remove_invalid().saveas().fn
intron_only = introns.subtract(exons).merge().remove_invalid().saveas().fn
intron_and_exon = exons.intersect(introns).merge().remove_invalid().saveas().fn
features = (exon_only, intron_only, intron_and_exon)
results = pool.map(count_reads_in_features, features)
labels = (' exon only:',
' intron only:',
'intron and exon:')
for label, reads in zip(labels, results):
sys.stdout.write('%s %s\n' % (label, reads))
For more on using pybedtools
, continue on to the Tutorial Contents . .
.