bedtools
Compare and manipulate genomic intervals
TLDR
Intersect file [a] with file(s) [b], requiring overlaps to be on the same [s]trand, and save the result to a file
Intersect two files with a [l]eft [o]uter [j]oin, i.e. report each feature from file1, with NULL fields if it has no overlap in file2
Intersect two pre-sorted files using a more efficient algorithm
[g]roup a file by the first three and the fifth columns and apply the sum [o]peration to the sixth [c]olumn
Convert a BAM-formatted [i]nput file to BED format
For each feature in file1.bed, find the closest feature in file2.bed and write their [d]istance in an extra column (input files must be sorted)
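The entries above describe each task without its command line. The sketch below is one plausible reading, inferred from the bracketed flag letters. The function names and all file arguments are hypothetical placeholders. The commands are wrapped in shell functions so the snippet can be sourced without running bedtools.

```shell
# Hypothetical wrappers; every file argument is a placeholder chosen by the caller.

# Same-[s]trand overlaps between file A and one or more B files, saved to a file.
# usage: intersect_same_strand out.bed a.bed b1.bed [b2.bed ...]
intersect_same_strand() {
  out=$1; a=$2; shift 2
  bedtools intersect -s -a "$a" -b "$@" > "$out"
}

# [l]eft [o]uter [j]oin: every feature of file 1, with NULL fields when nothing overlaps.
left_outer_join() { bedtools intersect -loj -a "$1" -b "$2"; }

# Faster algorithm for inputs already sorted by chromosome, then start.
intersect_presorted() { bedtools intersect -sorted -a "$1" -b "$2"; }

# Group by columns 1, 2, 3 and 5, then sum column 6 within each group.
sum_col6_by_group() { bedtools groupby -i "$1" -g 1,2,3,5 -c 6 -o sum; }

# BAM alignments to BED records on standard output.
bam_to_bed() { bedtools bamtobed -i "$1"; }

# Closest feature in file 2 for each feature in file 1, with the distance appended.
closest_with_distance() { bedtools closest -d -a "$1" -b "$2"; }
```

For instance, `left_outer_join file1.bed file2.bed > joined.bed` would write the joined records to joined.bed, assuming bedtools is installed and both files exist.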
SYNOPSIS
bedtools <subcommand> [options] [arguments]
Example: bedtools intersect -a file_A.bed -b file_B.bed
PARAMETERS
intersect
Finds overlapping intervals between two BED/GFF/VCF files (A and B).
merge
Combines overlapping or adjacent intervals in a single file into a consolidated set.
subtract
Removes the portions of intervals in file A that overlap with intervals in file B.
flank
Creates new intervals that flank existing intervals on one or both sides, optionally defining upstream and downstream relative to strand.
slop
Extends the start and/or end of intervals by a specified number of bases.
sort
Sorts BED/GFF/VCF files by chromosome and then by start position.
closest
Finds the closest interval in file B for each interval in file A, reporting distance.
coverage
Computes the coverage of intervals in file A by intervals in file B, reporting depth and fraction.
getfasta
Extracts DNA sequences corresponding to intervals from a FASTA reference genome.
annotate
Annotates each interval in one file with the fraction covered by, or the count of, overlapping features from one or more other files.
DESCRIPTION
bedtools is a comprehensive, open-source suite of command-line utilities designed for manipulating genomic features and intervals. Developed by Aaron Quinlan and his team, it provides efficient tools for common genomics tasks such as intersecting, merging, counting, and comparing genomic regions across various file formats, including BED, GFF, VCF, and BAM. Its modular design allows users to combine multiple operations through piping, making it an indispensable tool for bioinformatics researchers and computational biologists. bedtools excels in speed and versatility, facilitating complex analyses on large-scale genomic datasets, from identifying overlaps between gene annotations and sequencing reads to performing advanced feature arithmetic.
CAVEATS
Users must ensure that chromosome naming conventions (e.g., 'chr1' vs. '1') are consistent across all input files, since differently named chromosomes are treated as distinct and yield no overlaps. BED files use 0-based start coordinates and 1-based end coordinates, which is a common source of off-by-one errors when converting from 1-based formats. Pre-sorting input files by chromosome and then by start position matters for both correctness and speed: merge requires sorted input, and intersect can use a faster, lower-memory algorithm on sorted files when given the -sorted option.
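As a minimal sketch of the two preparation steps mentioned above, the following harmonizes chromosome names and sorts a BED file using only awk and sort. The sample data and file names are made up for illustration, and the simple prefix rule is an assumption: it does not handle special cases such as 'MT' vs. 'chrM'.

```shell
# Hypothetical sample: names without the "chr" prefix, in unsorted order.
workdir=$(mktemp -d)
printf '1\t500\t600\n1\t100\t200\nX\t50\t80\n' > "$workdir/unprefixed.bed"

# Add "chr" where it is missing, then sort by chromosome and numeric start.
awk 'BEGIN { OFS = "\t" } $1 !~ /^chr/ { $1 = "chr" $1 } { print }' "$workdir/unprefixed.bed" \
  | LC_ALL=C sort -k1,1 -k2,2n > "$workdir/prefixed.sorted.bed"

cat "$workdir/prefixed.sorted.bed"
# chr1	100	200
# chr1	500	600
# chrX	50	80
```

LC_ALL=C is set so the chromosome order is byte-wise and does not depend on the locale. Files that will be compared with each other should be sorted the same way.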
COORDINATE SYSTEM
BED files, which are central to bedtools operations, use a 0-based start coordinate and a 1-based end coordinate. This means the start position is inclusive (it names the first base of the interval, counting from 0), and the end position is exclusive (in 0-based terms it names the base after the end of the interval). The length of an interval is therefore end minus start. For example, a 10-base interval covering 0-based positions 100 through 109 (1-based positions 101 through 110) would be represented as 'chr1\t100\t110' in BED format. Understanding this convention is crucial to avoid off-by-one errors.
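The arithmetic behind this convention can be sketched with two small awk helpers. The function names to_bed and bed_length are hypothetical, and the input is assumed to be three tab-separated columns (chromosome, start, end).

```shell
# 1-based inclusive coordinates -> BED: subtract 1 from the start, keep the end.
to_bed() { awk 'BEGIN { OFS = "\t" } { print $1, $2 - 1, $3 }'; }

# Length of a BED interval is end minus start, because the end is exclusive.
bed_length() { awk '{ print $3 - $2 }'; }

printf 'chr1\t101\t110\n' | to_bed      # prints: chr1	100	110
printf 'chr1\t100\t110\n' | bed_length  # prints: 10
```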
INPUT/OUTPUT FORMATS & PIPING
bedtools supports a wide range of genomic file formats beyond BED, including GFF/GTF, VCF, BAM, and FASTA. Its design emphasizes standard input/output (stdin/stdout), allowing seamless integration into complex command-line pipelines. Users can pipe the output of one bedtools subcommand directly as input to another, or combine it with other Unix utilities like grep, awk, and sort for highly customized analyses.
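A pipeline of this kind might look like the sketch below, which chains intersect, the Unix sort utility, and merge. The function name overlap_merged and its two file arguments are hypothetical. It is defined as a function so that nothing runs until it is called.

```shell
# Hypothetical helper: regions of file A overlapped by file B, collapsed into
# non-overlapping intervals.
# usage: overlap_merged a.bed b.bed > merged.bed
overlap_merged() {
  bedtools intersect -a "$1" -b "$2" \
    | sort -k1,1 -k2,2n \
    | bedtools merge -i stdin
}
```

The intermediate sort is there because merge expects input sorted by chromosome and then by start position. Each stage reads from stdin and writes to stdout, so no temporary files are needed.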
HISTORY
bedtools was initially developed by Aaron R. Quinlan at the University of Virginia, and development later continued with his team at the University of Utah. The first major release was around 2009-2010. It quickly gained widespread adoption in the genomics community due to its speed, comprehensive functionality, and ease of use in command-line scripting. Development is ongoing, with new features and optimizations added over time, and it remains a cornerstone of many bioinformatics pipelines.