LinuxCommandLibrary

bedtools

Compare and manipulate genomic intervals

TLDR

Intersect file [a] with file(s) [b], requiring overlaps to be on the same [s]trand, and save the result to a file

$ bedtools intersect -a [path/to/file_A] -b [path/to/file_B1 path/to/file_B2 ...] -s > [path/to/output_file]

Intersect two files with a [l]eft [o]uter [j]oin, i.e. report each feature from file1 and NULL if no overlap with file2
$ bedtools intersect -a [path/to/file1] -b [path/to/file2] -loj > [path/to/output_file]

Use a more memory-efficient algorithm to intersect two pre-sorted files
$ bedtools intersect -a [path/to/file1] -b [path/to/file2] -sorted > [path/to/output_file]

[g]roup a file by the first three and the fifth columns and apply the sum [o]peration to the sixth [c]olumn (input must be sorted by the grouping columns)
$ bedtools groupby -i [path/to/file] -g 1-3,5 -c 6 -o sum

Convert bam-formatted [i]nput file to a bed-formatted one
$ bedtools bamtobed -i [path/to/file.bam] > [path/to/file.bed]

For each feature in file1.bed, find the closest feature in file2.bed and report the [d]istance in an extra column (input files must be sorted)
$ bedtools closest -a [path/to/file1.bed] -b [path/to/file2.bed] -d

SYNOPSIS

bedtools <subcommand> [options] [arguments]

Example: bedtools intersect -a file_A.bed -b file_B.bed

PARAMETERS

intersect
    Finds overlapping intervals between two BED/GFF/VCF files (A and B).

merge
    Combines overlapping or adjacent intervals in a single file into a consolidated set.

subtract
    Removes the portions of intervals in file A that overlap with intervals in file B.

flank
    Creates new intervals that flank existing intervals on one or both sides. Requires a genome file; with -s, upstream and downstream are defined relative to strand.

slop
    Extends the start and/or end of intervals by a specified number of bases.

sort
    Sorts BED/GFF/VCF files by chromosome and then by start position.

closest
    Finds the closest interval in file B for each interval in file A, optionally reporting the distance (-d).

coverage
    Computes the coverage of intervals in file A by intervals in file B, reporting depth and fraction.

getfasta
    Extracts DNA sequences corresponding to intervals from a FASTA reference genome.

annotate
    Annotates each interval in one file with the fraction covered by features from one or more other files, or with the number of overlapping features when -counts is given.

DESCRIPTION

bedtools is a comprehensive, open-source suite of command-line utilities designed for manipulating genomic features and intervals. Developed by Aaron Quinlan and his team, it provides efficient tools for common genomics tasks such as intersecting, merging, counting, and comparing genomic regions across various file formats, including BED, GFF, VCF, and BAM. Its modular design allows users to combine multiple operations through piping, making it an indispensable tool for bioinformatics researchers and computational biologists. bedtools excels in speed and versatility, facilitating complex analyses on large-scale genomic datasets, from identifying overlaps between gene annotations and sequencing reads to performing advanced feature arithmetic.

CAVEATS

Chromosome names must be consistent across all input files (e.g., 'chr1' vs. '1'); otherwise features that should overlap will not be matched. BED files use 0-based, half-open coordinates (0-based inclusive start, exclusive end), whereas GFF and VCF are 1-based, so converting between formats by hand is a common source of off-by-one errors. Some subcommands, such as merge, require input sorted by chromosome and then by start position. For intersect, pre-sorting both files and passing -sorted substantially reduces memory use on large files.
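
The two preparations described above can be sketched with standard Unix tools alone. This is a minimal illustration on made-up sample data, not a bedtools feature: it adds a 'chr' prefix to bare chromosome names and then sorts by chromosome and numerically by start position. The variable names are hypothetical.

```shell
# Hypothetical sample data: mixed naming ("1" vs "chr1"), unsorted.
unsorted=$(printf '1\t500\t600\nchr1\t100\t200\n2\t50\t80\n')

# Add a "chr" prefix where it is missing, then sort by chromosome
# (lexicographically) and by start position (numerically).
normalized=$(printf '%s\n' "$unsorted" \
  | awk 'BEGIN { FS = OFS = "\t" } $1 !~ /^chr/ { $1 = "chr" $1 } { print }' \
  | LC_ALL=C sort -k1,1 -k2,2n)

printf '%s\n' "$normalized"
```

Whether to add or strip the prefix depends on the naming used by the other input files; the point is only that all files agree.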

COORDINATE SYSTEM

BED files, which are central to bedtools operations, use 0-based, half-open coordinates: the start is 0-based and inclusive (the first base of the interval), and the end is exclusive (the position just after the last base). Equivalently, the end equals the 1-based position of the last base. For example, the 10-base interval covering 0-based positions 100 through 109 is written as 'chr1\t100\t110', and its length is end minus start (110 - 100 = 10). Understanding this convention is crucial to avoid off-by-one errors.
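
As a small worked example of this convention, the sketch below converts a 1-based, fully closed interval (the style used by GFF) into BED coordinates with awk and then computes its length. The sample interval and variable names are made up for illustration.

```shell
# Hypothetical 1-based, fully closed interval: bases 101 through 110 of chr1.
one_based=$(printf 'chr1\t101\t110\n')

# Convert to BED: subtract 1 from the start, keep the end unchanged.
bed=$(printf '%s\n' "$one_based" \
  | awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3 }')

# In BED, the interval length is simply end - start.
length=$(printf '%s\n' "$bed" | awk -F '\t' '{ print $3 - $2 }')

printf '%s\tlength=%s\n' "$bed" "$length"
```

The converted line is 'chr1\t100\t110' with a length of 10, matching the example in the paragraph above.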

INPUT/OUTPUT FORMATS & PIPING

bedtools supports a wide range of genomic file formats beyond BED, including GFF/GTF, VCF, BAM, and FASTA. Its design emphasizes standard input/output (stdin/stdout), allowing seamless integration into complex command-line pipelines. Users can pipe the output of one bedtools subcommand directly as input to another, or combine it with other Unix utilities like grep, awk, and sort for highly customized analyses.
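
A minimal sketch of such a pipeline, assuming bedtools is installed and on the PATH: it sorts a small, made-up BED file, merges overlapping intervals, and keeps only the merged intervals that overlap a second file. The file names and contents are hypothetical; the literal word stdin tells a subcommand to read from the pipe.

```shell
# Hypothetical sample data written to a temporary directory.
workdir=$(mktemp -d)
printf 'chr1\t150\t300\nchr1\t100\t200\nchr1\t500\t600\n' > "$workdir/a.bed"
printf 'chr1\t250\t260\n' > "$workdir/b.bed"

if command -v bedtools >/dev/null 2>&1; then
  # Sort A, merge overlapping intervals, then report each merged
  # interval once (-u) if it overlaps anything in B.
  hits=$(sort -k1,1 -k2,2n "$workdir/a.bed" \
    | bedtools merge -i stdin \
    | bedtools intersect -a stdin -b "$workdir/b.bed" -u)
  printf '%s\n' "$hits"
else
  echo "bedtools not found; skipping pipeline" >&2
fi
```

Sorting first matters here because merge expects input ordered by chromosome and start position.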

HISTORY

bedtools was initially developed by Aaron R. Quinlan at the University of Virginia, in Ira Hall's laboratory, and later maintained by his group at the University of Utah. The first releases date from around 2009-2010. It was quickly adopted across the genomics community for its speed, broad functionality, and ease of use in command-line scripting. Development is ongoing, with new features and optimizations added over time, and it remains a standard component of many bioinformatics pipelines.

SEE ALSO

grep(1), awk(1), sort(1), samtools(1), vcftools(1), htseq-count
