LinuxCommandLibrary

bedtools

Compare and manipulate genomic intervals

TLDR

Intersect file [a] with one or more files [b], requiring overlaps to be on the same [s]trand, and save the result to a file

$ bedtools intersect -a [path/to/file_A] -b [path/to/file_B1 path/to/file_B2 ...] -s > [path/to/output_file]

Intersect two files with a [l]eft [o]uter [j]oin, i.e. report every feature from file1, paired with a NULL feature when nothing in file2 overlaps it
$ bedtools intersect -a [path/to/file1] -b [path/to/file2] -loj > [path/to/output_file]

Use a more memory-efficient algorithm to intersect two pre-sorted files
$ bedtools intersect -a [path/to/file1] -b [path/to/file2] -sorted > [path/to/output_file]

[g]roup a file by the first three and the fifth columns and apply the sum [o]peration to the sixth [c]olumn (input must be pre-sorted by the grouping columns)
$ bedtools groupby -i [path/to/file] -g 1-3,5 -c 6 -o sum

Convert a BAM-formatted [i]nput file to BED format
$ bedtools bamtobed -i [path/to/file.bam] > [path/to/file.bed]

For each feature in file1.bed, find the closest feature in file2.bed and report the [d]istance in an extra column (input files must be sorted)
$ bedtools closest -a [path/to/file1.bed] -b [path/to/file2.bed] -d
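
To make the groupby example above concrete, here is a minimal awk sketch of the same aggregation: group on columns 1-3 and 5, then sum column 6. The sample rows and the file name example.bed are made up for illustration, and this only approximates the result for sanity-checking; it is not how bedtools implements the operation. Like bedtools groupby, it assumes rows that share a key are adjacent.

```shell
# Hypothetical sample data (tab-separated): chrom, start, end, name, label, value
tmpdir=$(mktemp -d)
printf 'chr1\t100\t200\tfeatA\tx\t3\n'  > "$tmpdir/example.bed"
printf 'chr1\t100\t200\tfeatB\tx\t4\n' >> "$tmpdir/example.bed"
printf 'chr1\t300\t400\tfeatC\ty\t5\n' >> "$tmpdir/example.bed"

# Emit one line per run of identical keys: the key columns plus the column-6 sum
grouped=$(awk 'BEGIN { FS = OFS = "\t" }
{
    key = $1 OFS $2 OFS $3 OFS $5
    if (NR > 1 && key != prev) { print prev, total; total = 0 }
    prev = key
    total += $6
}
END { if (NR > 0) print prev, total }' "$tmpdir/example.bed")

printf '%s\n' "$grouped"
rm -rf "$tmpdir"
```

The first two rows share the key (chr1, 100, 200, x), so their values are summed into a single output line; the third row forms its own group.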

SYNOPSIS

bedtools <subcommand> [options] <input files>

PARAMETERS

-h, --help
    Show the help message, including the list of available subcommands, and exit

--version
    Print version information

--contact
    Show where to send feature requests, bug reports, and mailing-list questions

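-iobuf N
    Input buffer size, with an optional K/M/G suffix (e.g., 500M, default 128M). This is a per-subcommand option (e.g., for intersect) and has no effect on compressed files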

DESCRIPTION

Bedtools is a fast, flexible collection of utilities for genome arithmetic (e.g., intersect, merge, coverage) that use BED coordinates as input. It enables comparisons between data of arbitrary types (e.g., locations vs. locations, locations vs. reads) without a centralized database, making it ideal for high-throughput genomic analyses.

Key features include support for multiple formats (BED, GFF/GTF, VCF, BAM) and dozens of subcommands such as intersect, closest, bamtobed, and multicov. Designed for biologists, it avoids custom programming: each subcommand does one job on plain-text intervals, so analyses are built by chaining commands in pipelines with Unix tools like sort and awk.

Common workflows include finding overlaps between peaks and genes, calculating coverage from BAM alignments, and merging intervals. It handles large datasets, and memory use is lowest when inputs are sorted. It is widely used in NGS pipelines for ChIP-seq, RNA-seq, and variant calling.
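
As a rough sketch of the pipeline style described here, the following filters intervals with awk, sorts them, and hands them to bedtools merge on standard input. The file name peaks.bed, its contents, and the 20 bp length cutoff are all made up for illustration; the merge step is skipped if bedtools is not installed.

```shell
# Hypothetical peak intervals (tab-separated): chrom, start, end
tmpdir=$(mktemp -d)
printf 'chr2\t50\t80\n'    > "$tmpdir/peaks.bed"
printf 'chr1\t150\t400\n' >> "$tmpdir/peaks.bed"
printf 'chr1\t100\t105\n' >> "$tmpdir/peaks.bed"
printf 'chr1\t120\t300\n' >> "$tmpdir/peaks.bed"

# Keep intervals of at least 20 bp, then sort by chromosome and start position
filtered=$(awk 'BEGIN { FS = OFS = "\t" } ($3 - $2) >= 20' "$tmpdir/peaks.bed" \
    | sort -k1,1 -k2,2n)
printf '%s\n' "$filtered"

# Merge overlapping intervals, reading the sorted stream from stdin
if command -v bedtools >/dev/null 2>&1; then
    printf '%s\n' "$filtered" | bedtools merge -i stdin
fi
rm -rf "$tmpdir"
```

The 5 bp interval is dropped by the awk filter, and the remaining intervals reach merge already sorted, which merge requires.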

CAVEATS

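Some subcommands (e.g., merge and closest) require input sorted by chromosome and then by start position (use sort -k1,1 -k2,2n). Others, such as intersect, accept unsorted input but use less memory on sorted input when -sorted is given. Large BAM/BED files can be memory-intensive; use -iobuf to tune the input buffer. Note that -sorted only declares that the input is already sorted; it does not sort the output.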
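
A minimal sketch of sorting a BED file and verifying the order before passing it to a subcommand that expects sorted input. The file names and intervals are made up for illustration. LC_ALL=C is an assumption added here to make the chromosome order byte-wise and reproducible across machines; every file used together should be sorted the same way.

```shell
# Hypothetical unsorted intervals (tab-separated): chrom, start, end
tmpdir=$(mktemp -d)
printf 'chr10\t500\t600\n'   > "$tmpdir/unsorted.bed"
printf 'chr2\t900\t950\n'   >> "$tmpdir/unsorted.bed"
printf 'chr2\t100\t200\n'   >> "$tmpdir/unsorted.bed"
printf 'chr1\t1000\t1100\n' >> "$tmpdir/unsorted.bed"

# Sort by chromosome name (as text), then by start position (as a number)
LC_ALL=C sort -k1,1 -k2,2n "$tmpdir/unsorted.bed" > "$tmpdir/sorted.bed"
sorted=$(cat "$tmpdir/sorted.bed")

# sort -c exits non-zero when a file is not already in the requested order
if LC_ALL=C sort -c -k1,1 -k2,2n "$tmpdir/unsorted.bed" 2>/dev/null; then
    unsorted_ok=yes
else
    unsorted_ok=no
fi
if LC_ALL=C sort -c -k1,1 -k2,2n "$tmpdir/sorted.bed" 2>/dev/null; then
    sorted_ok=yes
else
    sorted_ok=no
fi

printf '%s\n' "$sorted"
rm -rf "$tmpdir"
```

Chromosome names are compared as text, so chr10 sorts before chr2. That is fine as long as all inputs follow the same convention.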

POPULAR SUBCOMMANDS

intersect: Overlaps between files
merge: Combine overlapping intervals
closest: Nearest feature search
bamtobed: BAM to BED conversion
coverage: Read depth per interval

INPUT REQUIREMENTS

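Files need at least the three required BED columns (chrom, start, end); strand-aware options such as -s need BED6, with the strand in column 6. Chromosome names must match exactly across files (chr1 and 1 are treated as different chromosomes). For subcommands that need sorted input, run bedtools sort first.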
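
A minimal sketch of checking for a chromosome-naming mismatch and fixing it by adding a chr prefix. Both files and their contents are hypothetical. The renaming rule is deliberately naive: real assemblies have exceptions (for example MT versus chrM) that it does not handle.

```shell
# Two hypothetical files using different chromosome naming styles
tmpdir=$(mktemp -d)
printf '1\t100\t200\n'     > "$tmpdir/no_prefix.bed"
printf 'X\t300\t400\n'    >> "$tmpdir/no_prefix.bed"
printf 'chr1\t150\t250\n'  > "$tmpdir/with_prefix.bed"

# List the distinct chromosome names in each file and find the shared ones
cut -f1 "$tmpdir/no_prefix.bed"   | sort -u > "$tmpdir/names_a.txt"
cut -f1 "$tmpdir/with_prefix.bed" | sort -u > "$tmpdir/names_b.txt"
shared_before=$(comm -12 "$tmpdir/names_a.txt" "$tmpdir/names_b.txt")

# Add a "chr" prefix to any name that lacks one
renamed=$(awk 'BEGIN { FS = OFS = "\t" } $1 !~ /^chr/ { $1 = "chr" $1 } { print }' \
    "$tmpdir/no_prefix.bed")

# Repeat the comparison with the renamed intervals
printf '%s\n' "$renamed" | cut -f1 | sort -u > "$tmpdir/names_a.txt"
shared_after=$(comm -12 "$tmpdir/names_a.txt" "$tmpdir/names_b.txt")

printf '%s\n' "$renamed"
rm -rf "$tmpdir"
```

Before renaming, the two files share no chromosome names, so an intersect between them would report nothing. After renaming, chr1 is common to both.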

HISTORY

Developed by Aaron Quinlan with Ira Hall in 2008-2009 and first released in 2009. It evolved from BEDTools to bedtools v2 (2012+); v2.31.1 was released in November 2023. It has been a standard bioinformatics tool for more than 15 years, with thousands of citations.

SEE ALSO

samtools(1), bcftools(1), sort(1), awk(1)
