bwa
Align DNA sequences to a reference genome
TLDR
Index the reference genome
Map single-end reads (sequences) to the indexed genome using 32 [t]hreads and compress the result to save space
Map paired-end reads (sequences) to the indexed genome using 32 [t]hreads and compress the result to save space
Map paired-end reads (sequences) to the indexed genome using 32 [t]hreads, [M]arking shorter split hits as secondary so that the SAM output is compatible with Picard, and compress the result
Map paired-end reads (sequences) to the indexed genome using 32 [t]hreads, appending FASTA/Q [C]omments (e.g. BC:Z:CGTAC) to the output, and compress the result
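The entries above could be run roughly as follows. This is a minimal sketch: the file paths are placeholders and map_gz is a hypothetical helper, not part of bwa. Only the -t, -M, and -C options named above are used.

```shell
# map_gz is a hypothetical helper, not part of bwa.
# Usage: map_gz OUT.sam.gz [bwa mem options] REF.fa READS.fq [MATES.fq]
# Runs bwa mem with 32 threads and gzip-compresses the SAM stream.
map_gz() {
    out=$1
    shift
    bwa mem -t 32 "$@" | gzip > "$out"
}

# One call per entry above (placeholder paths, left as comments so that
# nothing runs until real paths are filled in):
#   bwa index path/to/reference.fa
#   map_gz single.sam.gz path/to/reference.fa path/to/reads.fq.gz
#   map_gz paired.sam.gz path/to/reference.fa path/to/reads_1.fq.gz path/to/reads_2.fq.gz
#   map_gz paired.sam.gz -M path/to/reference.fa path/to/reads_1.fq.gz path/to/reads_2.fq.gz
#   map_gz paired.sam.gz -C path/to/reference.fa path/to/reads_1.fq.gz path/to/reads_2.fq.gz
```

Extra options are passed through unchanged, so the same helper covers the plain, -M, and -C variants.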
SYNOPSIS
bwa command [options] [arguments]
Common commands:
bwa index [-a algoType] reference.fa
bwa mem [-t nThreads] [-k minSeedLen] [-M] ... ref.fa reads.fq [reads2.fq]
PARAMETERS
-t
Number of threads to use for alignment (BWA-MEM).
-k
Minimum seed length. Shorter seeds increase sensitivity but reduce speed (BWA-MEM).
-M
Mark shorter split hits as secondary alignments (recommended for Picard compatibility).
-R
Add a complete read group header line to the SAM output, e.g. '@RG\tID:foo\tSM:bar'. A literal \t in the string is converted to a TAB in the output (BWA-MEM).
-p
Assume the first input file contains interleaved paired-end reads (BWA-MEM).
-a
Algorithm for constructing the BWT index: 'is' is fast but does not work for references larger than 2GB, while 'bwtsw' works for large genomes such as the human genome (bwa index).
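The options above can be combined in a single invocation. The sketch below assumes hypothetical wrapper names (align_sample, align_interleaved), a thread count of 8, and a seed length of 19 chosen purely for illustration; only the flags themselves come from the list above.

```shell
# align_sample is a hypothetical wrapper, not part of bwa.
# Usage: align_sample REF.fa SAMPLE READS_1.fq [READS_2.fq] > out.sam
align_sample() {
    ref=$1
    sample=$2
    shift 2
    # bwa converts the two-character sequence \t in -R to a TAB in the header;
    # inside double quotes the shell passes the backslash through unchanged.
    bwa mem -t 8 -k 19 -M -R "@RG\tID:$sample\tSM:$sample" "$ref" "$@"
}

# align_interleaved is also hypothetical: -p tells bwa mem that a single
# file holds both mates of each pair, interleaved.
# Usage: align_interleaved REF.fa INTERLEAVED.fq > out.sam
align_interleaved() {
    bwa mem -t 8 -p "$1" "$2"
}

# Index construction with an explicit algorithm (placeholder path):
#   bwa index -a bwtsw path/to/reference.fa
```

Using the sample name for both ID and SM is only a convenience for this sketch; real pipelines often give each sequencing run or lane its own ID.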
DESCRIPTION
BWA (Burrows-Wheeler Aligner) is a software package for aligning low-divergence sequences against a large reference genome, such as the human genome. It implements three algorithms: BWA-backtrack (bwa aln), BWA-SW (bwa bwasw), and BWA-MEM (bwa mem). BWA-backtrack is designed for Illumina reads up to 100bp, while BWA-SW and BWA-MEM target longer sequences from 70bp to 1Mbp. BWA-MEM is the latest and is generally recommended, including for whole-genome sequencing (WGS) and exome sequencing data; it is used mostly with Illumina reads and can also align longer reads such as PacBio data.
The typical workflow indexes the reference genome once, aligns (maps) the reads, and then post-processes the SAM output, usually after conversion to BAM. BWA is a standard component of bioinformatics pipelines for next-generation sequencing data analysis.
CAVEATS
bwa index can be memory-intensive for very large reference genomes. bwa mem is the generally recommended algorithm for modern sequencing data; the older bwa aln and bwa bwasw are largely superseded for general use but may still be relevant for specific historical datasets or very short reads (<70bp). BWA writes SAM only, so the output usually needs post-processing with tools like Samtools to convert it to BAM, sort it, and index it for downstream analysis.
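The post-processing mentioned above might look like the following sketch. align_sorted_bam is a hypothetical helper, the thread count is arbitrary, and both bwa and samtools are assumed to be on PATH.

```shell
# align_sorted_bam is a hypothetical helper; it needs bwa and samtools on PATH.
# Usage: align_sorted_bam REF.fa OUT.bam READS.fq [MATES.fq]
align_sorted_bam() {
    ref=$1
    bam=$2
    shift 2
    # Sort the SAM stream by coordinate into a BAM file, then index that file.
    bwa mem -t 8 "$ref" "$@" | samtools sort -o "$bam" - &&
        samtools index "$bam"
}
```

In a plain POSIX shell the pipeline's exit status is that of samtools sort, so a failure inside bwa mem may go unnoticed unless the output is checked separately.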
TYPICAL WORKFLOW
A common BWA workflow involves two main steps:
1. Indexing: Creating a BWA index for the reference genome using bwa index.
2. Alignment: Aligning sequencing reads to the indexed reference genome using bwa mem (for paired-end or single-end reads), or bwa aln followed by bwa samse/sampe (for older workflows).
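The two steps above can be sketched as shell functions. Both function names are hypothetical, and the check for an existing REF.fa.bwt file is an assumption about how one might avoid rebuilding an index, not something bwa does itself.

```shell
# Both functions are hypothetical wrappers around the steps listed above.

# Current workflow: index once, then align with bwa mem.
# Usage: bwa_mem_workflow REF.fa READS.fq [MATES.fq] > out.sam
bwa_mem_workflow() {
    ref=$1
    shift
    # bwa index writes REF.fa.bwt among other files; skip it if already present.
    [ -f "$ref.bwt" ] || bwa index "$ref" || return 1
    bwa mem "$ref" "$@"
}

# Older single-end workflow: bwa aln produces a .sai file that bwa samse
# turns into SAM. (Paired-end data would use two bwa aln runs and bwa sampe.)
# Usage: bwa_aln_workflow REF.fa READS.fq > out.sam
bwa_aln_workflow() {
    ref=$1
    reads=$2
    [ -f "$ref.bwt" ] || bwa index "$ref" || return 1
    bwa aln "$ref" "$reads" > "$reads.sai" &&
        bwa samse "$ref" "$reads.sai" "$reads"
}
```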
OUTPUT FORMAT
BWA writes alignments in the Sequence Alignment/Map (SAM) format to standard output by default. The stream can be piped to samtools to convert it to its binary equivalent, BAM, for more efficient storage and subsequent processing.
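A minimal sketch of that conversion, assuming samtools is on PATH. sam_to_bam is a hypothetical helper name, and the result is an unsorted BAM file.

```shell
# sam_to_bam is a hypothetical helper; it needs samtools on PATH.
# Usage: bwa mem ref.fa reads.fq | sam_to_bam out.bam
sam_to_bam() {
    # -b selects BAM output; the trailing - reads SAM from standard input.
    samtools view -b -o "$1" -
}
```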
HISTORY
BWA was originally developed by Heng Li. Its initial release focused on bwa aln (BWA-backtrack), designed for short Illumina reads. As sequencing technology advanced, bwa bwasw was introduced to handle longer reads. bwa mem followed in 2013 and quickly became the de facto standard for aligning next-generation sequencing reads, owing to its performance on reads from 70bp to 1Mbp, which suits a wider range of sequencing platforms and applications.
SEE ALSO
samtools(1), bowtie2(1), minimap2(1)