LinuxCommandLibrary

bwa

Align DNA sequences to a reference genome

TLDR

Index the reference genome

$ bwa index [path/to/reference.fa]

Map single-end reads (sequences) to the indexed genome using 32 [t]hreads and compress the result to save space
$ bwa mem -t 32 [path/to/reference.fa] [path/to/read_single_end.fq.gz] | gzip > [path/to/alignment_single_end.sam.gz]

Map paired-end reads (sequences) to the indexed genome using 32 [t]hreads and compress the result to save space
$ bwa mem -t 32 [path/to/reference.fa] [path/to/read_pair_end_1.fq.gz] [path/to/read_pair_end_2.fq.gz] | gzip > [path/to/alignment_pair_end.sam.gz]

Map paired-end reads (sequences) to the indexed genome using 32 [t]hreads, [M]arking shorter split hits as secondary so the SAM output is compatible with Picard, and compress the result
$ bwa mem -M -t 32 [path/to/reference.fa] [path/to/read_pair_end_1.fq.gz] [path/to/read_pair_end_2.fq.gz] | gzip > [path/to/alignment_pair_end.sam.gz]

Map paired-end reads (sequences) to the indexed genome using 32 [t]hreads, appending FASTA/Q [C]omments (e.g. BC:Z:CGTAC) to the SAM output, and compress the result
$ bwa mem -C -t 32 [path/to/reference.fa] [path/to/read_pair_end_1.fq.gz] [path/to/read_pair_end_2.fq.gz] | gzip > [path/to/alignment_pair_end.sam.gz]

SYNOPSIS

bwa <command> [options]

Examples:
bwa index [-p prefix] [-a algo] <ref.fa>
bwa mem [options] <ref> <in1.fq> [<in2.fq>]

PARAMETERS

index
    Index reference FASTA sequences into BWT and auxiliary files

mem
    Run BWA-MEM algorithm: seed, chain, align reads to SAM

aln
    Legacy: BWA-ALN gapped alignment to .sai files (short reads)

samse
    Convert single-end .sai alignments to SAM

sampe
    Convert paired-end .sai alignments to SAM

bwasw
    BWA-SW for long-query gapped alignment

-t N
    Number of threads (default: 1)

-R STR
    Read group header line (SAM @RG)

-c INT
    Skip seeds with more than INT occurrences (mem; default: 500)

-M
    Mark shorter split hits as secondary (mem; for Picard compatibility)
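
The -R string is easy to misquote, so it can help to build it in one place. The following is a minimal sketch, not part of bwa: make_read_group is a hypothetical helper, and the ID, sample, and platform values are placeholders. It prints the literal two-character sequence \t between fields, which bwa mem converts to tabs in the SAM header.

```shell
# Hypothetical helper: build a read-group line for `bwa mem -R`.
# Arguments: read-group ID, sample name, platform (all placeholders).
make_read_group() {
    # '\\t' in the printf format emits a literal backslash followed by "t";
    # bwa mem turns that sequence into a TAB when writing the @RG header.
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\n' "$1" "$2" "$3"
}

# Usage (commented out so the sketch runs without bwa installed):
# bwa mem -t 8 -R "$(make_read_group run1 sampleA ILLUMINA)" \
#     ref.fa reads_R1.fq reads_R2.fq > aligned.sam
```

Keeping the string in a function avoids retyping the escapes for every sample in a batch.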

DESCRIPTION

BWA (Burrows-Wheeler Aligner) is a popular open-source tool for mapping DNA sequencing reads to a reference genome. It uses the Burrows-Wheeler Transform (BWT) and FM-index to keep the reference index compact and to make lookups fast.

BWA was originally designed for short Illumina reads. The bwa mem subcommand also handles longer reads and paired-end data, and it offers presets for PacBio and Oxford Nanopore reads (-x pacbio, -x ont2d). It performs gapped alignment with soft clipping and passes base quality scores through to the SAM output.

The workflow typically starts with bwa index, which builds index files (.amb, .ann, .bwt, .pac, .sa) from a reference FASTA file. The alignment subcommands write SAM, which samtools can convert to BAM for use in GATK and similar pipelines.
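
Because indexing a large reference is slow, scripts often check for an existing index first. The following is a minimal sketch under two assumptions: the index was built with the default prefix (so the files sit next to the reference), and bwa_index_present is a hypothetical function name, not a bwa feature.

```shell
# Hypothetical helper: succeed only if all five files written by
# `bwa index` (default prefix) exist and are non-empty.
bwa_index_present() {
    ref=$1
    for ext in amb ann bwt pac sa; do
        [ -s "$ref.$ext" ] || return 1
    done
    return 0
}

# Usage (commented out so the sketch runs without bwa installed):
# bwa_index_present ref.fa || bwa index ref.fa
```

If the index was built with -p prefix, pass that prefix to the function instead of the FASTA path.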

bwa mem is the recommended algorithm for most data, combining seeding, chaining, and Smith-Waterman extension. The legacy mode (aln with samse/sampe) targets very short reads of up to about 100 bp and is generally not recommended for new projects.

BWA is widely used in NGS (Next-Generation Sequencing) analysis, most often as the mapping step before variant calling. It is not splice-aware, so RNA-seq reads are usually mapped with a spliced aligner instead. It is highly cited and still maintained.

CAVEATS

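Best for reads shorter than about 500 bp; use minimap2 for ultra-long reads. The reference must be indexed with bwa index before alignment. Legacy aln/samse/sampe is intended only for very short reads and is generally less accurate than mem on reads of about 70 bp or longer. Indexing and aligning against large genomes needs several gigabytes of memory.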
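
To judge which side of the read-length guidance a dataset falls on, the mean read length can be estimated from the FASTQ file. The following is a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); mean_read_length is a hypothetical helper name.

```shell
# Hypothetical helper: print the mean read length (truncated to an
# integer) of four-line FASTQ records read from standard input.
mean_read_length() {
    awk 'NR % 4 == 2 { total += length($0); n++ }
         END { if (n) { printf "%d\n", total / n } else { print 0 } }'
}

# Usage (commented out; the file name is a placeholder):
# gzip -dc reads.fq.gz | mean_read_length
```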

QUICK EXAMPLE

Index: bwa index ref.fa
Align PE: bwa mem -t 8 ref.fa reads_R1.fq reads_R2.fq | samtools sort -o aligned.bam

OUTPUT

Writes SAM to standard output, including MAPQ, CIGAR, and optional tags such as NM and MD. Pipe to samtools view or samtools sort to get BAM.
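
Since MAPQ is the fifth column of every SAM alignment line, low-confidence alignments can be dropped with a plain text filter. The following is a minimal sketch: filter_mapq is a hypothetical helper and the threshold of 30 in the usage line is an arbitrary example, not a recommendation from this page.

```shell
# Hypothetical helper: pass through SAM header lines (starting with "@")
# and alignment lines whose MAPQ (column 5) is at least the threshold.
filter_mapq() {
    awk -F '\t' -v min="$1" '/^@/ || $5 + 0 >= min + 0'
}

# Usage (commented out so the sketch runs without bwa installed):
# bwa mem ref.fa reads.fq | filter_mapq 30 > filtered.sam
```

samtools view -q applies a similar MAPQ cutoff and is the more common choice when the output is going to BAM anyway.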

HISTORY

Created by Heng Li in 2009 at the Wellcome Sanger Institute. The initial aln algorithm targeted short reads. BWA-MEM (v0.7, 2013) added seeding and chaining suited to longer reads. BWA is maintained on GitHub; v0.7.17 was released in 2017 and has been followed by further 0.7.x maintenance releases.

SEE ALSO

bowtie2(1), minimap2(1), samtools(1), gatk(1)
