samtools
Manipulate and analyze sequence alignment data
TLDR
Convert a SAM input file to BAM and save the result to a file
Take input from stdin (-) and print the SAM header and alignments to stdout (selecting reads that overlap a specific region requires an indexed file rather than a stream)
Sort file and save to BAM (the output format is automatically determined from the output file's extension)
Index a sorted BAM file (creates sorted_input.bam.bai)
Print alignment statistics about a file
Count alignments per reference sequence (chromosome/contig)
Merge multiple files
Split input file according to read groups
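The commands behind these summaries might look like the sketch below. All file names (input.sam, sorted_input.bam, and so on) and the region chr1:1000-2000 are placeholders chosen for illustration. The commands are wrapped in a shell function only so that nothing runs until the function is called. Because region queries need an index, the region example reads the indexed file rather than stdin.

```shell
# Illustrative sketch: every file name and the region below are placeholders.
samtools_tldr_examples() {
    # Convert a SAM file to BAM and write the result to a file
    samtools view -b -o output.bam input.sam

    # Read alignments from stdin (-) and print them, header included, to stdout;
    # any upstream command can take the place of cat
    cat input.sam | samtools view -h -

    # Sort by coordinate; BAM output is inferred from the .bam extension
    samtools sort -o sorted_input.bam input.bam

    # Index the sorted file (creates sorted_input.bam.bai)
    samtools index sorted_input.bam

    # Print the header plus reads overlapping a region (uses the index)
    samtools view -h sorted_input.bam chr1:1000-2000

    # Print alignment statistics
    samtools flagstat sorted_input.bam

    # Count alignments per reference sequence (chromosome/contig)
    samtools idxstats sorted_input.bam

    # Merge several sorted files into one
    samtools merge merged.bam input1.bam input2.bam

    # Split a file into one output file per read group
    samtools split merged.bam
}
```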
SYNOPSIS
samtools command [options] [arguments]
samtools operates via subcommands. To list the available subcommands, run samtools with no arguments or run samtools help. For help on a specific subcommand, use samtools help command.
Some frequently used subcommands include:
view: View, convert, and filter alignments.
sort: Sort alignments by coordinate or query name.
index: Index BAM/CRAM files for fast random access.
merge: Merge multiple sorted alignment files.
mpileup: Generate a pileup from alignment files.
flagstat: Compute flag statistics.
depth: Compute the depth of coverage.
faidx: Index/query FASTA reference sequence files.
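As a rough illustration of the subcommands listed above, the sketch below shows one plausible invocation each of help, faidx, depth, and mpileup. The names ref.fa and sorted.bam and the region chr1:1-100 are placeholders, and the function wrapper exists only so that nothing runs until it is called.

```shell
# Illustrative sketch: ref.fa, sorted.bam, and chr1:1-100 are placeholders.
samtools_subcommand_examples() {
    # Show help for a single subcommand
    samtools help view

    # Index a FASTA reference (creates ref.fa.fai)
    samtools faidx ref.fa

    # Fetch a subsequence from the indexed reference
    samtools faidx ref.fa chr1:1-100

    # Report depth of coverage per position
    samtools depth sorted.bam

    # Generate a pileup, using -f to name the reference FASTA
    samtools mpileup -f ref.fa sorted.bam
}
```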
PARAMETERS
(General Note)
Parameters are subcommand-specific. The options below belong to the view subcommand, which extracts, filters, and converts alignment files.
-b
Output in BAM format.
-S
Ignored by current versions and kept only for compatibility. Older versions required it for SAM input; the input format is now auto-detected.
-h
Include header in the output.
-o FILE
Write output to FILE.
-f FLAG
Only output alignments with all bits in FLAG set.
-F FLAG
Skip alignments with any bits in FLAG set.
-q INT
Only output alignments with mapping quality greater than or equal to INT.
-@
Set number of additional threads to use for I/O and compression.
-L FILE
Only output alignments overlapping regions specified in the BED file FILE.
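The way the -f, -F, and -q tests combine for a single alignment can be sketched in plain shell arithmetic. The helper passes_view_filter below is hypothetical and not part of samtools; it only mirrors the rules stated above: every required bit must be set, no excluded bit may be set, and the mapping quality must meet the threshold.

```shell
# Sketch of the -f / -F / -q logic for one alignment.
# passes_view_filter is a hypothetical helper, not a samtools command.
# Usage: passes_view_filter FLAG MAPQ REQUIRED EXCLUDED MIN_MAPQ
passes_view_filter() {
    flag=$1 mapq=$2 required=$3 excluded=$4 min_mapq=$5
    # -f: every bit in REQUIRED must be set in FLAG
    [ $(( flag & required )) -eq "$required" ] || return 1
    # -F: no bit in EXCLUDED may be set in FLAG
    [ $(( flag & excluded )) -eq 0 ] || return 1
    # -q: mapping quality must be at least MIN_MAPQ
    [ "$mapq" -ge "$min_mapq" ]
}

# FLAG 99 = 1 + 2 + 32 + 64 (paired, properly paired, mate reverse, first in pair)
if passes_view_filter 99 60 2 4 30; then
    echo "kept"       # prints "kept": bit 2 is set, bit 4 is clear, 60 >= 30
else
    echo "skipped"
fi
```

With samtools itself, the same selection would be requested as samtools view -b -f 2 -F 4 -q 30 -o filtered.bam input.bam, where both file names are placeholders.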
DESCRIPTION
samtools is a suite of utilities for working with high-throughput sequencing data in the SAM, BAM, and CRAM formats, which store sequence alignments produced by next-generation sequencing (NGS) experiments. Its main operations are viewing and converting between formats, sorting alignments by coordinate or query name, merging alignment files, and building indexes for fast random access to specific genomic regions. It can also generate pileups, compute alignment statistics, and extract subsets of data. samtools is built on the HTSlib library and is widely used in bioinformatics workflows for quality control, preparation for variant calling, and general data exploration.
CAVEATS
samtools operations, especially sorting large files, can be memory and disk I/O intensive, so make sure enough memory and temporary disk space are available. CRAM files and operations such as mpileup often need the reference genome FASTA file. Random access to BAM/CRAM files relies on a separate index file (e.g., .bam.bai or .cram.crai) generated by samtools index; by default samtools looks for it alongside the alignment file.
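The resource and reference concerns above map onto a few options, sketched here. The wrapper names sort_with_limits and bam_to_cram are hypothetical, and the thread count, memory limit, and temporary-file prefix are arbitrary values picked for illustration.

```shell
# Illustrative sketch: wrapper names and all numeric values are assumptions.

# Sort with 4 additional threads (-@), roughly 1 GiB of memory per thread (-m),
# and temporary files written under the given prefix (-T).
# Usage: sort_with_limits INPUT.bam OUTPUT.bam
sort_with_limits() {
    samtools sort -@ 4 -m 1G -T /tmp/sort_tmp -o "$2" "$1"
}

# Convert BAM to CRAM (-C); for view, -T names the reference FASTA.
# Usage: bam_to_cram INPUT.bam REFERENCE.fa OUTPUT.cram
bam_to_cram() {
    samtools view -C -T "$2" -o "$3" "$1"
}
```

A call such as sort_with_limits input.bam sorted.bam followed by bam_to_cram sorted.bam ref.fa sorted.cram would sort within those limits and then write a CRAM file encoded against ref.fa.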
HTSLIB FOUNDATION
samtools is built on HTSlib, a C library for reading and writing high-throughput sequencing data formats. HTSlib supplies the file parsing, compression, and indexing routines that samtools relies on to handle large genomic datasets efficiently.
INDEXING FOR PERFORMANCE
To enable fast random access to specific regions within large BAM or CRAM files, create an index file (e.g., .bam.bai or .cram.crai) with the samtools index subcommand. The file must be coordinate-sorted before it is indexed. Without an index, retrieving a region (for example, viewing a specific genomic region or generating a pileup for it) would mean scanning the entire file, which is highly inefficient for large datasets.
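The sort, index, and query sequence described above could be chained as in the sketch below. extract_region is a hypothetical helper, and the naming of the intermediate sorted file is an assumption made for this example.

```shell
# Illustrative sketch: extract_region is a hypothetical helper.
# Usage: extract_region INPUT.bam REGION OUTPUT.bam
extract_region() {
    input=$1 region=$2 output=$3
    # Name the intermediate file INPUT.sorted.bam (an arbitrary convention)
    sorted="${input%.bam}.sorted.bam"

    # Coordinate-sort first; indexing needs sorted input
    samtools sort -o "$sorted" "$input" || return 1

    # Build the index (written as "$sorted.bai")
    samtools index "$sorted" || return 1

    # Use the index to fetch only the alignments overlapping REGION
    samtools view -b -o "$output" "$sorted" "$region"
}
```

For example, extract_region sample.bam chr1:1000-2000 region.bam would leave sample.sorted.bam, its index, and region.bam in the working directory.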
HISTORY
samtools was originally written by Heng Li and was introduced in 2009 together with the SAM/BAM format. As SAM/BAM gained widespread adoption as the standard for high-throughput sequencing alignments, samtools became a standard part of bioinformatics pipelines, and its development has closely paralleled the growth of NGS. Since the 1.0 release it has been built on the HTSlib C library, which provides the routines for reading and writing SAM, BAM, and CRAM files. It is maintained as an open-source project, primarily on GitHub.