LinuxCommandLibrary

samtools

Manipulate and analyze sequence alignment data

TLDR

Convert a SAM input file to BAM stream and save to file

$ samtools view -S [[-b|--bam]] [input.sam] > [output.bam]

Take input from stdin (-) and print the SAM header and any reads overlapping a specific region to stdout
$ [other_command] | samtools view [[-h|--with-header]] - [chromosome:start-end]

Sort file and save to BAM (the output format is automatically determined from the output file's extension)
$ samtools sort [input] [[-o|--output]] [output.bam]

Index a sorted BAM file (creates sorted_input.bam.bai)
$ samtools index [sorted_input.bam]

Print alignment statistics about a file
$ samtools flagstat [sorted_input]

Count alignments to each index (chromosome/contig)
$ samtools idxstats [sorted_indexed_input]

Merge multiple files
$ samtools merge [output] [input1 input2 ...]

Split input file according to read groups
$ samtools split [merged_input]
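The commands above can be chained into a single workflow. The following is a minimal sketch, assuming a throwaway one-read SAM file written to a temporary directory; the file names and the read itself are invented for illustration, and the samtools steps run only if samtools is installed.

```shell
# Write a tiny, hand-made SAM file (one @HD line, one reference, one read).
DEMO_DIR=$(mktemp -d)
SAM="$DEMO_DIR/demo.sam"
{
    printf '@HD\tVN:1.6\tSO:unsorted\n'
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf 'r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
} > "$SAM"

if command -v samtools >/dev/null 2>&1; then
    # Sort by coordinate, index the sorted BAM, then summarize it.
    samtools sort -o "$DEMO_DIR/demo.sorted.bam" "$SAM"
    samtools index "$DEMO_DIR/demo.sorted.bam"
    samtools flagstat "$DEMO_DIR/demo.sorted.bam"
else
    echo "samtools not found; wrote example input to $SAM"
fi
```

Sorting comes before indexing because samtools index expects coordinate-sorted input.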

SYNOPSIS

samtools command [options] [arguments]

samtools operates via subcommands. To see a list of available subcommands, run samtools without any arguments or with samtools help. For help on a specific subcommand, use samtools help command.

Some frequently used subcommands include:
  view: View, convert, and filter alignments.
  sort: Sort alignments by coordinate or query name.
  index: Index BAM/CRAM files for fast random access.
  merge: Merge multiple sorted alignment files.
  mpileup: Generate a pileup from alignment files.
  flagstat: Compute flag statistics.
  depth: Compute the depth of coverage.
  faidx: Index/query FASTA reference sequence files.

PARAMETERS

(General Note)
    Parameters are subcommand-specific. The options below are common for the view subcommand, used for extracting, filtering, and converting alignment files.

-b
    Output in BAM format.

-S
    Ignored in current versions, where the input format is auto-detected. Older versions required it when the input was SAM.

-h
    Include header in the output.

-o FILE
    Write output to FILE (default: standard output).

-f INT
    Only output alignments with all bits set in INT present in the FLAG field.

-F INT
    Skip alignments with any bits set in INT present in the FLAG field.

-q INT
    Only output alignments with mapping quality greater than or equal to INT.

-@ INT
    Set the number of additional threads to use for I/O and compression.

-L FILE
    Only output alignments overlapping regions specified in FILE (BED format).
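
The -f and -F filters are bitwise tests against the FLAG field. Below is a minimal shell sketch of the two rules, using FLAG bit values from the SAM specification (0x1 paired, 0x2 properly paired, 0x4 unmapped, 0x100 secondary). The helper names passes_f and passes_F are hypothetical and exist only for illustration; they are not part of samtools.

```shell
# Hypothetical helpers mirroring how `samtools view -f` / `-F` test FLAG bits.

# -f MASK: keep the read only if every bit in MASK is set in FLAG.
passes_f() {
    [ $(( $1 & $2 )) -eq $(( $2 )) ]
}

# -F MASK: keep the read only if no bit in MASK is set in FLAG.
passes_F() {
    [ $(( $1 & $2 )) -eq 0 ]
}

# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40
# (paired, properly paired, mate on reverse strand, first in pair).
if passes_f 99 $(( 0x1 | 0x2 )) && passes_F 99 $(( 0x4 | 0x100 )); then
    echo "FLAG 99 is kept by: -f 3 -F 260"
fi
```

On a real file the same filter would be written as samtools view -f 3 -F 260 [input.bam], keeping properly paired reads that are neither unmapped nor secondary.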

DESCRIPTION

samtools is a comprehensive suite of utilities designed for interacting with high-throughput sequencing data in SAM, BAM, and CRAM formats. These formats store biological sequence alignments, which are crucial outputs from next-generation sequencing (NGS) experiments. The toolkit provides essential functionalities for managing and manipulating these large files efficiently. Key operations include viewing and converting file formats, sorting alignments by coordinate or query name, merging multiple alignment files, and creating indexes for fast random access to specific genomic regions. It also enables generation of pileups, calculation of alignment statistics, and extraction of subsets of data. Built upon the HTSlib library, samtools is a cornerstone tool in bioinformatics workflows, widely used for quality control, variant calling preparation, and general data exploration.

CAVEATS

samtools operations, especially sorting large files, can be memory and disk I/O intensive, so make sure sufficient system resources are available. CRAM files and certain operations such as mpileup often require a reference genome FASTA file. Random access to BAM/CRAM files relies on a separate index file (e.g., .bam.bai or .cram.crai) generated by samtools index; by default samtools looks for it alongside the alignment file.
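
For sorting, memory use can be bounded explicitly. samtools sort takes -m as an approximate per-thread memory limit and -@ as a thread count, so the total is roughly their product, and -T sets the prefix for temporary files. The sketch below divides a memory budget across threads; the budget, the file names, and the per_thread_mem helper are assumptions for illustration.

```shell
# Hypothetical helper: split a total memory budget (in MiB) across sort threads
# and print it in the form `samtools sort -m` accepts (e.g. 2048M).
per_thread_mem() {
    echo "$(( $1 / $2 ))M"
}

THREADS=4
MEM=$(per_thread_mem 8192 "$THREADS")   # 8 GiB budget shared by 4 threads

# Run the sort only if samtools and the (placeholder) input file are present.
if command -v samtools >/dev/null 2>&1 && [ -f input.bam ]; then
    samtools sort -@ "$THREADS" -m "$MEM" -T /tmp/sort_tmp -o sorted.bam input.bam
else
    echo "would run: samtools sort -@ $THREADS -m $MEM -T /tmp/sort_tmp -o sorted.bam input.bam"
fi
```

When the data does not fit in the given memory, samtools sort spills to temporary files under the -T prefix, so that location needs enough free disk space.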

HTSLIB FOUNDATION

samtools is built on top of HTSlib, a C library for reading and writing high-throughput sequencing data formats. HTSlib handles the parsing, compression, and indexing of SAM, BAM, and CRAM files, which is what allows samtools to process large genomic datasets efficiently.

INDEXING FOR PERFORMANCE

To enable fast random access to specific regions within large BAM or CRAM files, create an index file (e.g., .bam.bai or .cram.crai) with the samtools index subcommand. The alignment file must be coordinate-sorted before it can be indexed. Without an index, samtools cannot jump directly to a genomic region; retrieving that region would mean reading the file from the beginning, which is highly inefficient for large datasets.
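
A small sketch of the check this implies: derive the conventional index name for an alignment file and confirm it exists before attempting a region query. The helper names, the file name, and the region chr1:1000-2000 are hypothetical. samtools also accepts other index names (for example a .csi index), which this sketch does not cover.

```shell
# Hypothetical helper: print the conventional index path for a BAM or CRAM file.
index_path_for() {
    case "$1" in
        *.bam)  echo "$1.bai" ;;
        *.cram) echo "$1.crai" ;;
        *)      return 1 ;;
    esac
}

# Hypothetical helper: succeed only if that index file exists on disk.
has_index() {
    idx=$(index_path_for "$1") && [ -f "$idx" ]
}

ALN=sorted.bam   # placeholder file name
if command -v samtools >/dev/null 2>&1 && [ -f "$ALN" ]; then
    # Build the index if it is missing, then query one region.
    has_index "$ALN" || samtools index "$ALN"
    samtools view -h "$ALN" chr1:1000-2000
else
    echo "index expected at: $(index_path_for "$ALN")"
fi
```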

HISTORY

samtools was originally written by Heng Li, who had earlier written the MAQ alignment software, and was introduced together with the SAM/BAM format in 2009. As SAM/BAM gained widespread adoption as the standard for high-throughput sequencing data, samtools became an indispensable tool in bioinformatics pipelines, and its development closely paralleled the growth of NGS. Its file-handling code was later separated into the HTSlib C library, which provides the routines for SAM, BAM, and CRAM file manipulation that samtools now builds on. It is maintained as an open-source project, primarily on GitHub.

SEE ALSO

bcftools(1), tabix(1), bedtools(1), htslib(3)
