tabix

Index position-sorted, TAB-delimited files and retrieve records by region

SYNOPSIS

tabix [options] <file.gz> [region1 [region2 [...]]]
tabix -p <gff|bed|sam|vcf> <file.gz>
tabix -l <file.gz>

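-p
    Preset for indexing: gff (columns 1, 4, 5), bed (columns 1, 2, 3; 0-based), sam (columns 3, 4), or vcf (columns 1, 2). Coordinates are 1-based unless noted. The preset sets the sequence, start, and end columns automatically, so it should not be combined with -s, -b, -e, or -c. It is not needed for queries because the setting is stored in the index. Default is gff.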

-s
    Column number for sequence name (1-based). Default is 1.

-b
    Column number for start coordinate (1-based). Default is 4.

-e
    Column number for end coordinate (1-based). May be the same as the start column. Default is 5.

-c
    Skip lines beginning with this character. Default is '#'.

-S
    Skip the first INT lines of the file (e.g., header lines) before indexing. Default is 0.

-0
    Treat positions in the data file as 0-based (e.g., BED or UCSC files) rather than 1-based. Default is 1-based.

-f
    Overwrite the index file if it already exists.

-R
    Restrict output to the regions listed in a file (one region per line).

-D
    Do not download the index file before opening it. Applies to remote files only.

-l
    List sequence names (chromosomes/contigs) from the index.

-H
    Print only the header lines (the lines skipped by -S or starting with the -c character).

-h
    Print the header lines in addition to the records matching the query.

--verbosity
    Set the verbosity (an integer level) of logging messages printed to standard error.

DESCRIPTION

tabix indexes large TAB-delimited files of genomic positions, such as BED, GFF, VCF, and SAM files, and retrieves the records that overlap a given region.

It creates a binary index file (.tbi) alongside the compressed data file. With the index, a query reads and decompresses only the blocks that can contain the requested region, rather than scanning the whole file.

Input files must be compressed with BGZF (Block-Gzip) and sorted by sequence name and start coordinate. A query returns every line whose interval overlaps one of the specified regions.
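The overlap test and the 0-based/1-based distinction can be illustrated with a short Python sketch. This is not tabix's implementation; the function names are hypothetical, and it assumes that 0-based records are half-open (BED style) and that 1-based records are closed.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and closed


def to_one_based_closed(start: int, end: int, zero_based: bool) -> Interval:
    """Normalize a record's coordinates to a 1-based, closed interval.

    Assumption: 0-based input is half-open [start, end), as in BED.
    """
    if zero_based:
        return start + 1, end
    return start, end


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two 1-based, closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


# A BED record covering the first 100 bases of a sequence.
bed_record = to_one_based_closed(0, 100, zero_based=True)
```

Under these assumptions, a BED record with start 0 and end 100 overlaps a query for positions 100-200 but not one for positions 101-200.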

CAVEATS

The input file must be compressed with BGZF (Block-Gzip), for example with bgzip. Files compressed with standard gzip cannot be indexed.
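A BGZF file is a series of gzip blocks whose headers carry an extra subfield tagged "BC", so it can be told apart from plain gzip by inspecting the first bytes. The following Python sketch shows one way to do that check; the function name is hypothetical and it only examines the first block header.

```python
import gzip
import struct

# The 28-byte empty block that terminates a well-formed BGZF file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 42430200 1b00 0300 00000000 00000000"
)


def is_bgzf(header: bytes) -> bool:
    """Return True if `header` (the first bytes of a file) starts a BGZF block."""
    # Require the gzip magic (1f 8b), the deflate method (08), and the
    # FEXTRA flag bit (0x04), which signals that an extra field follows.
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for the "BC" tag with a 2-byte payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# Plain gzip output has no extra field, so it is not BGZF.
plain_gzip = gzip.compress(b"chr1\t100\t200\n")
```

To check a file on disk, read its first few dozen bytes in binary mode and pass them to the function.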

The input file must also be sorted by sequence name and then by start coordinate. An unsorted file cannot be indexed correctly, and queries against it will return wrong results.
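A file's sort order can be checked before compressing and indexing it. The Python sketch below is a hypothetical helper, not part of tabix. It assumes that "sorted" means each sequence name occupies one contiguous block of lines and that start coordinates never decrease within a block; column numbers are 1-based, as in -s and -b.

```python
from typing import Iterable, Optional


def first_unsorted_line(lines: Iterable[str], seq_col: int = 1, beg_col: int = 2,
                        comment: str = "#") -> Optional[int]:
    """Return the 1-based number of the first out-of-order line, or None."""
    finished = set()     # sequence names whose block has already ended
    current_seq = None
    last_start = None
    for number, line in enumerate(lines, start=1):
        # Ignore blank lines and comment/header lines.
        if not line.strip() or line.startswith(comment):
            continue
        fields = line.rstrip("\n").split("\t")
        seq, start = fields[seq_col - 1], int(fields[beg_col - 1])
        if seq != current_seq:
            if seq in finished:
                return number        # sequence reappears after its block ended
            if current_seq is not None:
                finished.add(current_seq)
            current_seq, last_start = seq, start
        elif start < last_start:
            return number            # start coordinate went backwards
        else:
            last_start = start
    return None
```

For a GFF file, pass beg_col=4 so that the start coordinate is read from the fourth column.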

By default, tabix looks for the index at <file.gz>.tbi, in the same directory as the compressed data file. If the index is missing from that location, queries fail.

HISTORY

tabix was written by Heng Li and is now maintained as part of HTSlib, the library that also underpins samtools and bcftools. It was created to handle the growing size of genomic datasets by providing a generic way to index and query large, coordinate-sorted, TAB-delimited files. Built on the BGZF compression format, it has become a standard way to index formats such as VCF and BED.