tabix
Index and query indexed, position-sorted text files
SYNOPSIS
tabix [options] in.tab.bgz [region1 [region2 [...]]]
PARAMETERS
-p
Set the input format for indexing.
Valid values: gff (GFF/GTF), bed (BED), sam (SAM), vcf (VCF). Defaults to gff. Do not combine with -s, -b, -e, -c, or -0.
-s
Sequence name column (1-based index). Defaults to 1.
-b
Start position column (1-based index). Defaults to 4.
-e
End position column (1-based index); may be the same as the start column. Defaults to 5.
-S
Skip the first INT lines of the data file. Defaults to 0.
-c
Skip lines that start with this character (comment or meta lines). Defaults to #.
-0
Treat positions in the data file as 0-based (as in UCSC BED files) rather than 1-based.
-f
Force re-indexing even if the index file already exists.
-h
Also print the header/meta lines when retrieving regions.
-H
Print only the header/meta lines.
DESCRIPTION
Tabix is a command-line tool for indexing BGZF-compressed, TAB-delimited files such as VCF, GFF, or BED. It creates an index file (.tbi) that allows specific regions to be retrieved from the compressed data without decompressing the entire file, which matters for large genomic datasets. Tabix only works on files that are position-sorted and compressed with BGZF (Blocked GNU Zip Format), a gzip-compatible variant that supports random access; such files are typically produced with bgzip.
Indexing relies on the sequence names and positions stored in the file. The index holds offsets into the compressed data, so that tabix itself, bcftools, or custom scripts built on htslib can fetch regions of interest rapidly. Either a preset (-p) or explicit column options (-s, -b, -e) tell tabix where to find the coordinates, which accommodates different column layouts. Because these settings are stored in the index, they do not need to be repeated at query time.
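As a rough illustration of the two ways to describe a file's layout, the Python sketch below assembles a tabix indexing command either from a preset or from explicit column numbers. The helper names (build_index_cmd, run_if_available) and the file names are hypothetical; only the flags come from the PARAMETERS section. The command is run only if a tabix binary is found on PATH.

```python
import shutil
import subprocess


def build_index_cmd(path, preset=None, seq_col=None, start_col=None,
                    end_col=None, zero_based=False, force=False):
    """Return the argv list for indexing `path` (hypothetical helper)."""
    cmd = ["tabix"]
    if force:
        cmd.append("-f")
    if preset is not None:
        # A preset already fixes the column layout, so reject explicit columns.
        if zero_based or any(c is not None for c in (seq_col, start_col, end_col)):
            raise ValueError("use either a preset or explicit columns, not both")
        cmd += ["-p", preset]
    else:
        for flag, value in (("-s", seq_col), ("-b", start_col), ("-e", end_col)):
            if value is not None:
                cmd += [flag, str(value)]
        if zero_based:
            cmd.append("-0")
    cmd.append(path)
    return cmd


def run_if_available(cmd):
    """Run the command only when its executable is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Preset for a VCF file (file name is a placeholder).
print(build_index_cmd("calls.vcf.gz", preset="vcf"))
# Custom 0-based layout: sequence in column 1, start in 2, end in 3.
print(build_index_cmd("peaks.tsv.gz", seq_col=1, start_col=2, end_col=3,
                      zero_based=True))
```

Building the argument list separately from running it keeps the example usable on machines without tabix installed and avoids shell quoting issues.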
CAVEATS
The input file must be sorted before indexing: records are grouped by sequence name and ordered by start position within each sequence. Improper sorting leads to failed indexing or incorrect retrieval. The file must also be BGZF-compressed (for example with bgzip); regular gzip output will not work.
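Both requirements can be checked before indexing. The Python sketch below is one possible pre-flight check, not part of tabix: looks_like_bgzf inspects the gzip header for the "BC" extra subfield that BGZF blocks carry, and first_unsorted_line scans uncompressed text for the first record that breaks the required order. Both function names and the column defaults (sequence in column 1, position in column 2) are assumptions chosen for the example.

```python
import gzip
import struct


def looks_like_bgzf(head):
    """True if `head` (the first bytes of a file) starts a BGZF block."""
    # gzip magic bytes plus the deflate method.
    if len(head) < 12 or head[0] != 0x1F or head[1] != 0x8B or head[2] != 8:
        return False
    if not head[3] & 0x04:  # FEXTRA flag must be set for BGZF
        return False
    (xlen,) = struct.unpack("<H", head[10:12])
    extra = head[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (0x42, 0x43, 2):  # 'B', 'C', 2-byte payload
            return True
        pos += 4 + slen
    return False


def first_unsorted_line(lines, seq_col=1, start_col=2, comment="#"):
    """Return the 1-based number of the first out-of-order line, or None."""
    seen = set()
    current = None
    last_pos = None
    for lineno, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith(comment):
            continue
        fields = line.rstrip("\n").split("\t")
        seq, pos = fields[seq_col - 1], int(fields[start_col - 1])
        if seq != current:
            if seq in seen:
                return lineno  # sequence reappears after a different one
            seen.add(seq)
            current, last_pos = seq, pos
        elif pos < last_pos:
            return lineno  # position goes backwards within a sequence
        else:
            last_pos = pos
    return None


# Plain gzip output lacks the BGZF subfield.
print(looks_like_bgzf(gzip.compress(b"chr1\t100\n")))
# The third line goes backwards on chr1.
print(first_unsorted_line(["#header\n", "chr1\t100\n", "chr1\t50\n"]))
```

For a file on disk, passing the first 64 bytes (read in binary mode) to looks_like_bgzf is normally enough, since the BGZF subfield sits at the start of the first block.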
INDEXING PROCESS
The Tabix indexing process involves scanning the input file, identifying chromosomal coordinates and positions, and storing offsets into the compressed data within the index file (.tbi). This index allows subsequent queries to quickly locate the relevant compressed blocks without decompressing the entire file.
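The idea of storing offsets per coordinate range can be illustrated with a deliberately simplified toy in Python. This is not the .tbi format: it works on uncompressed bytes held in memory, indexes only start positions, assumes comment lines appear only at the top of the data, and uses a fixed 16384-base window chosen for the example. The function names build_toy_index and fetch are hypothetical.

```python
import io

WINDOW = 16384  # toy window size in bases (an assumption for this sketch)


def build_toy_index(data, seq_col=1, start_col=2):
    """Map (sequence, window number) to the byte offset of its first record."""
    index = {}
    offset = 0
    for raw in io.BytesIO(data):
        if raw.strip() and not raw.startswith(b"#"):
            fields = raw.rstrip(b"\n").split(b"\t")
            seq = fields[seq_col - 1].decode()
            pos = int(fields[start_col - 1])
            # Keep only the first offset seen for each window.
            index.setdefault((seq, pos // WINDOW), offset)
        offset += len(raw)
    return index


def fetch(data, index, seq, start, end, seq_col=1, start_col=2):
    """Return records on `seq` whose position lies in [start, end]."""
    windows = range(start // WINDOW, end // WINDOW + 1)
    offsets = [index[(seq, w)] for w in windows if (seq, w) in index]
    if not offsets:
        return []
    hits = []
    # Jump straight to the earliest candidate record and scan forward.
    for raw in io.BytesIO(data[min(offsets):]):
        fields = raw.rstrip(b"\n").split(b"\t")
        if fields[seq_col - 1].decode() != seq:
            break  # left the requested sequence
        pos = int(fields[start_col - 1])
        if pos > end:
            break  # sorted input: nothing further can match
        if pos >= start:
            hits.append(raw.decode().rstrip("\n"))
    return hits


data = b"#CHROM\tPOS\nchr1\t100\nchr1\t20000\nchr1\t40000\nchr2\t50\n"
index = build_toy_index(data)
print(fetch(data, index, "chr1", 15000, 45000))
```

The early exits in fetch are only valid because the input is sorted, which is the same reason tabix requires position-sorted files.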
QUERYING INDEXED FILES
Once a file is indexed, you can query specific regions with tabix itself or with other htslib-based tools such as bcftools. A region names the sequence and, optionally, the start and end positions, written as chr:start-end (for example chr1:1000-2000). The tool uses the index to identify the relevant compressed blocks and extracts the records that overlap the region.
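A query can be scripted along the following lines. This Python sketch formats region strings and assembles the tabix command line; region_string, build_query_cmd, and query are hypothetical helper names, and calls.vcf.gz is a placeholder file. The query helper returns None when no tabix binary is found on PATH.

```python
import shutil
import subprocess


def region_string(seq, start=None, end=None):
    """Format a region as 'seq' or 'seq:start-end'."""
    if start is None and end is None:
        return seq
    if start is None or end is None:
        raise ValueError("give both start and end, or neither")
    return f"{seq}:{start}-{end}"


def build_query_cmd(path, regions, with_header=False):
    """Return the argv list for retrieving `regions` from an indexed file."""
    cmd = ["tabix"]
    if with_header:
        cmd.append("-h")  # also print the header/meta lines
    cmd.append(path)
    cmd.extend(regions)
    return cmd


def query(path, regions, with_header=False):
    """Run the query and return output lines, or None if tabix is missing."""
    if shutil.which("tabix") is None:
        return None
    result = subprocess.run(build_query_cmd(path, regions, with_header),
                            check=True, capture_output=True, text=True)
    return result.stdout.splitlines()


regions = [region_string("chr1", 1000, 2000), region_string("chr2")]
print(build_query_cmd("calls.vcf.gz", regions, with_header=True))
```

Several regions can be passed in one invocation, which avoids starting a new process for each lookup.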
HISTORY
Tabix was developed to provide efficient indexing for large genomic data files. It builds on the BGZF format and facilitates quick access to specific genomic regions. It is widely used in bioinformatics pipelines for processing VCF, GFF, and BED files.