bgzip

Compress genomic data files with indexing

SYNOPSIS


bgzip [options] [file...]

PARAMETERS

-c, --stdout
    Write output to standard output, keep original files unchanged.

-d, --decompress
    Decompress files.

-i, --index
    Create a BGZF index (.gzi) while compressing. This index maps uncompressed offsets to compressed block offsets; it is not a tabix TBI index, and the data does not need to be coordinate-sorted.

-I file, --index-name file
    Specify the filename for the index (used with -i). Defaults to the compressed file name with .gzi appended.

-f, --force
    Overwrite existing output files without prompting.

-r, --reindex
    Rebuild the BGZF index (.gzi) for an existing compressed file.

-s num, --size num
    Decompress num bytes (measured in uncompressed data) to standard output. Typically combined with -b.

-t, --test
    Test integrity of compressed file.

-b offset, --offset offset
    Decompress to standard output starting at the given 0-based uncompressed offset. Implies -c and -d, and relies on a .gzi index for the file.

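-l level, --compress-level level
    Compression level to use, from 0 to 9, or -1 for the compression library's default. Lower levels are faster but produce larger files.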

-o file, --output file
    Write output to specified file.

-@ num, --threads num
    Number of threads to use for compression/decompression (default 1).

-k, --keep
    Keep (don't delete) input files during compression.

-h, --help
    Display help message and exit.

DESCRIPTION

bgzip compresses files into the Block Gzip Format (BGZF). Unlike standard gzip, which produces a single compressed stream, bgzip divides the input into small blocks (at most 64 KiB each) and compresses each one independently as its own gzip member. Because every block can be decompressed on its own, a reader with a suitable index can reach any part of the file without decompressing everything before it, which matters for large datasets.

BGZF is widely used in bioinformatics. The BAM and BCF formats use it internally, and bgzip is commonly applied to tab-delimited text files such as VCF, BED, and GFF so that tabix can index them and retrieve specific regions quickly. The output is a valid gzip file that gzip or gunzip can decompress.

CAVEATS

bgzip output carries a .gz extension and is a valid gzip file, so any gzip-compatible tool can decompress it sequentially. The reverse does not hold: files produced by plain gzip lack the block structure and must be recompressed with bgzip before tools such as tabix can index them. Because every block carries its own header and is compressed independently, bgzip files are usually slightly larger than the equivalent gzip output. bgzip is not intended as a general-purpose gzip replacement.
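
Since both kinds of file share the .gz extension, it can help to check which one is at hand. The following is a minimal Python sketch, not part of bgzip itself: it looks for the 'BC' extra subfield that BGZF blocks carry in their gzip header and for the 28-byte empty block that marks the end of a BGZF file. The function names looks_like_bgzf and has_eof_marker are hypothetical helpers introduced for illustration.

```python
import gzip
import struct

# The empty BGZF block that well-formed files end with (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def looks_like_bgzf(buf: bytes) -> bool:
    """Heuristic check on the first gzip header: FEXTRA set and a 'BC' subfield present."""
    if len(buf) < 18 or buf[:3] != b"\x1f\x8b\x08" or not buf[3] & 4:
        return False
    (xlen,) = struct.unpack_from("<H", buf, 10)
    extra = buf[12 : 12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen  # skip to the next extra subfield
    return False


def has_eof_marker(buf: bytes) -> bool:
    """True if the data ends with the empty BGZF end-of-file block."""
    return buf.endswith(BGZF_EOF)


plain = gzip.compress(b"chr1\t100\tA\n")  # ordinary gzip, no extra field
print("plain gzip:", looks_like_bgzf(plain), has_eof_marker(plain))
print("BGZF EOF block:", looks_like_bgzf(BGZF_EOF), has_eof_marker(BGZF_EOF))
```

In practice the same check would be run on the first and last bytes of a file read from disk rather than on in-memory data.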

BGZF FORMAT

The Block Gzip Format (BGZF) is a variant of the gzip format: a series of concatenated gzip members (blocks), each holding at most 64 KiB of uncompressed data. Each block is compressed independently and records its own total size in an extra field of the gzip header. This structure allows tools to skip directly to specific blocks based on an index, enabling efficient random access when only small portions of a large dataset are needed at a time. A well-formed file ends with an empty block that serves as an end-of-file marker.
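
The block layout can be illustrated with a short Python sketch that builds BGZF blocks by hand using only the standard library. This is a simplified illustration under stated assumptions, not bgzip's own implementation: bgzf_block and block_size are hypothetical helper names, the compression level is fixed at 6, and block_size assumes the 'BC' subfield is the first (and only) extra subfield in the header.

```python
import gzip
import struct
import zlib


def bgzf_block(data: bytes) -> bytes:
    """Wrap data in one BGZF block: a gzip member with a 'BC' extra subfield."""
    if len(data) > 0xFF00:
        raise ValueError("too much data for a single block")
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # wbits=-15: raw DEFLATE
    cdata = deflater.compress(data) + deflater.flush()
    total = 18 + len(cdata) + 8  # header + payload + CRC32/ISIZE trailer
    header = struct.pack(
        "<4BI2BH2BHH",
        0x1F, 0x8B, 8, 4,    # gzip magic, CM=deflate, FLG=FEXTRA
        0,                   # MTIME
        0, 0xFF,             # XFL, OS=unknown
        6,                   # XLEN: length of the extra field
        ord("B"), ord("C"),  # subfield identifier
        2,                   # SLEN: length of the subfield payload
        total - 1,           # BSIZE: total block size minus one
    )
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + cdata + trailer


def block_size(buf: bytes, offset: int = 0) -> int:
    """Total size of the block starting at offset, read from its BSIZE field."""
    (bsize,) = struct.unpack_from("<H", buf, offset + 16)
    return bsize + 1


lines = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t50\tG\n"]
blocks = [bgzf_block(line) for line in lines] + [bgzf_block(b"")]  # empty EOF block
stream = b"".join(blocks)

# Walk the stream block by block without decompressing anything.
offsets = []
pos = 0
while pos < len(stream):
    offsets.append(pos)
    pos += block_size(stream, pos)

print("block start offsets:", offsets)
# A standard gzip reader sees the stream as ordinary concatenated members.
print(gzip.decompress(stream) == b"".join(lines))
```

One record per block is used here only to keep the example small; real files pack as much data as fits into each block.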

RANDOM ACCESS AND INDEXING

The primary advantage of bgzip over standard gzip is its ability to support random access. By compressing data in blocks, an accompanying index (often created by tabix or samtools for coordinate-sorted files) can store the file offset for each data record or genomic region. When a specific region is requested, the tool can use the index to jump directly to the relevant BGZF block, decompress only that block, and retrieve the data. This capability significantly speeds up queries on large, sorted datasets, a common requirement in genomics.
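
The lookup described above can be sketched in Python. BGZF-aware indexes address data with a virtual offset: the byte offset of a block in the compressed file, shifted left by 16 bits, combined with the offset of the record inside that block's uncompressed data. The sketch below is a toy under stated assumptions: the dictionary index is a stand-in for a real index and is not the TBI or .gzi format, and make_block, read_block, and fetch_line are hypothetical helpers.

```python
import struct
import zlib


def make_block(data: bytes) -> bytes:
    """Build one BGZF block (gzip member with a 'BC' extra subfield) from data."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw DEFLATE
    cdata = deflater.compress(data) + deflater.flush()
    bsize = 18 + len(cdata) + 8 - 1  # total block size minus one
    header = struct.pack("<4BI2BH2BHH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF,
                         6, ord("B"), ord("C"), 2, bsize)
    return header + cdata + struct.pack("<II", zlib.crc32(data), len(data))


def read_block(stream: bytes, coffset: int) -> bytes:
    """Decompress only the block that starts at compressed offset coffset."""
    (bsize,) = struct.unpack_from("<H", stream, coffset + 16)
    cdata = stream[coffset + 18 : coffset + bsize + 1 - 8]  # strip header and trailer
    return zlib.decompress(cdata, -15)


def fetch_line(stream: bytes, voffset: int) -> bytes:
    """Return the line that begins at the given virtual offset."""
    coffset, uoffset = voffset >> 16, voffset & 0xFFFF
    block = read_block(stream, coffset)
    return block[uoffset : block.index(b"\n", uoffset) + 1]


records = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t50\tG\n", b"chr2\t90\tT\n"]
stream = b""
index = {}  # (chromosome, position) -> virtual offset; toy index for illustration
for i in range(0, len(records), 2):  # two records per block
    coffset, uoffset = len(stream), 0
    for rec in records[i : i + 2]:
        chrom, position = rec.split(b"\t")[:2]
        index[(chrom.decode(), int(position))] = (coffset << 16) | uoffset
        uoffset += len(rec)
    stream += make_block(b"".join(records[i : i + 2]))

# Only the second block is decompressed to answer this query.
print(fetch_line(stream, index[("chr2", 90)]))
```

Real indexes store virtual offsets for bins or intervals of genomic coordinates rather than for every record, but the seek-then-inflate step works the same way.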

HISTORY

bgzip is distributed with HTSlib, the library that underlies samtools and bcftools. BGZF was created because genomic data files such as BAM and VCF, which are usually sorted by chromosomal coordinate, had grown too large to decompress in full for every query. Standard gzip offers no random access, so the block format and the bgzip utility were developed to allow fast lookups of specific regions, which bioinformatics workflows depend on.

SEE ALSO

gzip(1), gunzip(1), tabix(1), samtools(1)
