bgzip

Compress genomic data files with indexing

SYNOPSIS

bgzip [-cdfghikrt] [-@ threads] [-b offset] [-I index_name] [-l level] [-o file] [-s size] [file ...]

-@ INT
    Number of compression threads to use [1]

-b INT
    Decompress to standard output starting at uncompressed offset INT (0-based); requires a .gzi index and implies -c and -d

-c
    Write to standard output and keep the original file unchanged

-d
    Decompress

-f
    Overwrite the output file if it exists

-g
    Use an existing index to recompress a file with matching block offsets

-h
    Display a help message

-i
    Compress and create a BGZF index (file.gz.gzi by default)

-I FILE
    Name of the BGZF index file [file.gz.gzi]

-k
    Keep the input file instead of deleting it

-l INT
    Compression level: 0 (uncompressed) to 9, or -1 for the default [-1]

-o FILE
    Write output to FILE and keep the original file unchanged

-r
    (Re)index an existing compressed file

-s INT
    Decompress INT bytes of uncompressed data; implies -c

-t
    Test the integrity of a compressed file

DESCRIPTION

bgzip compresses and decompresses files in BGZF (Blocked GNU Zip Format), a gzip-compatible format that supports fast random access.

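Unlike standard gzip, BGZF splits the data into independent blocks, each holding at most 64 KiB of uncompressed data. A position in the file is addressed by a virtual offset, which combines the file offset of a compressed block with an offset into that block's uncompressed data. This enables indexed random access via tools like tabix, which is crucial for large genomic files such as VCF and BED. BAM uses BGZF internally, whereas CRAM has its own container format and is not BGZF.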
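The virtual offset can be made concrete with a few lines of Python. This sketch assumes the layout given in the SAM/BAM specification, where the upper 48 bits hold the byte offset of the compressed block and the lower 16 bits hold the position inside that block's uncompressed data. The function names are illustrative and not part of any library.

```python
def make_virtual_offset(coffset: int, uoffset: int) -> int:
    """Pack a block's file offset and an in-block offset into one 64-bit value."""
    if not 0 <= coffset < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("in-block offset must fit in 16 bits")
    return (coffset << 16) | uoffset


def split_virtual_offset(voffset: int) -> tuple[int, int]:
    """Recover (compressed block offset, offset within the uncompressed block)."""
    return voffset >> 16, voffset & 0xFFFF


# A record that starts 513 bytes into the block stored at file offset 1,000,000.
voffset = make_virtual_offset(coffset=1_000_000, uoffset=513)
print(voffset, split_virtual_offset(voffset))
```

Because the block offset occupies the high bits, virtual offsets sort in the same order as positions in the uncompressed data, which is what lets an index store them as plain integers.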

Running bgzip example.vcf creates example.vcf.gz in BGZF format and, as with gzip, removes the input file. Decompressing with bgzip -d restores the original uncompressed file. bgzip also supports multi-threading for speedups on modern hardware, adjustable compression levels (0-9), and index generation.

BGZF files decompress with any gzip tool but offer random access only through BGZF-aware software. The format is widely used in bioinformatics pipelines (SAMtools/HTSlib) to handle terabyte-scale datasets efficiently without full decompression.
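To illustrate the gzip compatibility, the following Python sketch builds a BGZF-style stream by hand and reads it back with the standard gzip module. It assumes the block layout from the SAM/BAM specification: each block is a gzip member whose extra field carries a 'BC' subfield with the block size, and the stream ends with an empty block. bgzf_block and bgzf_compress are hypothetical helper names, and real bgzip output may differ in details such as the compression level.

```python
import gzip
import struct
import zlib

MAX_BLOCK_INPUT = 0xFF00  # per-block input cap used in this sketch (just under 64 KiB)


def bgzf_block(data: bytes, level: int = 6) -> bytes:
    """Wrap `data` in one gzip member that carries the BGZF 'BC' extra subfield."""
    deflater = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    cdata = deflater.compress(data) + deflater.flush()
    block_size = 12 + 6 + len(cdata) + 8  # gzip header + extra field + payload + CRC32/ISIZE
    if block_size > 0x10000:
        raise ValueError("block too large for the 16-bit BSIZE field")
    # ID1, ID2, CM=8 (deflate), FLG=4 (FEXTRA), MTIME=0, XFL=0, OS=255 (unknown), XLEN=6
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Subfield 'B','C' with a 2-byte payload holding the total block size minus one
    extra = struct.pack("<2BHH", ord("B"), ord("C"), 2, block_size - 1)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer


def bgzf_compress(data: bytes) -> bytes:
    """Split `data` into blocks and append the empty end-of-file block."""
    blocks = [bgzf_block(data[i:i + MAX_BLOCK_INPUT])
              for i in range(0, len(data), MAX_BLOCK_INPUT)]
    return b"".join(blocks) + bgzf_block(b"")


payload = b"chr1\t12345\t.\tA\tG\n" * 20000
stream = bgzf_compress(payload)
assert gzip.decompress(stream) == payload  # an ordinary gzip reader accepts the stream
```

The gzip reader simply skips the extra field and treats the blocks as concatenated members, which is why the compatibility runs in one direction only.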

Key advantages: seek time independent of file size, parallel compression/decompression, and compatibility with gzip ecosystems.
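The reason seek time does not grow with file size can be sketched in Python: with a table of block offsets, a reader jumps straight to the block it needs and inflates only that block. In this illustration plain gzip members stand in for BGZF blocks, and the list of offset pairs is a hypothetical in-memory analogue of an on-disk block index; compress_blocks and read_at are not part of any library.

```python
import bisect
import gzip
import zlib

BLOCK_INPUT = 0xFF00  # uncompressed bytes per block in this sketch


def compress_blocks(data: bytes):
    """Compress `data` block by block; return the stream plus a block index.

    The index holds one (compressed_offset, uncompressed_offset) pair per block.
    """
    stream, index = bytearray(), []
    for start in range(0, len(data), BLOCK_INPUT):
        index.append((len(stream), start))
        stream += gzip.compress(data[start:start + BLOCK_INPUT])
    return bytes(stream), index


def read_at(stream: bytes, index, offset: int, size: int) -> bytes:
    """Return up to `size` bytes from uncompressed position `offset`,
    inflating only the blocks that overlap the request."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    starts = [ustart for _, ustart in index]
    i = bisect.bisect_right(starts, offset) - 1  # block containing `offset`
    out = bytearray()
    while 0 <= i < len(index) and len(out) < size:
        coffset, ustart = index[i]
        cend = index[i + 1][0] if i + 1 < len(index) else len(stream)
        block = zlib.decompressobj(wbits=31).decompress(stream[coffset:cend])
        skip = max(offset - ustart, 0)  # only the first block is entered mid-way
        out += block[skip:skip + size - len(out)]
        i += 1
    return bytes(out)


data = b"".join(b"line %07d\n" % n for n in range(50000))
stream, index = compress_blocks(data)
print(read_at(stream, index, 13 * 40000, 13))
```

The lookup costs one binary search plus the inflation of at most a few 64 KiB blocks, regardless of how large the compressed stream is.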

CAVEATS

BGZF output uses the '.gz' extension and is a valid multi-member gzip file, so plain gzip can decompress it, but random access requires BGZF-aware tools like tabix. The reverse does not hold: a file compressed with plain gzip is not BGZF and cannot be indexed until it is recompressed with bgzip. As with gzip, the input file is replaced by the compressed output unless -c is given.
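Since both kinds of file share the '.gz' extension, it can be useful to check which one is at hand. The Python sketch below inspects the first bytes of a file for the 'BC' extra subfield that the SAM/BAM specification requires in every BGZF block. is_bgzf is a hypothetical helper, and the check looks at the first block header only.

```python
import gzip
import struct

# The 28-byte empty block that terminates a BGZF file, as given in the SAM/BAM specification.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def is_bgzf(head: bytes) -> bool:
    """Report whether `head` (the first bytes of a file) starts with a BGZF block header."""
    if len(head) < 12 or head[:3] != b"\x1f\x8b\x08":
        return False  # not a gzip/deflate stream at all
    if not head[3] & 0x04:
        return False  # FEXTRA flag unset: ordinary gzip
    (xlen,) = struct.unpack_from("<H", head, 10)
    pos, end = 12, 12 + xlen
    # Walk the extra subfields looking for 'B','C' with a 2-byte payload.
    while pos + 4 <= min(end, len(head)):
        si1, si2, slen = struct.unpack_from("<BBH", head, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


print(is_bgzf(BGZF_EOF))                   # True
print(is_bgzf(gzip.compress(b"ACGT\n")))   # False
```

In practice the argument would be the first few dozen bytes read from the file in binary mode.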

EXAMPLES

bgzip sample.vcf
Compresses to sample.vcf.gz (BGZF) and removes sample.vcf. BAM files are already BGZF-compressed and do not need this step.

bgzip -c -@ 4 big.vcf > big.vcf.gz
Compress to stdout using 4 threads and redirect the result to big.vcf.gz; big.vcf is kept.

bgzip -i sample.vcf
Compress to sample.vcf.gz and create the BGZF index sample.vcf.gz.gzi. The index is written during compression, not during decompression.

bgzip -d --force file.gz
Decompress, overwriting if needed.

INDEXING

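After compressing with bgzip, run tabix -p vcf file.vcf.gz to build a coordinate index (file.vcf.gz.tbi). The index maps genomic coordinates to BGZF virtual offsets, so a region query such as tabix file.vcf.gz chr1:1000-2000 reads only the blocks that overlap the region.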

HISTORY

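The BGZF library was originally implemented by Bob Handsaker and later modified by Heng Li, and bgzip was distributed alongside tabix and SAMtools for efficient handling of BAM and other indexed files. It later moved into HTSlib (the successor library), whose 1.0 release appeared in 2014, and is now standard in bioinformatics for indexed compression. Multi-threading support (-@) was added in HTSlib 1.4 (2017).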