bgzip
Compress genomic data files with indexing
SYNOPSIS
bgzip [options] [file...]
PARAMETERS
-c, --stdout
Write output to standard output and leave the original files unchanged.
-d, --decompress
Decompress files.
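-i, --index
Create a BGZF index (.gzi) while compressing. Unless -I is given, the index is written to the compressed file's name with .gzi appended. This is not a tabix (.tbi) index; use tabix to build one for coordinate-sorted data.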
-I file, --index-name file
Specify the filename for the index (used with -i).
-f, --force
Overwrite existing output files without prompting.
-r, --reindex
Rebuild the BGZF index (.gzi) for an existing compressed file.
-s num, --size num
Decompress num bytes (uncompressed size) to standard output. Implies -c; typically combined with -b.
-t, --test
Test integrity of compressed file.
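-b offset, --offset offset
Decompress to standard output starting at the given 0-based offset in the uncompressed data. Implies -c and -d. A .gzi index for the file is used to locate the starting block.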
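-l level, --compress-level level
Compression level from 0 to 9, or -1 for the compression library's default. Lower levels are faster but produce larger files.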
-o file, --output file
Write output to specified file.
-@ num, --threads num
Number of threads to use for compression and decompression (default 1).
-k, --keep
Keep (don't delete) input files during compression or decompression.
-h, --help
Display help message and exit.
DESCRIPTION
bgzip is a utility for compressing files into the Block Gzip Format (BGZF). Unlike standard gzip, which produces a single compressed stream that must be read from the start, bgzip divides the input into small, independent blocks (each under 64 KiB) and compresses every block as its own gzip member. This block structure allows efficient random access to any part of the compressed file without decompressing the whole file, which is crucial for large datasets. It is widely used in bioinformatics for compressing genomic data files such as VCF, BED, and GFF, and the same BGZF container underlies BAM and BCF, so tools like samtools and tabix can quickly retrieve specific data ranges. The resulting .gz files are valid gzip files and can be decompressed by gzip, but the reverse does not hold: files written by plain gzip lack the block structure and cannot be indexed for random access.
CAVEATS
bgzip is designed for block-compressed files that need random access, primarily genomic data. Its output carries a .gz extension and can be read by gzip-compatible tools, but those tools treat it as an ordinary sequential stream and gain nothing from the block structure. Conversely, files written by plain gzip cannot be indexed or queried by region. Because every block carries its own header and is compressed independently, the output is somewhat larger than gzip's, and sequential decompression of small files can be slightly less efficient. It is not intended as a general-purpose gzip replacement.
BGZF FORMAT
The Block Gzip Format (BGZF) is a variant of the gzip format that divides the compressed data into blocks. Each block is compressed independently and is a complete gzip member whose header carries an extra field recording the block's total compressed size. Because that size is known up front, a reader can hop from block to block without decompressing anything, and an index can point straight at the block holding the data of interest. This is particularly useful for large datasets where only small portions of the data need to be accessed at a time.
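The block layout can be sketched in a few lines of Python. This is a minimal illustration, not HTSlib's implementation: the helper names bgzf_block and iter_blocks are hypothetical, the field layout follows the BGZF section of the SAM specification, and the walker assumes the BC subfield is the only gzip extra field, which holds for blocks written by this sketch but is not guaranteed for arbitrary files.

```python
import gzip
import struct
import zlib


def bgzf_block(payload: bytes, level: int = 6) -> bytes:
    """Wrap payload in one BGZF-style block (hypothetical helper)."""
    if len(payload) > 0xFF00:
        raise ValueError("payload too large for a single block")
    # Raw DEFLATE (wbits=-15): the gzip header and trailer are written by hand.
    deflater = zlib.compressobj(level, zlib.DEFLATED, -15)
    cdata = deflater.compress(payload) + deflater.flush()
    # 12-byte fixed gzip header: magic bytes, CM=8, FLG=4 (FEXTRA), MTIME=0,
    # XFL=0, OS=255 (unknown), XLEN=6.
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # Extra subfield 'B','C' holding a 2-byte value: total block size minus 1.
    bsize = len(header) + 6 + len(cdata) + 8 - 1
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize)
    # Trailer: CRC32 and length of the uncompressed payload.
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def iter_blocks(data: bytes):
    """Yield (offset, length) per block by hopping from header to header."""
    pos = 0
    while pos < len(data):
        # BSIZE sits at byte 16 when BC is the only extra subfield (assumption).
        (bsize,) = struct.unpack_from("<H", data, pos + 16)
        yield pos, bsize + 1
        pos += bsize + 1


if __name__ == "__main__":
    chunks = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t50\tG\n"]
    # An empty block is appended as an end-of-file marker.
    stream = b"".join(bgzf_block(c) for c in chunks) + bgzf_block(b"")
    for offset, length in iter_blocks(stream):
        print(offset, length)
    # The concatenated blocks still form a valid multi-member gzip stream.
    assert gzip.decompress(stream) == b"".join(chunks)
```

The final assertion shows why BGZF output remains readable by ordinary gzip tools: each block is a complete gzip member, so the stream decompresses sequentially as usual, while iter_blocks finds block boundaries without inflating any data.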
RANDOM ACCESS AND INDEXING
The primary advantage of bgzip over standard gzip is its support for random access. Because the data is compressed in blocks, an accompanying index (often created by tabix or samtools for coordinate-sorted files) can store the file offsets at which data records or genomic regions begin. When a specific region is requested, the tool uses the index to jump directly to the relevant BGZF block, decompresses only the blocks that overlap the request, and returns the data. This significantly speeds up queries on large, sorted datasets, a common requirement in genomics.
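The index lookup can be illustrated with a simplified Python sketch. It makes several assumptions: ordinary gzip members stand in for true BGZF blocks, the in-memory index of (uncompressed start, compressed start, compressed length) tuples is invented for the example and is neither the .gzi nor the .tbi on-disk format, and the names compress_blocks and read_range are hypothetical. The idea is loosely comparable to what bgzip -b and -s do with a .gzi index.

```python
import bisect
import gzip

BLOCK_SIZE = 16  # deliberately tiny so the example spans several blocks


def compress_blocks(raw: bytes, block_size: int = BLOCK_SIZE):
    """Compress raw in independent blocks and build an offset index."""
    out = bytearray()
    index = []  # (uncompressed_start, compressed_start, compressed_length)
    for start in range(0, len(raw), block_size):
        member = gzip.compress(raw[start:start + block_size])
        index.append((start, len(out), len(member)))
        out += member
    return bytes(out), index


def read_range(compressed: bytes, index, offset: int, size: int) -> bytes:
    """Return size uncompressed bytes from offset, inflating only the blocks needed."""
    if not index or offset < 0 or size <= 0:
        return b""
    starts = [entry[0] for entry in index]
    # Last block whose uncompressed start is <= offset.
    i = bisect.bisect_right(starts, offset) - 1
    end = offset + size
    pos = offset
    result = bytearray()
    while i < len(index) and pos < end:
        ustart, cstart, clen = index[i]
        block = gzip.decompress(compressed[cstart:cstart + clen])
        result += block[pos - ustart:end - ustart]
        pos = ustart + len(block)
        i += 1
    return bytes(result)


if __name__ == "__main__":
    raw = b"".join(b"chr1\t%d\tA\n" % n for n in range(100, 120))
    compressed, index = compress_blocks(raw)
    # Only the blocks overlapping bytes 40..69 are decompressed.
    assert read_range(compressed, index, 40, 30) == raw[40:70]
```

A genomic index such as .tbi adds one more layer on top of this: it maps chromosome and coordinate ranges to block offsets, which is why the data must be coordinate-sorted for region queries.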
HISTORY
bgzip is distributed as part of HTSlib, a C library for high-throughput sequencing data formats that also underlies samtools and bcftools. Its creation was driven by the need for efficient random access to increasingly large genomic data files (such as BAM and VCF) that are usually sorted by chromosomal coordinate. Traditional gzip files do not support this, which led to the development of the Block Gzip Format (BGZF) and of the bgzip utility to implement it. Together they enable fast lookups of specific regions without decompressing the entire file, which is critical for bioinformatics workflows.