bgzip
Compress genomic data files with indexing
SYNOPSIS
bgzip [options] [file ...]
PARAMETERS
-c, --stdout
Write output on standard output; keep original files unchanged.
-d, --decompress, --uncompress
Decompress.
-f, --force
Force overwrite of the output file, and compress even if the input file has multiple links.
-h, --help
Give this help.
-i, --index
Index the output file.
-l, --level
Compression level: 0-9, default 6.
-n, --number
Process
-q, --quiet
Suppress all warnings.
-r, --replace
Replace the original file with the compressed/uncompressed file. Original file may be automatically deleted.
-s, --shared-input
Share the input file descriptor when using -n.
-t, --test
Test compressed file integrity.
-v, --verbose
Give more verbose output.
-@, --threads
Use the specified number of threads.
-z, --compress
Compress.
DESCRIPTION
bgzip is a command-line utility for compressing and decompressing files in BGZF (Blocked GNU Zip Format). BGZF is a variant of gzip designed to allow efficient random access within the compressed file. This is crucial for applications such as genomics, where rapid retrieval of specific sections of data is essential. Unlike standard gzip, which compresses its input as a single stream, BGZF divides the input into independent blocks, each compressed as a separate gzip member. The result is still a valid gzip file, so standard gzip tools can decompress it.
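The blocked layout described above can be illustrated with Python's standard library alone. This is a minimal sketch of the idea, not bgzip's actual implementation: it writes ordinary gzip members without the extra header field that real BGZF blocks carry, and the sample records and the `read_block` helper are made up for the illustration.

```python
import gzip
import zlib

# Three chunks of input, each compressed as its own gzip member ("block").
chunks = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t50\tG\n"]
blocks = [gzip.compress(chunk) for chunk in chunks]

# Record the byte offset at which each block starts in the output stream.
offsets = []
position = 0
for block in blocks:
    offsets.append(position)
    position += len(block)
stream = b"".join(blocks)

# The concatenation of the blocks is still a valid gzip stream.
assert gzip.decompress(stream) == b"".join(chunks)


def read_block(data: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at `offset` (hypothetical helper)."""
    decoder = zlib.decompressobj(wbits=31)  # wbits=31: expect a gzip header and trailer
    # Decoding stops at the end of this member; later blocks are left untouched.
    return decoder.decompress(data[offset:])


# Jump straight to the second block without decompressing the first.
print(read_block(stream, offsets[1]))
```

Because every block is self-contained, a reader that knows a block's start offset never has to decompress the data in front of it. A single-stream gzip file cannot offer this.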
bgzip is often used in bioinformatics workflows to compress large data files such as VCF, BED, and FASTQ, so that they can be stored compactly and, where applicable, indexed for quick access. The companion tabix program requires a tab-delimited file (for example a VCF file) to be sorted by position and compressed with bgzip before it can create an index. BAM files already use BGZF compression internally and are indexed with samtools rather than tabix. bgzip is developed and maintained primarily for use in bioinformatics.
CAVEATS
While bgzip provides random access capabilities, it is still based on gzip and therefore does not achieve the compression ratios of more modern compression algorithms. Random access by genomic region also requires an index file, typically created with the companion `tabix` command.
FILE NAMING CONVENTIONS
By convention, files compressed with bgzip have the `.gz` extension. bgzip adds this extension by default but does not enforce it when writing to standard output, and many tools, including tabix and other bioinformatics programs, expect it on BGZF-compressed files.
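Since a `.gz` extension alone cannot distinguish a BGZF file from an ordinary gzip file, it can help to inspect the file header instead. The sketch below checks for the gzip extra subfield with identifier `BC` and a 2-byte payload, which the BGZF specification requires in every block. The function name `is_bgzf` is an assumption made for this example and is not part of bgzip or any library.

```python
import struct


def is_bgzf(path: str) -> bool:
    """Return True if the file starts with a gzip header carrying the BGZF 'BC' subfield."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Bytes 0-2: gzip magic numbers (0x1f, 0x8b) and compression method 8 (deflate).
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        # Byte 3: flags. Bit 0x04 (FEXTRA) means an extra field follows the fixed header.
        if not header[3] & 0x04:
            return False
        # Bytes 10-11: little-endian length of the extra field.
        (extra_length,) = struct.unpack("<H", header[10:12])
        extra = handle.read(extra_length)

    # Walk the subfields: 2 identifier bytes, a 2-byte length, then the payload.
    position = 0
    while position + 4 <= len(extra):
        subfield_id = extra[position:position + 2]
        (payload_length,) = struct.unpack("<H", extra[position + 2:position + 4])
        if subfield_id == b"BC" and payload_length == 2:
            return True
        position += 4 + payload_length
    return False
```

The check reads only the first block's header, so it is cheap even for very large files. It says nothing about whether the rest of the file is intact.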
RANDOM ACCESS
Random access by genomic region is only possible through an index. Typically the tabix command is used to create an index (a .tbi file) for a bgzip-compressed file. This index maps genomic coordinates to offsets within the compressed file, allowing rapid retrieval of the data for a specific genomic region. Without an index, a bgzip-compressed file can still be read from start to end like a regular gzip file.
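Indexes for BGZF files address a position with a "virtual file offset", which the BGZF specification defines as the compressed start offset of a block shifted left by 16 bits, combined with the position inside that block's uncompressed data. The arithmetic can be sketched as follows. The function names are chosen for this example only and do not come from bgzip or tabix.

```python
from typing import Tuple


def make_virtual_offset(block_start: int, within_block: int) -> int:
    """Pack a compressed block offset and an in-block offset into one 64-bit value."""
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block_start must fit in 48 bits")
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within_block must fit in 16 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset: int) -> Tuple[int, int]:
    """Recover (block_start, within_block) from a virtual file offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Example: byte 5 of the uncompressed data in the block that starts at compressed byte 100.
voffset = make_virtual_offset(100, 5)
print(voffset, split_virtual_offset(voffset))
```

A reader seeks to `block_start` in the compressed file, decompresses that single block, and skips `within_block` bytes of its output. The 16-bit in-block field is sufficient because a BGZF block holds at most 64 KiB of uncompressed data.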
HISTORY
bgzip was originally developed as part of the SAMtools project, a suite of tools for interacting with and processing data in the Sequence Alignment/Map (SAM) format. Its main purpose was to provide a compression format suitable for indexed access within BAM files. The BGZF format, on which bgzip is based, was designed with this specific use case in mind. Over time, bgzip has become a standard tool within the bioinformatics community for compressing and decompressing large genomic datasets.