bgzip

Block-compress files in BGZF format so they can be indexed for random access

SYNOPSIS

bgzip [options] [file ...]

PARAMETERS

-c, --stdout
    Write output on standard output; keep original files unchanged.

-d, --decompress
    Decompress the input instead of compressing it.

-f, --force
    Overwrite existing output files without asking.

-h, --help
    Give this help.

-i, --index
    Compress and also create a BGZF index (a .gzi file) for the output.

-l, --compress-level INT
    Compression level: 0 to 9, or -1 for the compression library's default (used when the option is omitted).

-r, --reindex
    (Re)index an existing BGZF-compressed file, writing a .gzi index.

-s, --size INT
    Decompress INT bytes (uncompressed size) to standard output; normally combined with -b, --offset INT, which sets the uncompressed start offset.

-t, --test
    Test compressed file integrity.

-@, --threads INT
    Number of threads to use (default 1).
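
The options above are usually combined. The following is a minimal sketch rather than a definitive recipe: the file name numbers.txt and the scratch directory are hypothetical, and the bgzip commands are skipped if bgzip is not installed.

```shell
#!/bin/sh
# Hypothetical scratch data; any text file would do.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100000 > numbers.txt

if command -v bgzip >/dev/null 2>&1; then
    # -c writes to standard output, so numbers.txt is left in place.
    bgzip -c numbers.txt > numbers.txt.gz

    # -t checks the integrity of the compressed file.
    bgzip -t numbers.txt.gz && echo "integrity check passed"

    # -d -c decompresses to standard output; compare with the original.
    bgzip -dc numbers.txt.gz | cmp - numbers.txt && echo "round trip matches"

    # Without -c the input is replaced by numbers.txt.gz.
    # -f overwrites the existing .gz, -@ 4 uses four threads, -l 9 is maximum compression.
    bgzip -f -@ 4 -l 9 numbers.txt
else
    echo "bgzip not installed; skipping" >&2
fi
```

Writing with -c is the safer choice in scripts, because the input file is never removed.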

DESCRIPTION

bgzip compresses and decompresses files in BGZF (Blocked GNU Zip Format), a variant of gzip designed for efficient random access within the compressed file. This matters in applications such as genomics, where specific sections of very large files must be retrieved quickly. Unlike standard gzip, which normally writes one continuous compressed stream, BGZF splits the input into independent blocks of at most 64 KiB, each stored as a complete gzip member. The result is still a valid gzip file, so gzip and gunzip can decompress it, but a reader that knows where a block starts can begin decompressing there without reading the preceding data.
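
The block idea can be illustrated with plain gzip alone. This is only a sketch of the principle, not of bgzip's implementation: it builds two independent gzip members by hand, whereas real BGZF blocks also record their own size in a gzip extra field. The file names are hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Compress two pieces of data independently, giving two gzip members.
printf 'first block\n'  | gzip -c > block1.gz
printf 'second block\n' | gzip -c > block2.gz

# Concatenate the members into one file, mimicking a blocked layout.
cat block1.gz block2.gz > blocks.gz

# A gzip reader decompresses all members in sequence:
# joined.txt holds "first block" followed by "second block".
gzip -dc blocks.gz > joined.txt
cat joined.txt

# Random access: skip straight to the byte where the second member
# starts and decompress from there, without touching the first member.
offset=$(wc -c < block1.gz)
tail -c +$((offset + 1)) blocks.gz | gzip -dc > second.txt
cat second.txt
```

An index for such a file only needs to record where each block starts.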

bgzip is often used in bioinformatics workflows for large text files such as VCF, BED, GFF, and FASTQ. The companion tabix program indexes position-sorted, tab-delimited files (for example VCF or BED) and requires them to be compressed with bgzip first. BAM files already use BGZF compression internally; they are written and indexed by samtools rather than by bgzip and tabix. bgzip is developed and maintained as part of HTSlib.
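
A typical sort, compress, and index sequence might look like the sketch below. The BED records and file names are made up for illustration, and the commands are skipped unless both bgzip and tabix are installed.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical, deliberately unsorted BED records: chrom, start, end, name.
printf 'chr2\t500\t600\tfeatC\nchr1\t100\t200\tfeatA\nchr1\t150\t300\tfeatB\n' > regions.bed

if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
    # tabix needs the input sorted by sequence name, then by start position.
    sort -k1,1 -k2,2n regions.bed | bgzip -c > regions.bed.gz

    # Build the coordinate index; this writes regions.bed.gz.tbi.
    tabix -p bed regions.bed.gz

    # Retrieve only the records that overlap chr1:120-180.
    tabix regions.bed.gz chr1:120-180
else
    echo "bgzip or tabix not installed; skipping" >&2
fi
```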

CAVEATS

Because BGZF compresses each block independently with the same DEFLATE algorithm as gzip, its output is slightly larger than plain gzip output and does not reach the compression ratios of more modern algorithms. Compression alone does not give random access: an index is also needed, either one created with `tabix` for coordinate queries or a .gzi index written by `bgzip -i` for byte-offset access.

FILE NAMING CONVENTIONS

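By convention, files compressed with bgzip have the `.gz` extension. bgzip does not enforce this when writing to standard output, but tabix and other bioinformatics programs commonly expect names such as `.vcf.gz`. The extension alone does not distinguish BGZF from ordinary gzip: a file compressed with plain gzip has the same extension but cannot be indexed by tabix.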

RANDOM ACCESS

Region-based random access requires an index. Typically tabix is used to create one (a .tbi file) for a bgzip-compressed file. The index maps genomic coordinates to offsets within the compressed file, so a reader can jump directly to the blocks that hold a requested region. Without an index the file can still be read, but only sequentially from the start, like an ordinary gzip file.
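
bgzip can also do offset-based retrieval on its own, using the .gzi index written by -i. The sketch below assumes an HTSlib build of bgzip that provides -b/--offset and -s/--size; the file names are hypothetical, and the commands are skipped if bgzip is not installed.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical test data, plus a copy kept for comparison.
seq 1 100000 > numbers.txt
cp numbers.txt original.txt

if command -v bgzip >/dev/null 2>&1; then
    # Compress and index: writes numbers.txt.gz and numbers.txt.gz.gzi.
    bgzip -i numbers.txt

    # Extract 20 bytes starting at uncompressed offset 1000 (0-based).
    bgzip -b 1000 -s 20 numbers.txt.gz > slice.txt

    # The same bytes taken from the uncompressed copy, for comparison.
    tail -c +1001 original.txt | head -c 20 > expected.txt
    cmp slice.txt expected.txt && echo "slice matches"
else
    echo "bgzip not installed; skipping" >&2
fi
```

Offsets here refer to positions in the uncompressed data, which is why no knowledge of the block layout is needed on the command line.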

HISTORY

bgzip originated in the SAMtools project, a suite of tools for processing data in the Sequence Alignment/Map (SAM) format. The BGZF format on which bgzip is based was designed to give BAM files a compression scheme that supports indexed access. bgzip is now distributed with HTSlib and has become a standard tool in the bioinformatics community for compressing and decompressing large genomic datasets.

SEE ALSO

gzip(1), gunzip(1), tabix(1)
