makeblastdb

Create BLAST sequence databases

SYNOPSIS

makeblastdb -in <input_file> -out <database_name> -dbtype <nucl|prot> [options]

-in
    Specifies the input sequence file to be formatted into a database, typically in FASTA format. If omitted, sequences are read from standard input.

-out
    Defines the base name for the output database files. Several files with this prefix and different extensions are created. If omitted, the input file name is used as the base name.

-dbtype
    Specifies the type of sequences in the input file: 'nucl' for nucleotide sequences or 'prot' for protein sequences. This parameter is required.

-title
    Assigns a descriptive title to the database, which is displayed by BLAST programs when searching.

-parse_seqids
    Instructs makeblastdb to parse the sequence IDs in the FASTA definition lines and index them, so that individual sequences can later be retrieved by ID (for example with blastdbcmd -entry).

-hash_index
    Creates an index of sequence hash values for the database, which can speed up lookups of sequences that exactly match a given sequence.

-long_seqids
    Allows sequence IDs longer than the default 32 characters to be used. Useful for databases with very descriptive sequence identifiers.

-taxid_map
    Provides a text file that maps sequence IDs to NCBI taxonomy IDs, one "<sequence ID> <taxonomy ID>" pair per line, enabling taxonomic filtering in subsequent BLAST searches. Requires -parse_seqids.

-logfile
    Directs the command's output and error messages to a specified log file.

-max_file_size
    Sets the maximum size of individual database files (for example, 1GB). A database that exceeds this size is split into multiple volumes, which helps with very large databases on file systems that limit file size.
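
The options above can be combined in a single call. The following is a minimal sketch, not a definitive recipe: the file names (proteins.fa, taxid_map.txt, proteins_db), the sequence IDs, and the sequences are hypothetical, and the taxonomy IDs 9606 (human) and 10090 (mouse) are used only as examples. The makeblastdb call is skipped if the program is not on the PATH.

```shell
# Scratch directory so the example does not touch existing files.
workdir=$(mktemp -d)

# Hypothetical protein FASTA with simple local identifiers.
cat > "$workdir/proteins.fa" <<'EOF'
>protA hypothetical protein A
MKTAYIAKQRQISFVKSHFSRQ
>protB hypothetical protein B
MSDNGELEDKPPAPPVRMSSTI
EOF

# Mapping file for -taxid_map: one "<sequence ID> <taxonomy ID>" pair per line.
cat > "$workdir/taxid_map.txt" <<'EOF'
protA 9606
protB 10090
EOF

# Check that every FASTA ID has an entry in the mapping file.
missing=0
for id in $(sed -n 's/^>\([^ ]*\).*/\1/p' "$workdir/proteins.fa"); do
    if ! grep -q "^$id " "$workdir/taxid_map.txt"; then
        echo "no taxid for $id" >&2
        missing=$((missing + 1))
    fi
done
echo "IDs without a taxid: $missing"

# Build the database only if makeblastdb is installed.
if command -v makeblastdb >/dev/null 2>&1; then
    makeblastdb -in "$workdir/proteins.fa" -out "$workdir/proteins_db" \
        -dbtype prot -title "Example proteins" -parse_seqids \
        -taxid_map "$workdir/taxid_map.txt" -logfile "$workdir/makeblastdb.log"
else
    echo "makeblastdb not found; skipping database creation"
fi
```

Checking the mapping file against the FASTA IDs before building is a design choice of this sketch; it makes a missing entry visible before the database is created.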

DESCRIPTION

makeblastdb is part of the NCBI BLAST (Basic Local Alignment Search Tool) suite. It converts raw sequence data, typically in FASTA format, into indexed and formatted databases that BLAST programs such as blastn, blastp, and tblastn can search quickly.

Without a database created by makeblastdb, BLAST programs cannot efficiently search large collections of sequences. makeblastdb pre-processes the sequences and writes header, index, and sequence files (.nhr, .nin, and .nsq for nucleotide databases; .phr, .pin, and .psq for protein databases) that allow quick retrieval and comparison of sequences. This pre-computation reduces the time required for subsequent sequence similarity searches.
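
As a minimal sketch of this workflow, the snippet below writes a two-record FASTA file and formats it as a nucleotide database. The file name demo.fa, the database name demo_db, and the sequences are made up for illustration. The makeblastdb call is skipped if the program is not on the PATH.

```shell
# Scratch directory so the example does not touch existing files.
workdir=$(mktemp -d)

# Hypothetical two-record nucleotide FASTA input.
cat > "$workdir/demo.fa" <<'EOF'
>seq1 example sequence one
ACGTACGTACGTACGTACGT
>seq2 example sequence two
TTGACCGTAGGCTAACGTTA
EOF

# Count FASTA records (lines starting with '>').
n_records=$(grep -c '^>' "$workdir/demo.fa")
echo "records: $n_records"

# Build the database only if makeblastdb is installed.
if command -v makeblastdb >/dev/null 2>&1; then
    makeblastdb -in "$workdir/demo.fa" -out "$workdir/demo_db" \
        -dbtype nucl -title "Demo database"
    # List the database files that were written next to the input.
    ls "$workdir"
else
    echo "makeblastdb not found; skipping database creation"
fi
```

A BLAST program can then be pointed at the database by its base name, for example blastn -db with the path to demo_db and -query with a query FASTA file.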

CAVEATS

Input Format: The input file must be in a format supported by BLAST, primarily FASTA. Incorrectly formatted files can lead to errors or incomplete databases.
Disk Space: Creating databases, especially from large genomic or proteomic datasets, requires substantial disk space. The database files are written in addition to the input FASTA file, and optional indexes add to the total, so plan for both.
Memory Usage: For very large input files, makeblastdb can consume significant RAM during the indexing process.
dbtype is Critical: If -dbtype (nucleotide vs. protein) does not match the actual sequence content, makeblastdb may report errors or produce a database that is unusable or gives incorrect BLAST search results.
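
One way to guard against a -dbtype mismatch is to inspect the input before building. The sketch below is a heuristic, not part of BLAST: guess_dbtype is a hypothetical helper that prints "nucl" when every sequence line contains only the letters A, C, G, T, U, or N, and "prot" otherwise. It ignores the other IUPAC ambiguity codes, assumes Unix line endings, and can misclassify a short peptide that happens to use only those letters.

```shell
# guess_dbtype FILE
# Prints "nucl" if every sequence line uses only A, C, G, T, U, or N
# (case-insensitive); otherwise prints "prot". Heuristic only.
guess_dbtype() {
    awk '
        /^>/ { next }                 # skip definition lines
        /^[[:space:]]*$/ { next }     # skip blank lines
        toupper($0) !~ /^[ACGTUN]+$/ { nonnuc = 1 }
        END { print (nonnuc ? "prot" : "nucl") }
    ' "$1"
}

# Usage with two tiny, made-up inputs.
tmpdir=$(mktemp -d)
printf '>n1\nACGTNNACGT\n' > "$tmpdir/nt.fa"
printf '>p1\nMKTAYIAKQR\n' > "$tmpdir/aa.fa"
guess_dbtype "$tmpdir/nt.fa"   # prints: nucl
guess_dbtype "$tmpdir/aa.fa"   # prints: prot
```

The result could be passed on as -dbtype "$(guess_dbtype input.fa)", although stating the type explicitly remains the safer choice when it is known.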

DATABASE FILES CREATED

When you run makeblastdb, it generates several files whose names combine the -out base name with an extension determined by -dbtype. For nucleotide databases, the core files are .nhr (header), .nin (index), and .nsq (sequence). For protein databases, these are .phr, .pin, and .psq respectively. Depending on the options used and the BLAST version, additional files may be written alongside them. Together these files form the searchable BLAST database.
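
The naming scheme described above can be checked with a small helper. This is a sketch under stated assumptions: check_db_files is a hypothetical function, not a BLAST tool, and it only looks for the three core files of a single-volume database. Databases split into multiple volumes use different file names and are not handled here.

```shell
# check_db_files BASE DBTYPE
# Reports whether the three core files (header, index, sequence) exist
# for a single-volume database with the given base name.
# Returns 0 if all are present, 1 if any is missing, 2 on a bad DBTYPE.
check_db_files() {
    base=$1
    case $2 in
        nucl) prefix=n ;;
        prot) prefix=p ;;
        *) echo "dbtype must be nucl or prot" >&2; return 2 ;;
    esac
    status=0
    for ext in hr in sq; do
        file="$base.$prefix$ext"
        if [ -f "$file" ]; then
            echo "found   $file"
        else
            echo "missing $file"
            status=1
        fi
    done
    return "$status"
}
```

For example, check_db_files demo_db nucl looks for demo_db.nhr, demo_db.nin, and demo_db.nsq in the current directory.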

PERFORMANCE CONSIDERATIONS

The time taken to create a database depends on the input file size, the system hardware (CPU, RAM, I/O speed), and the number of sequence IDs. For very large datasets, fast storage such as SSDs can noticeably shorten database creation.

HISTORY

The makeblastdb utility is part of the NCBI BLAST suite, which has been a cornerstone of bioinformatics since its initial publication in 1990. Before makeblastdb, a similar utility called formatdb was used to create BLAST databases. makeblastdb was introduced as a more robust and feature-rich replacement for formatdb, offering better handling of large datasets, improved indexing, and support for newer database versions.