blastdbcmd
Retrieve sequences and related information from a BLAST database
SYNOPSIS
blastdbcmd -db database_name [-options] [-entry entry_identifier | -entry_batch filename]
blastdbcmd -db database_name -info
PARAMETERS
-db
Specifies the name of the BLAST database to query, e.g., nr or nt.
-entry
Retrieves the sequences that match one or more identifiers (accessions, GIs, etc.). Multiple identifiers are separated by commas; the value all selects every sequence in the database.
-entry_batch
Specifies a file containing a list of sequence identifiers (one per line) to retrieve.
-out
Writes the output to a specified file instead of standard output.
-outfmt
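Specifies the output format as a string of format specifiers. Common specifiers include %f (FASTA, the default), %s (sequence data without a definition line), %a (accession), %g (GI), %i (sequence ID), %t (sequence title), %l (sequence length), and %T (taxonomy ID). Specifiers can be combined, e.g., '%a %l'.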
-target_only
Restricts the definition line to the requested entry only. Without it, a record that merges several identical sequences lists every identifier and title associated with that record.
-range
Specifies a subsequence to retrieve as 1-based start-stop positions, e.g., 101-200.
-strand
For nucleotide sequences, specifies 'plus' (default) or 'minus' strand.
-mask_sequence_with
Outputs FASTA in which the regions masked by the given masking algorithm ID appear in lowercase letters, if masking information is available in the database.
-dbtype
Specifies the database type: 'prot' for protein, 'nucl' for nucleotide, or 'guess' (the default), which infers the type from the database files.
-parse_seqids
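An option of makeblastdb rather than of blastdbcmd. Building the database with it parses the sequence IDs in the FASTA definition lines and indexes them, which is what allows -entry and -entry_batch to look sequences up by those IDs.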
-info
Displays detailed information about the specified BLAST database, including statistics like number of sequences, total length, and creation date.
-version
Displays the blastdbcmd version information.
-help
Displays the help message for blastdbcmd.
-logfile
Writes program log information to the specified file.
-warning_logfile
Writes program warnings to the specified file.
-taxids
Restricts sequence retrieval to sequences associated with the specified NCBI Taxonomy IDs (comma-separated).
-taxidlist
Restricts sequence retrieval to sequences associated with NCBI Taxonomy IDs listed in the specified file (one per line).
-as_annotated
Retrieves sequences as they are annotated in the database, including any associated masked regions.
-get_dups
Retrieves all duplicate entries for the specified entry identifier.
-gilist
Restricts sequence retrieval to GIs listed in the specified file.
-seqidlist
Restricts sequence retrieval to sequence IDs listed in the specified file.
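As a minimal sketch of how several of these options combine, the helper below prints one region of one nucleotide sequence as FASTA. The function name fetch_region is an illustrative assumption and not part of BLAST+; the sketch also assumes that -range takes offsets written as start-stop.

```shell
# Hypothetical helper: print one region of one nucleotide sequence as FASTA.
# Usage: fetch_region DB ID START STOP [plus|minus]
fetch_region() {
    db=$1 id=$2 start=$3 stop=$4 strand=${5:-plus}
    # -range is given as 1-based offsets in the form start-stop
    blastdbcmd -db "$db" -entry "$id" \
        -range "${start}-${stop}" -strand "$strand" -outfmt '%f'
}
```

For example, fetch_region nt AC147927 101 200 minus would request positions 101 to 200 on the minus strand, assuming a local nt database that contains that accession.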
DESCRIPTION
blastdbcmd is a versatile command-line utility within the NCBI BLAST+ applications suite. It allows users to extract sequence data, definitions, accession numbers, and other associated information directly from BLAST databases (composed of .nhr, .nin, .nsq for nucleotide; .phr, .pin, .psq for protein). It's an essential tool for examining database contents, preparing subsets of sequences, or verifying database integrity without needing to run a full BLAST search. Users can specify entries by accession, GI, or other identifiers, apply filters, and format the output according to their needs.
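One way to examine database contents and prepare a subset, sketched under stated assumptions: list every accession with its length, then keep only the accessions at or above a minimum length. The function name list_long_entries is hypothetical, and the sketch assumes the %a (accession) and %l (length) format specifiers of -outfmt.

```shell
# Hypothetical helper: print accessions of sequences at least MIN_LENGTH long.
# Usage: list_long_entries DB MIN_LENGTH
list_long_entries() {
    db=$1 min_len=$2
    # One line per sequence: "<accession> <length>"
    blastdbcmd -db "$db" -entry all -outfmt '%a %l' |
        awk -v min="$min_len" '$2 >= min { print $1 }'
}
```

The resulting list has one identifier per line, so it can be saved to a file and passed back through -entry_batch to extract the subset.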
CAVEATS
blastdbcmd requires pre-built BLAST databases created with makeblastdb; it cannot query raw FASTA files directly. Retrieving many entries with one blastdbcmd call per sequence is slow, because every call pays the cost of starting the program and opening the database. For large retrieval tasks, put the identifiers in a file and use -entry_batch instead. Ensure the correct database type (nucl or prot) is specified or can be inferred.
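The batch approach can be sketched as follows. The helper writes its identifiers to a temporary file, one per line, and makes a single blastdbcmd call; the function name fetch_batch and the use of mktemp are assumptions made for illustration.

```shell
# Hypothetical helper: fetch many sequences with a single blastdbcmd call.
# Usage: fetch_batch DB OUT_FASTA ID [ID ...]
fetch_batch() {
    [ "$#" -ge 3 ] || { echo "usage: fetch_batch DB OUT ID..." >&2; return 2; }
    db=$1 out=$2
    shift 2
    ids_file=$(mktemp) || return 1
    # -entry_batch expects one identifier per line
    printf '%s\n' "$@" > "$ids_file"
    blastdbcmd -db "$db" -entry_batch "$ids_file" -out "$out"
    status=$?
    rm -f "$ids_file"
    return "$status"
}
```

A call such as fetch_batch nt subset.fasta ID1 ID2 ID3 (placeholder identifiers) would write all three records to subset.fasta in one pass.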
DATABASE TYPES
BLAST databases are created as either 'nucl' (nucleotide) or 'prot' (protein). blastdbcmd usually infers the type from the database files, but specifying -dbtype explicitly resolves ambiguity, for example when a nucleotide and a protein database share the same base name.
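A small sketch of passing the type explicitly: the helper validates the type before asking for the database summary. The function name db_info is hypothetical, and the sketch assumes that 'guess' is an accepted value of -dbtype alongside 'nucl' and 'prot'.

```shell
# Hypothetical helper: show database information for an explicit type.
# Usage: db_info DB [nucl|prot|guess]
db_info() {
    db=$1 type=${2:-guess}
    case $type in
        nucl|prot|guess) ;;
        *) echo "db_info: type must be nucl, prot or guess" >&2; return 2 ;;
    esac
    blastdbcmd -db "$db" -dbtype "$type" -info
}
```

With two databases that share the base name mydb (a placeholder), db_info mydb prot and db_info mydb nucl would each report on the intended one.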
IDENTIFIER HANDLING
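blastdbcmd can look up sequences by several kinds of identifiers, including NCBI GIs, accession numbers, and local IDs. Lookup by identifier depends on how the database was built: the -parse_seqids option belongs to makeblastdb, not blastdbcmd, and a database created without it generally cannot be searched by the IDs in its FASTA definition lines. Identifiers that contain characters special to the shell, such as the pipes in 'gnl|dbname|tag', must be quoted on the command line.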
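The build-then-lookup sequence can be sketched like this. Both function names are hypothetical, and the database type is fixed to nucl only to keep the example short.

```shell
# Hypothetical helper: build a nucleotide database with indexed sequence IDs.
# Usage: build_indexed_db FASTA_FILE DB_NAME
build_indexed_db() {
    fasta=$1 name=$2
    # -parse_seqids makes the IDs in the FASTA deflines searchable later
    makeblastdb -in "$fasta" -dbtype nucl -parse_seqids -out "$name"
}

# Hypothetical helper: retrieve one record by its identifier.
# Usage: fetch_by_id DB_NAME ID   (quote IDs containing '|')
fetch_by_id() {
    name=$1 id=$2
    blastdbcmd -db "$name" -entry "$id"
}
```

After build_indexed_db genome.fasta mygenome (placeholder file and database names), a call like fetch_by_id mygenome 'gnl|dbname|tag' would pass the quoted identifier through unchanged.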
HISTORY
blastdbcmd is a core component of the NCBI BLAST+ applications suite, which was developed to modernize and replace the legacy BLAST standalone utilities (like formatdb and blastall). The BLAST+ suite introduced a more modular command-line interface, improved performance, and added new features. blastdbcmd took over the task of retrieving sequences from BLAST databases, which the legacy suite handled with fastacmd.