blastx
Search translated nucleotide sequence against protein database
SYNOPSIS
blastx -query <file> -db <database> [-out <file>] [-evalue <value>] [-outfmt <format_spec>] [options...]
PARAMETERS
-query <file>
Specifies the input file containing the nucleotide query sequence(s) in FASTA or other supported formats.
-db <database>
Defines the target protein sequence database against which the query will be searched. This database must be pre-formatted using makeblastdb.
-out <file>
Specifies the file name for saving the BLAST results. If not provided, results are printed to standard output (stdout).
-evalue <value>
Sets the statistical significance threshold for reporting matches. Hits with an E-value greater than this threshold are not reported. A smaller E-value indicates a more significant match.
-outfmt <format_spec>
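Determines the format of the output. Common values include 0 (pairwise), 5 (XML), 6 (tabular), 7 (tabular with comments), and 11 (BLAST archive format). For the tabular and CSV formats (6, 7, and 10), a custom set of columns can be requested by quoting the format number followed by a space-separated list of field identifiers, for example "6 qseqid sseqid pident evalue bitscore".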
-num_threads <integer>
Specifies the number of CPU threads to use for the search, allowing for parallel execution and faster results on multi-core systems.
-max_target_seqs <integer>
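Limits the number of aligned database sequences (hits) kept per query sequence. The retained hits are generally the best-scoring ones, but the limit is applied during the search rather than as a final filter, so very small values may not return exactly the top hits by E-value.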
-html
Produces BLAST results in an HTML format, suitable for viewing in a web browser.
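As an illustration of how the options above combine, the following is a minimal sketch rather than a recommended configuration. The file names (transcripts.fasta, blastx_hits.tsv), the database name, and the chosen threshold and thread count are assumptions made up for this example; only flags documented in this page are used.

```shell
# Hypothetical file names; replace them with your own data.
QUERY=transcripts.fasta
DB=uniprot_sprot_db
OUT=blastx_hits.tsv

run_blastx() {
    # Tabular output with a custom column list, at most 10 hits per query,
    # and hits with an E-value above 1e-5 discarded.
    blastx -query "$QUERY" -db "$DB" -out "$OUT" \
        -evalue 1e-5 \
        -max_target_seqs 10 \
        -num_threads 4 \
        -outfmt "6 qseqid sseqid pident evalue bitscore qframe"
}

# Run the search only if blastx and the query file are actually available.
if command -v blastx >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    run_blastx
else
    echo "blastx or $QUERY not found; skipping the search" >&2
fi
```

The qframe column reports which of the six reading frames of the query produced each alignment, which is often useful when interpreting blastx hits.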
DESCRIPTION
blastx is part of the NCBI BLAST+ (Basic Local Alignment Search Tool) suite. It compares a nucleotide query sequence (e.g., DNA or RNA) against a protein sequence database. The query is conceptually translated in all six possible reading frames (three on the forward strand and three on the reverse-complement strand), and each translated sequence is then aligned against the specified protein database in a protein-protein search.
This approach is invaluable for tasks such as identifying potential protein-coding regions within a novel nucleotide sequence, finding homologous proteins when only genomic or transcriptomic DNA/RNA data is available, and performing gene prediction. The output typically provides detailed alignment information, including statistical significance (E-value), bit scores, and sequence identity, allowing users to infer the function and evolutionary relationships of their query sequence based on similarities to known proteins in the database.
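The six-frame translation described above can be sketched with a small shell function built on awk. This is a conceptual illustration only, not how blastx is implemented: the function name six_frames is hypothetical, the standard genetic code is assumed, the input is assumed to be upper-case A/C/G/T, and any codon containing another character is reported as X.

```shell
six_frames() {
    # $1 = DNA sequence using upper-case A, C, G, T.
    # Prints one "frame<TAB>protein" line for frames +1, +2, +3, -1, -2, -3.
    awk -v seq="$1" '
    function revcomp(s,    i, idx, out) {
        out = ""
        for (i = length(s); i >= 1; i--) {
            idx = index("ACGT", substr(s, i, 1))
            out = out (idx ? substr("TGCA", idx, 1) : "N")
        }
        return out
    }
    function translate(s, start,    i, a, b, c, prot) {
        prot = ""
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for (i = start; i + 2 <= length(s); i += 3) {
            a = index(bases, substr(s, i, 1)) - 1
            b = index(bases, substr(s, i + 1, 1)) - 1
            c = index(bases, substr(s, i + 2, 1)) - 1
            if (a < 0 || b < 0 || c < 0)
                prot = prot "X"
            else
                prot = prot substr(code, 16 * a + 4 * b + c + 1, 1)
        }
        return prot
    }
    BEGIN {
        # Standard genetic code, codons ordered by base in the order T, C, A, G.
        bases = "TCAG"
        code = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
        rc = revcomp(seq)
        for (f = 1; f <= 3; f++) print "+" f "\t" translate(seq, f)
        for (f = 1; f <= 3; f++) print "-" f "\t" translate(rc, f)
    }'
}

six_frames ATGGCCTAA
```

For the toy input ATGGCCTAA, frame +1 reads the codons ATG, GCC, and TAA, which translate to M, A, and a stop (*). The other five lines show what the same nucleotides encode when read with a different offset or on the opposite strand.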
CAVEATS
- blastx requires a pre-indexed protein database, which must be created using the makeblastdb command.
- Searching all six translated reading frames is computationally intensive, especially for long nucleotide queries or large databases, and can lead to long execution times.
- The accuracy of the search results is highly dependent on the quality of the query sequence and the comprehensiveness and quality of the protein database being searched.
- Frameshifts or sequencing errors within the query DNA can lead to incorrect or truncated amino acid translations, potentially causing true protein homologs to be missed or incorrectly identified.
OUTPUT INTERPRETATION
Understanding the output of blastx is crucial for deriving biological insights. Key metrics to pay attention to include:
- E-value (Expectation Value): The number of alignments with an equal or better score that one would expect to see by chance when searching a database of the same size. A lower E-value indicates a more statistically significant match.
- Bit Score: A normalized score that is independent of database size and provides a measure of alignment quality. Higher bit scores indicate better alignments.
- Percent Identity: The percentage of identical amino acids in the aligned region between the translated query and the subject protein sequence.
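The metrics above can be used to post-filter tabular results. The sketch below assumes the search was run with the default -outfmt 6 column order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function name filter_hits and the file name hits.tsv are hypothetical, and the cutoffs are examples rather than recommended values.

```shell
filter_hits() {
    # $1 = maximum E-value, $2 = minimum percent identity.
    # Reads default -outfmt 6 rows on stdin and prints
    # qseqid, sseqid, pident, evalue, bitscore for the rows that pass.
    awk -v max_e="$1" -v min_id="$2" '
    BEGIN { FS = OFS = "\t" }
    ($11 + 0) <= (max_e + 0) && ($3 + 0) >= (min_id + 0) {
        print $1, $2, $3, $11, $12
    }'
}

# Usage (hits.tsv is a hypothetical blastx result file):
#   filter_hits 1e-5 40 < hits.tsv
```

If a custom column list was passed to -outfmt, the field numbers ($3, $11, $12) must be adjusted to match that list.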
DATABASE CREATION
Before running blastx, a protein database must be prepared and formatted using the makeblastdb command from the NCBI BLAST+ suite. This command indexes the sequences, making them searchable by BLAST programs. For example, to create a searchable protein database from a FASTA file named uniprot_sprot.fasta, you would use the following command:
makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot_db
Once created, 'uniprot_sprot_db' can then be specified with the -db option when running blastx.
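The two steps can be chained so that the search runs only if the database was built successfully. This is a minimal sketch: the function name build_and_search, the query file name, and the output naming scheme (query name plus .blastx.tsv) are assumptions made for illustration.

```shell
build_and_search() {
    # $1 = protein FASTA file, $2 = database name, $3 = nucleotide query FASTA file
    # blastx is started only when makeblastdb exits successfully.
    makeblastdb -in "$1" -dbtype prot -out "$2" &&
        blastx -query "$3" -db "$2" -outfmt 6 -out "$3.blastx.tsv"
}

# Usage (query.fasta is a hypothetical nucleotide FASTA file):
#   build_and_search uniprot_sprot.fasta uniprot_sprot_db query.fasta
```

The database only needs to be built once; later searches against the same protein set can call blastx directly with the same -db value.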
HISTORY
The Basic Local Alignment Search Tool (BLAST) was first introduced in 1990 by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David Lipman. It revolutionized sequence similarity searching by providing a heuristic algorithm that significantly sped up the search process compared to earlier exact alignment algorithms like Smith-Waterman, while maintaining high sensitivity.
blastx is one of the long-standing core programs in the BLAST suite, designed to address the common challenge of identifying protein homologs and coding regions directly from nucleotide sequences. This was particularly important in the early days of genomics and transcriptomics. The original C-based toolkit was later rewritten in C++ as the modern NCBI BLAST+ suite, which offers improved performance, modularity, and enhanced features, including better support for multi-threading. blastx remains a widely used tool for gene prediction and functional annotation of newly sequenced genomes and transcripts.
SEE ALSO
blastn(1), blastp(1), tblastn(1), tblastx(1), makeblastdb(1), blastdbcmd(1)