blastp
Search protein databases for sequence similarity
TLDR
Align two or more sequences using blastp, with an E-value threshold of 1e-9, pairwise output format, output to screen
Align two or more sequences using blastp-fast
Align two or more sequences, custom tabular output format, output to file
Search a protein database using a protein query, running the search on 16 threads and keeping at most 10 aligned sequences
Search the remote non-redundant protein database using a protein query
Display brief usage (use -help for detailed help)
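The summaries above can be sketched as concrete invocations. This is a minimal sketch, not an authoritative listing: the file names (query.fa, subject.fa, output.tsv) and the database name my_protein_db are placeholders, the run helper is a hypothetical dry-run wrapper added so the script is safe to execute anywhere, and -subject and -task are blastp options not described in PARAMETERS below.

```shell
#!/bin/sh
# Dry-run by default: each command is printed, and executed only when DRY_RUN=0.
# All file and database names below are placeholders.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command, then run it only if DRY_RUN=0.
# The printed form drops shell quoting (see the -outfmt argument).
run() {
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

tldr_examples() {
    # Align two FASTA files, E-value threshold 1e-9, pairwise output to screen
    run blastp -query query.fa -subject subject.fa -evalue 1e-9
    # Same comparison using the blastp-fast task
    run blastp -task blastp-fast -query query.fa -subject subject.fa
    # Custom tabular output written to a file
    run blastp -query query.fa -subject subject.fa \
        -outfmt "6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident" \
        -out output.tsv
    # Database search on 16 threads, keeping at most 10 aligned sequences
    run blastp -query query.fa -db my_protein_db -num_threads 16 -max_target_seqs 10
    # Remote search against NCBI's nr database
    run blastp -query query.fa -db nr -remote
    # Brief usage message
    run blastp -h
}

tldr_examples
```

Running the script with DRY_RUN=0 executes the commands instead of only printing them; that requires BLAST+ on PATH and the placeholder files to exist.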
SYNOPSIS
blastp -query <input_query_file> -db <database_name> [options]
Example: blastp -query my_protein.fasta -db nr -out results.txt -evalue 0.001 -num_threads 8
PARAMETERS
-query <file>
Path to the input protein query sequence(s) in FASTA format.
-db <string>
Name of the protein database to search against. For a local search, the database must be pre-formatted using makeblastdb.
-out <file>
Path for the output file. If not specified, output goes to standard output.
-evalue <float>
Expectation value (E-value) threshold. Alignments with E-values higher than this threshold are not reported. Lower values mean stricter significance. Default is 10.
-outfmt <string>
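Output format. Common options include: 0 (pairwise, the default), 5 (BLAST XML), 6 (tabular), 7 (tabular with comment lines), 10 (comma-separated values), 11 (BLAST archive, ASN.1). Formats 6, 7, and 10 accept a space-separated list of column specifiers after the number, quoted as a single argument, e.g. '6 qseqid sseqid evalue bitscore'.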
-num_threads <integer>
Number of CPU threads to use for the search, speeding up execution on multi-core systems.
-max_target_seqs <integer>
Maximum number of aligned sequences to keep and report for each query. Default is 500.
-remote
Run the search on a remote NCBI server instead of locally. Requires an internet connection and cannot be combined with -num_threads.
-matrix <string>
Scoring matrix to use (e.g., BLOSUM62, BLOSUM45, PAM30). Default is BLOSUM62.
-word_size <integer>
Size of the initial word match (seed length). Larger values increase speed but decrease sensitivity.
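As an illustration of working with -outfmt 6, the sketch below filters tabular hits by E-value and percent identity. It assumes the default column order of format 6 (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore). The function names filter_hits and sample_hits are hypothetical, and the sample rows are made up to keep the example self-contained; they are not real search results.

```shell
#!/bin/sh
# filter_hits MAX_EVALUE MIN_PIDENT
# Reads default -outfmt 6 rows on stdin and keeps rows that pass both cutoffs.
# Column 3 = pident, column 11 = evalue, column 12 = bitscore.
filter_hits() {
    awk -F '\t' -v max_e="$1" -v min_id="$2" '
        ($11 + 0) <= (max_e + 0) && ($3 + 0) >= (min_id + 0) {
            # Print query ID, subject ID, E-value, bit score
            print $1 "\t" $2 "\t" $11 "\t" $12
        }'
}

# Made-up rows standing in for real blastp output (hypothetical IDs and values).
sample_hits() {
    printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n'
    printf 'q1\ts2\t35.0\t150\t90\t5\t10\t160\t20\t170\t0.002\t45.1\n'
    printf 'q2\ts3\t72.3\t180\t50\t0\t1\t180\t5\t184\t3e-40\t160\n'
}

# Keep hits with E-value <= 1e-5 and at least 50% identity.
sample_hits | filter_hits 1e-5 50
```

With real data, the sample_hits call would be replaced by a results file, for example filter_hits 1e-5 50 < results.tsv, where results.tsv is a placeholder for output produced with -outfmt 6.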
DESCRIPTION
blastp is a command-line tool from the NCBI BLAST (Basic Local Alignment Search Tool) suite. It compares one or more protein query sequences against a protein sequence database to identify regions of local similarity, which helps in inferring functional and evolutionary relationships between sequences. It works by finding short matches between sequences, extending these matches into longer alignments, and statistically evaluating their significance. blastp is widely used for protein annotation, homolog detection, and comparative genomics studies.
CAVEATS
A local database search requires a pre-built BLAST database (created with makeblastdb); the -subject option instead compares the query against a FASTA file directly.
For large query sets or databases, searches can be computationally intensive and time-consuming.
Interpreting results requires understanding statistical metrics like E-value and bit score.
The accuracy of results depends on the quality and currency of the database used.
Different scoring matrices (-matrix) and E-value thresholds (-evalue) can significantly impact search sensitivity and specificity.
UNDERSTANDING E-VALUE AND BIT SCORE
The E-value (expectation value) is the number of hits with an equal or better score that one can expect to see by chance when searching a database of a given size. A lower E-value means a more significant match. Because the E-value scales with the size of the search space, the same alignment receives a larger E-value against a larger database. The bit score is the alignment's raw score normalized by the statistical parameters of the scoring system, so it can be compared across searches regardless of database size.
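The relationship between the two metrics can be sketched with the Karlin-Altschul approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length in residues. The function name evalue_from_bits and the numbers below are illustrative assumptions. The sketch ignores the effective-length corrections and other adjustments blastp applies, so it will not reproduce reported E-values exactly.

```shell
#!/bin/sh
# evalue_from_bits BITSCORE QUERY_LEN DB_LEN
# Prints E = m * n * 2^(-S') for bit score S', query length m, database length n.
evalue_from_bits() {
    awk -v s="$1" -v m="$2" -v n="$3" 'BEGIN { printf "%.2e\n", m * n * 2 ^ (-s) }'
}

# Illustrative numbers only: the same bit score of 50 against two database sizes.
evalue_from_bits 50 300 100000000      # 300-residue query, 1e8-residue database
evalue_from_bits 50 300 10000000000    # same query, database 100 times larger
```

The second call yields an E-value 100 times larger than the first, which shows why a bit score is comparable across databases while an E-value is not.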
BLAST DATABASES
To perform a local blastp search, a target protein sequence database must first be created using the makeblastdb utility. This typically involves converting a FASTA file containing protein sequences into a set of indexed files that blastp can efficiently search. Common public databases include NCBI's nr (non-redundant protein sequences).
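A minimal sketch of that workflow follows. The file name toy_proteins.fasta, the database name toy_db, and the two sequences are assumptions made for illustration (the sequences are arbitrary strings of amino-acid letters, not real proteins). The BLAST+ commands run only if makeblastdb and blastp are found on PATH.

```shell
#!/bin/sh
# Hypothetical file and database names; adjust to your own data.
workdir=$(mktemp -d)
fasta="$workdir/toy_proteins.fasta"

# Two arbitrary toy sequences, only to make the sketch self-contained.
cat > "$fasta" <<'EOF'
>toy1 arbitrary example sequence
MAAAKKLLEEGGSSTTVVPPQQRRNNDDHHIIFFWWYYCCMMAAKKLLEEGGSSTTVV
>toy2 arbitrary example sequence
MAAAKKLLEEGGSSTTVVPPQQRRNNDDHHII
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastp >/dev/null 2>&1; then
    # Build a protein database named toy_db from the FASTA file.
    makeblastdb -in "$fasta" -dbtype prot -out "$workdir/toy_db"
    # Search the same sequences against the new database, tabular output.
    blastp -query "$fasta" -db "$workdir/toy_db" -outfmt 6 -evalue 1e-5
else
    echo "BLAST+ not found on PATH; skipping makeblastdb and blastp" >&2
fi

# The temporary directory in $workdir can be removed when no longer needed.
```

The -dbtype prot argument tells makeblastdb to build a protein database, and the value passed to -out is the name later given to blastp through -db.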
HISTORY
The BLAST algorithm was first described by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David Lipman in 1990. blastp is a core component of the NCBI BLAST suite, which has undergone continuous development and improvements over decades, including the development of BLAST+ (BLAST executables rewritten in C++ for better performance and flexibility). Its widespread adoption has made it an indispensable tool in genomics and proteomics research.
SEE ALSO
makeblastdb(1), blastn(1), blastx(1), tblastn(1), tblastx(1)