blastp

Search protein databases for sequence similarity

TLDR

Align two or more sequences with an E-value threshold of 1e-9 (default pairwise output format, printed to the screen)

$ blastp -query [query.fa] -subject [subject.fa] -evalue [1e-9]

Align two or more sequences using blastp-fast

$ blastp -task blastp-fast -query [query.fa] -subject [subject.fa]

Align two or more sequences, custom tabular output format, output to file

$ blastp -query [query.fa] -subject [subject.fa] -outfmt '[6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident]' -out [output.tsv]

Search a protein database with a protein query, using 16 threads and keeping at most 10 aligned sequences per query

$ blastp -query [query.fa] -db [blast_database_name] -num_threads [16] -max_target_seqs [10]

Search the remote non-redundant protein database using a protein query

$ blastp -query [query.fa] -db [nr] -remote

Display help (use -help for detailed help)

$ blastp -h

SYNOPSIS

blastp -query <input_query_file> {-db <database_name> | -subject <subject_file>} [options]
Example: blastp -query my_protein.fasta -db nr -out results.txt -evalue 0.001 -num_threads 8

-query <file>
    Path to the input protein query sequence(s) in FASTA format.

-db <string>
    Name of the protein database to search against. A local database must first be built with makeblastdb. To compare against a plain FASTA file instead, use -subject in place of -db.

-out <file>
    Path for the output file. If not specified, output goes to standard output.

-evalue <float>
    Expectation value (E-value) threshold. Alignments with an E-value above this threshold are not reported, so lower values are more stringent. The default is 10.

-outfmt <string>
    Output format. Common options include: 0 (Pairwise), 5 (XML), 6 (Tabular), 7 (Tabular with comment lines), 10 (Comma-separated values), 11 (BLAST archive, ASN.1). Formats 6, 7, and 10 accept a custom column list, given as space-separated format specifiers after the format number.

-num_threads <integer>
    Number of CPU threads to use for the search, speeding up execution on multi-core systems.

-max_target_seqs <integer>
    Maximum number of aligned sequences to keep and report for each query.

-remote
    Run the search on NCBI's servers instead of locally. Requires an internet connection and cannot be combined with -num_threads.

-matrix <string>
    Scoring matrix to use (e.g., BLOSUM62, BLOSUM45, PAM30). Default is BLOSUM62.

-word_size <integer>
    Size of the initial word match (seed length). Larger values increase speed but decrease sensitivity.
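
The custom tabular format (-outfmt 6 with a column list) is easy to post-process with standard text tools. The following is a minimal sketch, assuming the column order from the TLDR example (qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident). The function name filter_hits, the sample rows, and the thresholds are hypothetical and only illustrate the idea: keep hits below an E-value cutoff whose alignment covers a minimum fraction of the query.

```shell
# Hypothetical sample rows in the assumed column order:
# qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident
hits=$(mktemp)
printf 'q1\t300\t1\t290\ts1\t310\t5\t295\t520\t1e-150\t88.5\n'  > "$hits"
printf 'q1\t300\t10\t60\ts2\t500\t100\t150\t35.2\t0.5\t31.0\n' >> "$hits"
printf 'q2\t200\t21\t180\ts3\t220\t15\t175\t140\t3e-40\t45.2\n' >> "$hits"

# filter_hits FILE MAX_EVALUE MIN_QUERY_COVERAGE
# Prints qseqid, sseqid, evalue and query coverage for rows passing both cutoffs.
filter_hits() {
  awk -F'\t' -v max_e="$2" -v min_cov="$3" '
    { cov = ($4 - $3 + 1) / $2 }          # aligned query span / query length
    ($10 + 0) <= (max_e + 0) && cov >= (min_cov + 0) {
      printf "%s\t%s\t%s\t%.2f\n", $1, $5, $10, cov
    }
  ' "$1"
}

filter_hits "$hits" 1e-9 0.5
```

With real data, replace the temporary file with the path passed to -out. If the column list given to -outfmt differs, the field numbers in the awk program must be adjusted to match.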

DESCRIPTION

blastp is the protein-protein search program of the NCBI BLAST (Basic Local Alignment Search Tool) suite. It compares one or more protein query sequences against a protein sequence database, or against a set of subject sequences, to identify regions of local similarity. Such similarity helps in inferring functional and evolutionary relationships between sequences. The program finds short matches (words) between sequences, extends them into longer alignments, and evaluates the statistical significance of each alignment. blastp is widely used for protein annotation, homolog detection, and comparative genomics.

CAVEATS

A local search with -db requires a pre-built BLAST database (created with makeblastdb). Searches with -subject work directly on FASTA files and need no database.
For large query sets or databases, searches can be computationally intensive and time-consuming.
Interpreting results requires understanding statistical metrics like E-value and bit score.
The accuracy of results depends on the quality and currency of the database used.
Different scoring matrices (-matrix) and E-value thresholds (-evalue) can significantly impact search sensitivity and specificity.

UNDERSTANDING E-VALUE AND BIT SCORE

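The E-value (expectation value) is the number of hits with at least the same score that one can expect to see by chance when searching a database of a given size. A lower E-value means a more significant match. Because it scales with the size of the search space, the same alignment receives a larger E-value against a larger database. The bit score measures sequence similarity independently of database size: it is the alignment's raw score normalized by the statistical parameters of the scoring system, so bit scores from different searches can be compared.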
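
The relationship between the two metrics can be made concrete with the standard approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length in residues. The sketch below is illustrative only: the function name evalue_from_bitscore is hypothetical, and BLAST itself uses length-corrected ("effective") values for m and n, so the numbers it reports will differ somewhat.

```shell
# evalue_from_bitscore QUERY_LEN DB_LEN BIT_SCORE
# Approximates E = m * n * 2^(-S'); ignores BLAST's effective-length correction.
evalue_from_bitscore() {
  awk -v m="$1" -v n="$2" -v s="$3" \
    'BEGIN { printf "%.3g\n", m * n * exp(-s * log(2)) }'
}

# A 300-residue query with a bit score of 50 against databases of two sizes:
evalue_from_bitscore 300 1000000 50
evalue_from_bitscore 300 100000000 50
```

The second call uses a database 100 times larger and yields an E-value 100 times larger for the same bit score, which is why E-values from searches against different databases are not directly comparable.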

BLAST DATABASES

To search a local database with blastp, a target protein sequence database must first be created with the makeblastdb utility, which converts a FASTA file of protein sequences into a set of indexed files that blastp can search efficiently. Common public databases include NCBI's nr (non-redundant protein sequences), which NCBI also distributes in pre-formatted form.
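
As a rough illustration of the workflow described above, the following sketch writes a tiny FASTA file, builds a protein database from it, and searches that database. The file names, the database name demo_db, and both sequences are made-up placeholders. The BLAST+ commands are only run when makeblastdb is found on the PATH.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two short, made-up protein records used purely as placeholder input.
cat > demo_proteins.fa <<'EOF'
>demo1 hypothetical protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>demo2 hypothetical protein
MSDNELKKLVEAGAHFGHQTRRWNPKMAPYIF
EOF

if command -v makeblastdb >/dev/null 2>&1; then
  # Build the indexed database files (demo_db.*) from the FASTA input.
  makeblastdb -in demo_proteins.fa -dbtype prot -out demo_db
  # Search the new database, here with the same sequences as the query.
  blastp -query demo_proteins.fa -db demo_db -outfmt 6 -evalue 1e-9
else
  echo "makeblastdb not found; install BLAST+ to build the database" >&2
fi
```

The -dbtype prot option tells makeblastdb that the input is protein, and the name given to -out is the value later passed to blastp via -db.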

HISTORY

The BLAST algorithm was first described by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David Lipman in 1990. blastp is a core component of the NCBI BLAST suite, which has been developed continuously since then. A major step was BLAST+, a rewrite of the BLAST executables in C++ for better performance and flexibility. The suite remains a standard tool in genomics and proteomics research.