blastp
Search protein databases for sequence similarity
TLDR
Align two or more sequences using blastp, with an E-value threshold of 1e-9 and pairwise output printed to the screen
Align two or more sequences using the blastp-fast task
Align two or more sequences, writing custom tabular output to a file
Search a protein database with a protein query, using 16 threads and keeping at most 10 aligned sequences
Search the remote non-redundant protein database using a protein query
Display help (use -help for detailed help)
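The entries above can be sketched as the following commands. The file names (query.fa, subject.fa, output.tsv) and the database name blast_database_name are placeholders, and `run` is a hypothetical helper, not part of BLAST+, that only prints each command unless DRY_RUN=0 is set.

```shell
# Placeholder inputs (assumptions): query.fa, subject.fa, blast_database_name.
# run is a hypothetical wrapper: it prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Pairwise output (the default format) to the screen, E-value threshold 1e-9
run blastp -query query.fa -subject subject.fa -evalue 1e-9
# Same alignment with the blastp-fast task
run blastp -task blastp-fast -query query.fa -subject subject.fa
# Custom tabular output written to a file
run blastp -query query.fa -subject subject.fa \
    -outfmt '6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident' \
    -out output.tsv
# Database search with 16 threads, keeping at most 10 aligned sequences
run blastp -query query.fa -db blast_database_name -num_threads 16 -max_target_seqs 10
# Remote search against nr on the NCBI servers
run blastp -query query.fa -db nr -remote
# Short help
run blastp -h
```

Setting DRY_RUN=0 executes the commands instead of printing them, which requires BLAST+ on the PATH and real input files.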
SYNOPSIS
blastp [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] [-dbalias file] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-negative_seqidlist filename] [-taxids list] [-negative_taxids list] [-taxid_taxname_map filename] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-query_loc range] [-strand strand] [-parse_deflines] [-out output_file] [-use_index boolean] [-show_gis] [-num_threads number] [-remote] [-version] [-reference reference] [-comp_based_stats string] [-seg string] [-gapopen integer] [-gapextend integer] [-qcov_hsp_perc float] [-xdrop_ungap float] [-xdrop_gap float] [-xdrop_gap_final float] [-ungapped] [-lmp_dump] [-line_length integer] [-verbose] [-query_believe_defline] [-db_soft_mask string] [-db_hard_mask string] [-subject_besthit] [-culling_limit integer] [-best_hit_overhang float] [-best_hit_score_edge float] [-window_size integer] [-off_diagonal_range integer] [-use_real_db boolean] [-index_name string] [-accession_version] [-blastdb_version] [-searchsp_eff integer] [-max_hsps_per_subject integer] [-max_target_seqs integer] [-num_descriptions integer] [-num_alignments integer] [-evalue real] [-word_size integer] [-gap_trigger integer] [-no_greedy] [-min_raw_gapped_score integer] [-dust string] [-filtering_algorithm integer] [-template_type integer] [-template_length integer] [-extension_dropoff_prelim real] [-extension_dropoff real] [-window_masker_taxid integer] [-window_masker_db string] [-perc_identity real] [-length integer] [-hspmax integer] [-hitseq_start integer] [-hitseq_end integer] [-qseq_start integer] [-qseq_end integer] [-max_intron_length integer] [-effective_search_space number] [-inframe_query] [-outfmt format] [-show_domain_id] [-domain_significance_levels string] [-use_sw_tback] [-use_sum_tback] [-sum_statistics] [-pseudocount integer] [-inclusion_ethresh real]
PARAMETERS
-query file
Input file containing the query sequence(s) in FASTA format.
-db database
Name of the protein database to search against (e.g., 'nr').
-out file
Output file to store the BLASTP results.
-evalue value
Expect value (E-value) threshold for reporting significant hits. Lower values are more stringent.
-num_threads integer
Number of CPU threads to use for the search. Speeds up execution.
-outfmt format
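Format of the output. Common formats include 0 (pairwise, the default), 5 (XML), 6 (tabular), and 7 (tabular with comment lines).
Format 6 is highly recommended for parsing: it is a tab-separated file without header lines, and its default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore.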
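As a minimal sketch of parsing format 6 output, the snippet below filters hits with awk. It assumes the default column order, where column 3 is pident, column 11 is evalue, and column 12 is bitscore. The three hit lines are hypothetical sample data, and the cutoffs (E-value 1e-5, 30% identity) are arbitrary choices for illustration.

```shell
# Hypothetical hits in the default -outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
hits=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    q1 subjA 98.50 200 3 0 1 200 1 200 2e-140 390 \
    q1 subjB 35.20 180 110 4 10 185 22 201 3e-20 88.2 \
    q1 subjC 27.10 90 60 3 50 135 5 92 0.8 30.1)

# Keep hits with E-value <= 1e-5 and identity >= 30%;
# print query, subject, E-value, and bit score.
filtered=$(printf '%s\n' "$hits" |
    awk -F'\t' -v OFS='\t' '$11 + 0 <= 1e-5 && $3 + 0 >= 30 { print $1, $2, $11, $12 }')
printf '%s\n' "$filtered"
```

With real results, the same awk filter can read the file given to `-out` instead of the sample variable.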
-remote
Execute the search on the NCBI servers.
-task task_name
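Specifies a predefined search strategy to use. Useful task values are: blastp (the default), blastp-fast (a faster variant that uses longer words for seeding), and blastp-short (optimized for queries shorter than 30 residues).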
-max_target_seqs integer
Maximum number of aligned sequences to keep. The default is 500; lowering this number can be useful for testing purposes.
DESCRIPTION
BLASTP, which stands for Basic Local Alignment Search Tool - Protein, is a powerful command-line tool used for comparing a protein query sequence against a protein sequence database. It identifies regions of local similarity between the query sequence and sequences in the database.
The core algorithm uses heuristics to rapidly identify significant alignments. The program evaluates statistical significance of hits using a statistical model appropriate for local alignments with gaps. The results include a list of database sequences that have significant similarity to the query sequence, along with alignment scores, E-values, and other statistical information.
BLASTP is widely used in bioinformatics for various tasks, including protein function prediction, identifying homologs, and exploring evolutionary relationships. It is a crucial tool for analyzing newly sequenced proteins, helping researchers understand their potential roles and interactions within biological systems. The command requires a query sequence (in FASTA format or a similar format) and a protein database. It also allows you to customize search parameters to fine-tune the sensitivity and specificity of the search.
Understanding the output and tweaking search parameters is key to obtaining meaningful results.
CAVEATS
A local database must be built with `makeblastdb` before it can be searched with `-db`. Database size and query length can significantly affect execution time. Understanding E-values is crucial for interpreting results. Always cite the BLAST software in publications.
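A minimal sketch of the build-then-search workflow follows. The two FASTA records are arbitrary toy sequences standing in for real data, and the names toydb and hits.tsv are placeholders. The BLAST+ commands run only if both tools are found on the PATH.

```shell
# Work in a scratch directory; the sequences are toy stand-ins for real data.
workdir=$(mktemp -d)
cat > "$workdir/proteins.fa" <<'EOF'
>toy1 example protein 1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV
>toy2 example protein 2
MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTASWFTALTQHGKEDLKFPRGQ
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastp >/dev/null 2>&1; then
    # Build a protein database named toydb from the FASTA file
    makeblastdb -in "$workdir/proteins.fa" -dbtype prot -out "$workdir/toydb"
    # Search the same sequences against it; tabular output goes to a file
    blastp -query "$workdir/proteins.fa" -db "$workdir/toydb" \
        -evalue 1e-5 -outfmt 6 -out "$workdir/hits.tsv"
else
    echo "BLAST+ not found on PATH; skipping database build and search" >&2
fi
```

The `-dbtype prot` flag matters here: a database built as nucleotide cannot be searched with blastp.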
DATABASE CONSIDERATIONS
Selecting the appropriate database is crucial. The 'nr' database is a comprehensive non-redundant protein database, but smaller, more specialized databases may be more suitable for specific analyses. Ensure the database is up-to-date for best results.
FILTERING
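The `-seg` option filters low-complexity regions in the query sequence, which can reduce spurious hits. (`-dust` is the nucleotide counterpart used by blastn and does not apply to protein searches.) Masking stored in the database can be applied with `-db_soft_mask` or `-db_hard_mask`. These options affect both the search speed and the resulting hits.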
UNDERSTANDING E-VALUES
The E-value is the number of hits with an equal or better score that would be expected by chance in a search of a database of the given size. Lower E-values indicate more significant alignments. The blastp default for `-evalue` is 10, which is permissive; a common threshold is E < 0.05 or E < 0.01, but this can be adjusted depending on the specific application.
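To make the dependence on database size concrete, the sketch below uses the standard approximation E = m * n * 2^(-S'), where m is the query length, n is the total database length in residues, and S' is the bit score. The three input values are hypothetical, and the calculation ignores the effective-length corrections that BLAST applies, so it will not reproduce reported E-values exactly.

```shell
# E is approximately m * n * 2^(-S'):
#   m = query length, n = total database length, s = bit score S'.
# All three inputs are hypothetical values chosen for illustration.
evalue=$(awk -v m=250 -v n=1e8 -v s=50 'BEGIN { printf "%.3g", m * n * 2 ^ (-s) }')
echo "E-value for bit score 50: $evalue"

# The same bit score against a database ten times larger gives a tenfold larger E-value.
evalue_big=$(awk -v m=250 -v n=1e9 -v s=50 'BEGIN { printf "%.3g", m * n * 2 ^ (-s) }')
echo "E-value in a 10x larger database: $evalue_big"
```

This is why an E-value cutoff tuned on a small custom database may need revisiting when the same query is run against a large one such as nr.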
HISTORY
BLAST (Basic Local Alignment Search Tool) was introduced in 1990 by researchers at the National Center for Biotechnology Information (NCBI) and collaborating institutions. The original algorithm was designed for rapid sequence comparison.
The BLASTP program was specifically designed for comparing protein sequences. Over the years, BLAST has undergone numerous revisions and improvements, including the introduction of gapped BLAST, Position-Specific Iterated BLAST (PSI-BLAST), and other enhancements to improve sensitivity and speed. It became a foundational tool for bioinformatics and molecular biology, enabling researchers to quickly identify homologous sequences and infer protein functions. The command-line versions of BLAST, including blastp, remain widely used in high-throughput sequence analysis pipelines.
SEE ALSO
blastn(1), blastx(1), tblastn(1), tblastx(1), makeblastdb(1)