LinuxCommandLibrary

blastp

Search protein databases for sequence similarity

TLDR

Align two or more sequences using blastp with an E-value threshold of 1e-9, printing the default pairwise output to the screen
$ blastp -query [query.fa] -subject [subject.fa] -evalue [1e-9]

Align two or more sequences using blastp-fast
$ blastp -task blastp-fast -query [query.fa] -subject [subject.fa]

Align two or more sequences, custom tabular output format, output to file
$ blastp -query [query.fa] -subject [subject.fa] -outfmt '[6 qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident]' -out [output.tsv]

Search a protein database with a protein query, using 16 threads and keeping at most 10 aligned sequences
$ blastp -query [query.fa] -db [blast_database_name] -num_threads [16] -max_target_seqs [10]

Search the remote non-redundant protein database using a protein query
$ blastp -query [query.fa] -db [nr] -remote

Display help (use -help for detailed help)
$ blastp -h
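The custom tabular format shown above can be post-processed with standard tools. As a minimal sketch, the following computes per-alignment query coverage from the qlen, qstart and qend columns. It assumes the exact column order of that example (qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident); the file name `sample_custom.tsv`, the identifiers and the values are hypothetical.

```shell
# Hypothetical output in the custom column order:
# qseqid qlen qstart qend sseqid slen sstart send bitscore evalue pident
printf 'q1\t250\t1\t200\ts1\t300\t5\t204\t400\t1e-120\t98.5\n' > sample_custom.tsv
printf 'q2\t100\t11\t60\ts3\t180\t1\t50\t95.2\t4e-20\t60.0\n' >> sample_custom.tsv

# Print qseqid, sseqid and the percentage of the query covered by the alignment.
query_coverage() {
    awk -F'\t' '{ printf "%s\t%s\t%.1f\n", $1, $5, ($4 - $3 + 1) / $2 * 100 }' "$1"
}

query_coverage sample_custom.tsv
```

The coverage here is per alignment (HSP), not summed over all alignments between a query and a subject.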

SYNOPSIS

blastp [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-negative_seqidlist filename] [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename] [-negative_taxidlist filename] [-ipglist filename] [-negative_ipglist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-qcov_hsp_perc float_value] [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-seg SEG_options] [-soft_masking soft_masking] [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-subject_besthit] [-window_size int_value] [-lcase_masking] [-query_loc range] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-line_length line_length] [-html] [-sorthits sort_hits] [-sorthsps sort_hsps] [-max_target_seqs num_sequences] [-num_threads int_value] [-mt_mode int_value] [-ungapped] [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

PARAMETERS

-query file
    Input file containing the query sequence(s) in FASTA format.

-db database
    Name of the protein database to search against (e.g., 'nr').

-out file
    Output file to store the BLASTP results.

-evalue value
    Expect value (E-value) threshold for reporting significant hits. Lower values are more stringent.

-num_threads integer
    Number of CPU threads to use for the search. Speeds up execution.

-outfmt format
    Output format. Common values are 0 (pairwise, the default), 5 (XML), 6 (tabular, no header line), and 7 (tabular with comment lines). Format 6 is recommended for parsing: it is tab-separated, has one line per alignment, and its columns can be customized.

-remote
    Execute the search on the NCBI servers.

-task task_name
    Specifies a predefined search strategy to use. Useful task values are: blastp, blastp-fast, blastp-short.

-max_target_seqs integer
    Maximum number of aligned sequences to keep. Default is 500; lowering it can be useful for testing.
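
As a sketch of how tabular output is typically consumed, the following filters format 6 results by E-value and percent identity with awk. It assumes the default format 6 column order (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore); the file name `sample_hits.tsv`, the identifiers and the thresholds are hypothetical.

```shell
# Hypothetical default -outfmt 6 output (tab-separated, no header line).
# Columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t400\n' > sample_hits.tsv
printf 'q1\ts2\t35.0\t150\t90\t5\t10\t159\t20\t169\t2e-05\t48.1\n' >> sample_hits.tsv
printf 'q2\ts3\t72.3\t180\t50\t0\t1\t180\t5\t184\t3e-60\t210\n' >> sample_hits.tsv

# Usage: filter_hits FILE MAX_EVALUE MIN_PIDENT
# Keeps lines with evalue (column 11) <= MAX_EVALUE and pident (column 3) >= MIN_PIDENT.
filter_hits() {
    awk -F'\t' -v max_evalue="$2" -v min_pident="$3" \
        '($11 + 0) <= (max_evalue + 0) && ($3 + 0) >= (min_pident + 0)' "$1"
}

filter_hits sample_hits.tsv 1e-9 50
```

Adding 0 to each field forces awk to compare the values as numbers, which matters for E-values written in scientific notation.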

DESCRIPTION

BLASTP, which stands for Basic Local Alignment Search Tool - Protein, is a powerful command-line tool used for comparing a protein query sequence against a protein sequence database. It identifies regions of local similarity between the query sequence and sequences in the database.

The core algorithm uses heuristics to rapidly identify significant alignments. The program evaluates statistical significance of hits using a statistical model appropriate for local alignments with gaps. The results include a list of database sequences that have significant similarity to the query sequence, along with alignment scores, E-values, and other statistical information.

BLASTP is widely used in bioinformatics for tasks such as protein function prediction, homolog identification, and the study of evolutionary relationships. It is a standard tool for analyzing newly sequenced proteins, helping researchers understand their potential roles and interactions within biological systems. The command requires a query (in FASTA format or a similar format) and either a protein database (`-db`) or a file of subject sequences (`-subject`). Search parameters can be adjusted to tune the sensitivity and specificity of the search, and meaningful results depend on understanding both the output and those parameters.
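
A common first step in reading the results is to reduce them to the best-scoring hit per query. The following is a minimal sketch, assuming default format 6 tabular output in which column 1 is qseqid and column 12 is bitscore; the file name `sample_all_hits.tsv` and its contents are hypothetical.

```shell
# Hypothetical default -outfmt 6 output; column 1 is qseqid, column 12 is bitscore.
printf 'q1\ts2\t35.0\t150\t90\t5\t10\t159\t20\t169\t2e-05\t48.1\n' > sample_all_hits.tsv
printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t400\n' >> sample_all_hits.tsv
printf 'q2\ts4\t41.0\t120\t60\t2\t1\t120\t3\t122\t4e-20\t95.2\n' >> sample_all_hits.tsv
printf 'q2\ts3\t72.3\t180\t50\t0\t1\t180\t5\t184\t3e-60\t210\n' >> sample_all_hits.tsv

# Sort by query, then by bit score (highest first), and keep the first line per query.
best_hits() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k12,12nr "$1" | awk -F'\t' '!seen[$1]++'
}

best_hits sample_all_hits.tsv
```

Sorting explicitly avoids relying on the order in which hits happen to appear in the file.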

CAVEATS

A local database must be built with `makeblastdb` before it can be searched with `-db`. Execution time grows with the size of the database and of the query set. Interpreting results requires an understanding of E-values (see below). Cite the BLAST software in publications that use it.

DATABASE CONSIDERATIONS

Selecting the appropriate database is crucial. The 'nr' database is a comprehensive non-redundant protein database, but smaller, more specialized databases may be more suitable for specific analyses. Ensure the database is up-to-date for best results.

FILTERING

The `-seg` option masks low-complexity regions in the query, which can reduce spurious hits (`-dust` is the corresponding option for nucleotide searches with blastn and does not apply to blastp). Masking affects both the search speed and the hits reported.

UNDERSTANDING E-VALUES

The E-value is the number of hits with an equivalent or better score that would be expected by chance in a search of the same size. Lower E-values indicate more significant alignments. blastp reports hits up to the `-evalue` threshold, which defaults to 10; a stricter cutoff such as E < 0.05 or E < 0.01 is common, and the appropriate value depends on the application.
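
To illustrate how the E-value depends on score and search size, the sketch below applies the standard relation E = m * n * 2^(-S'), where m is the query length, n is the total database length and S' is the bit score. The function name and the input values are hypothetical, and BLAST itself uses length-corrected (effective) values of m and n, so the E-values it reports will differ somewhat from this approximation.

```shell
# Usage: evalue_from_bitscore QUERY_LENGTH DATABASE_LENGTH BIT_SCORE
# Approximates E = m * n * 2^(-S') using raw, uncorrected lengths (an assumption).
evalue_from_bitscore() {
    awk -v m="$1" -v n="$2" -v s="$3" \
        'BEGIN { printf "%.2e\n", m * n * exp(-s * log(2)) }'
}

# Hypothetical search: 250-residue query, 100 million residues in the database, bit score 50.
evalue_from_bitscore 250 100000000 50
```

Each additional bit of score halves the E-value, while doubling the database size doubles it, which is why the same alignment can be significant against a small database and not against a large one.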

HISTORY

BLAST (Basic Local Alignment Search Tool) was initially developed in the early 1990s by researchers at the National Center for Biotechnology Information (NCBI). The original algorithm was designed for rapid sequence comparison.

The BLASTP program was specifically designed for comparing protein sequences. Over the years, BLAST has undergone numerous revisions and improvements, including the introduction of gapped BLAST, Position-Specific Iterated BLAST (PSI-BLAST), and other enhancements to improve sensitivity and speed. It became a foundational tool for bioinformatics and molecular biology, enabling researchers to quickly identify homologous sequences and infer protein functions. The command-line versions of BLAST, including blastp, remain widely used in high-throughput sequence analysis pipelines.

SEE ALSO

blastn(1), blastx(1), tblastn(1), tblastx(1), makeblastdb(1)
