LinuxCommandLibrary

tblastn

Search nucleotide database with protein query

SYNOPSIS

tblastn -query QUERY_FILE -db DATABASE [-out OUTPUT_FILE] [options]

PARAMETERS

-query
    Specifies the input file containing the protein query sequence(s) in FASTA format.

-db
    Specifies the nucleotide BLAST database against which the query will be searched. This database must be pre-formatted using makeblastdb.

-out
    Specifies the output file where the search results will be written. If not specified, results are printed to standard output.

-outfmt
    Defines the output format for the results. Common formats include 0 (pairwise, the default), 5 (XML), 6 (tabular), and 7 (tabular with comment lines).

-evalue
    Sets the maximum Expectation value (E-value) for reported alignments. Alignments with E-values above this threshold are not reported. Lower thresholds are more stringent. The default is 10.

-max_target_seqs
    Limits the maximum number of aligned sequences to keep per query.

-num_threads
    Specifies the number of CPU threads to use for the search, enabling parallel processing for faster execution.

-matrix
    Specifies the scoring matrix for protein alignments, for example BLOSUM62 (the default), PAM30, or PAM70.

-word_size
    Sets the length of the initial word (seed) for the search algorithm. Larger values increase speed but decrease sensitivity.

-gapopen
    Sets the penalty for opening a gap in an alignment.

-gapextend
    Sets the penalty for extending an existing gap in an alignment.

-seg
    Enables or disables SEG filtering of low-complexity regions in the query sequence (e.g., -seg yes or -seg no).

-qcov_hsp_perc
    Sets the minimum percentage of the query that a high-scoring pair (HSP) must cover to be reported.

-max_hsps
    Limits the maximum number of HSPs to keep per subject sequence.

-remote
    Instructs tblastn to run the search on the NCBI remote server instead of locally. Requires an active internet connection.
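
The options above are usually combined in a single call. The following is a minimal sketch, not a recommended configuration: search_genome is a hypothetical helper name, the file and database names in the usage comment are placeholders, and the thresholds (1e-5, 50 targets, 4 threads) are illustrative assumptions.

```shell
# Hypothetical wrapper around a typical local tblastn search.
# $1: protein query in FASTA format
# $2: nucleotide database name (as created by makeblastdb)
# $3: output file
search_genome() {
    tblastn -query "$1" -db "$2" \
        -evalue 1e-5 \
        -max_target_seqs 50 \
        -num_threads 4 \
        -outfmt 6 \
        -out "$3"
}

# Example call (placeholder names; uncomment once the files exist):
# search_genome proteins.faa genome_db hits.tsv
```

Wrapping the call in a function keeps the chosen thresholds in one place, so they can be adjusted without retyping the full command.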

DESCRIPTION

tblastn is a command-line tool in the NCBI BLAST+ suite that compares a protein query sequence against a nucleotide sequence database translated on the fly. Unlike blastn (nucleotide vs. nucleotide) or blastp (protein vs. protein), tblastn translates each subject nucleotide sequence in all six reading frames and then aligns the query to those translations at the protein level. This lets researchers identify candidate homologous protein-coding regions in raw genomic or transcriptomic sequences, even when the exact gene structure or open reading frame is unknown.

The primary purpose of tblastn is to detect distant evolutionary relationships between a known protein and potential coding sequences in a nucleotide database. It is particularly useful in gene prediction, annotation of novel genomes, and comparative genomics studies where protein sequences from one organism are used to find corresponding protein-coding genes in the unannotated genome of another. Because it compares sequences at the protein level, tblastn detects remote similarities more sensitively than nucleotide-level searches: protein sequences conserve functional residues better over evolutionary time and are unaffected by synonymous substitutions.

CAVEATS

Computational Intensity: tblastn can be resource-intensive, especially when searching many or long protein queries against very large nucleotide databases, because the database is translated on the fly in six frames.

Translational Ambiguity and Frameshifts: Since the nucleotide database is translated in all six frames, results might include hits from non-coding regions or sequences containing frameshifts or premature stop codons, requiring careful biological interpretation.

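Database Preparation: A database passed to -db must first be built from nucleotide sequences with the makeblastdb command (using -dbtype nucl). An unformatted FASTA file cannot be used with -db, although it can be supplied directly with the -subject option instead.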
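
A minimal sketch of the preparation step, assuming a nucleotide FASTA file is available locally. build_nucl_db is a hypothetical helper name, and genome.fna, genome_db, and proteins.faa in the comments are placeholder names.

```shell
# Hypothetical helper: build a nucleotide BLAST database from a FASTA file.
# $1: nucleotide FASTA file
# $2: database name (used as the prefix of the generated index files)
build_nucl_db() {
    makeblastdb -in "$1" -dbtype nucl -out "$2"
}

# Example workflow (placeholder names; uncomment once the files exist):
# build_nucl_db genome.fna genome_db
# tblastn -query proteins.faa -db genome_db -outfmt 6 -out hits.tsv
```

The database only needs to be built once and can then be reused for any number of searches.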

Intron/Exon Structure: When searching against genomic DNA, tblastn does not model introns or splice sites. A gene with introns typically appears as several separate HSPs on the same subject sequence, one per exon and possibly in different reading frames, and the HSP boundaries only approximate the true exon boundaries.
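
One way to inspect such split hits is to list the HSPs of a query in query order, together with their genomic span and strand. The sketch below assumes the default -outfmt 6 column order and relies on subject coordinates being reported with sstart greater than send for hits on the reverse strand. The three rows are made-up values for illustration only, not real search results.

```shell
# Made-up HSP rows in the default -outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
hsps=$(mktemp)
printf 'protA\tchr1\t90.0\t60\t6\t0\t121\t180\t5400\t5579\t1e-30\t110\n' >  "$hsps"
printf 'protA\tchr1\t85.0\t60\t9\t0\t1\t60\t1000\t1179\t1e-28\t105\n'    >> "$hsps"
printf 'protA\tchr1\t88.3\t60\t7\t0\t61\t120\t3200\t3379\t1e-29\t108\n'  >> "$hsps"

# Sort by query start (column 7), then print the subject span, the strand,
# and the part of the query each HSP covers.
exons=$(sort -k7,7n "$hsps" |
    awk -F'\t' '{
        fwd    = ($9 + 0 <= $10 + 0)
        strand = fwd ? "+" : "-"
        lo     = fwd ? $9 : $10
        hi     = fwd ? $10 : $9
        printf "%s:%d-%d(%s) query %d-%d\n", $2, lo, hi, strand, $7, $8
    }')
printf '%s\n' "$exons"
rm -f "$hsps"
```

Consecutive query ranges that map to well-separated subject ranges on the same strand are consistent with exons separated by introns, but a dedicated spliced aligner is needed to confirm the gene model.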

CHOOSING THE RIGHT SCORING MATRIX

For protein-protein alignments, the choice of scoring matrix (e.g., BLOSUM62, PAM30, PAM70) is crucial. BLOSUM62 is generally recommended for detecting moderately distant relationships, while PAM matrices are better suited for very close (PAM30) or very distant (PAM250) relationships. Select the matrix with the -matrix option.

UNDERSTANDING OUTPUT FORMATS

The -outfmt option is vital for downstream analysis. Format 0 (pairwise) is human-readable, 6 (tabular) is commonly used for parsing results in scripts or spreadsheets, and 7 (tabular with comments) provides additional header information. Selecting the appropriate format streamlines data interpretation and integration into bioinformatics pipelines.
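
As an illustration of script-friendly parsing, the sketch below filters tabular output with awk. It assumes the default -outfmt 6 column order, in which column 3 is percent identity and column 11 is the E-value. The rows are made-up values for illustration only, and the cutoffs (1e-5 and 50% identity) are arbitrary assumptions.

```shell
# Made-up rows in the default -outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
hits=$(mktemp)
printf 'protA\tcontig1\t78.5\t200\t43\t0\t1\t200\t1500\t2099\t1e-80\t250\n'  >  "$hits"
printf 'protA\tcontig7\t31.0\t90\t62\t0\t10\t99\t300\t569\t0.5\t28.1\n'      >> "$hits"
printf 'protB\tcontig2\t55.2\t150\t67\t0\t5\t154\t9000\t8551\t3e-40\t140\n'  >> "$hits"

# Keep hits with E-value <= 1e-5 and at least 50% identity,
# printing query ID, subject ID, and E-value.
filtered=$(awk -F'\t' '$11 + 0 <= 1e-5 && $3 + 0 >= 50 { print $1, $2, $11 }' "$hits")
printf '%s\n' "$filtered"
rm -f "$hits"
```

With real data, the temporary file would be replaced by the file written through -out.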

LOCAL VS. REMOTE SEARCHES

While tblastn is typically run locally against a pre-built database, the -remote option allows queries against NCBI's continuously updated databases without local database management. This is convenient for small-scale queries but may have limitations on query size and speed depending on network conditions and NCBI server load.
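
A remote search might look like the following sketch. remote_search is a hypothetical helper name, the file names in the usage comment are placeholders, and nt is used only as an example of an NCBI-hosted nucleotide database. -num_threads is deliberately left out, on the assumption that threading applies only to local searches.

```shell
# Hypothetical helper: run tblastn on NCBI servers instead of locally.
# $1: protein query in FASTA format
# $2: name of an NCBI-hosted nucleotide database (for example, nt)
# $3: output file
remote_search() {
    tblastn -query "$1" -db "$2" -remote \
        -evalue 1e-5 \
        -outfmt 6 \
        -out "$3"
}

# Example call (placeholder names; requires network access):
# remote_search protein.faa nt remote_hits.tsv
```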

HISTORY

tblastn is an integral component of the Basic Local Alignment Search Tool (BLAST) suite, which was first introduced by Stephen Altschul and colleagues at the National Center for Biotechnology Information (NCBI) in 1990. The concept of translating a nucleotide database on the fly and searching it with a protein query was a significant innovation, allowing for powerful cross-type sequence comparisons. The original BLAST programs were re-engineered into the "BLAST+" suite, which was released around 2009. This rewrite aimed to improve performance, modularity, and maintainability, providing a more robust and flexible toolkit for bioinformatics research. tblastn continues to be a cornerstone in gene discovery, functional annotation, and comparative genomics, evolving with advancements in sequence data volume and computational capabilities.

SEE ALSO
