LinuxCommandLibrary

blastx

Search translated nucleotide sequence against protein database

SYNOPSIS

blastx [-h] [-help] [-version] [-query <File_in>] [-db <Str>] [-out <File_out>] [OPTIONS...]

PARAMETERS

-query <File_in>
    Nucleotide query file in FASTA format

-db <String>
    Protein database name (pre-formatted)

-out <File_out>
    Output file; default stdout

-evalue <Real>
    Expectation threshold (default 10.0)

-outfmt <String>
    Output format (e.g., 6 for tabular)

-num_threads <Int>
    CPU threads (default 1)

-max_target_seqs <Int>
    Max aligned target sequences to keep (default 500)

-max_hsps <Int>
    Max HSPs to keep per subject sequence (>= 1; no limit when unset)

-seg <String>
    SEG low-complexity filter for the query: 'yes', 'window locut hicut', or 'no' (default '12 2.2 2.5')

-dust <String>
    Low-complexity filter for nucleotide queries in blastn; blastx does not accept it (use -seg)

-soft_masking <Bool>
    Apply filtering locations as soft masks (default false)

-lcase_masking
    Treat lowercase letters in the query as masked

-parse_deflines
    Parse query and subject deflines

-query_gencode <Int>
    Genetic code (default 1)

-strand <String>
    Query strand(s) to translate and search: 'both' (default), 'plus', or 'minus'

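-num_alignments <Int>
    Number of database sequences to show alignments for (default 250); applies to -outfmt 0-4 and cannot be combined with -max_target_seqs

-num_descriptions <Int>
    Number of one-line descriptions to show (default 500); applies to -outfmt 0-4 and cannot be combined with -max_target_seqs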

-ungapped
    Ungapped alignments

-word_size <Int>
    Word size (default 3)

-matrix <String>
    Scoring matrix (default BLOSUM62)

-threshold <Int>
    Neighborhood word threshold

-comp_based_stats <Int>
    Composition-based statistics mode (0-3; default 2)

-use_sw_tback
    Use Smith-Waterman traceback

DESCRIPTION

blastx is a command-line tool from NCBI's BLAST+ suite for bioinformatics. It translates a nucleotide query sequence in all six reading frames (three forward, three reverse-complement) into hypothetical proteins and searches a protein database for similar sequences using the BLAST algorithm. This identifies potential protein-coding regions in genomic DNA, ESTs, or cDNA even without prior annotation.

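Because the comparison happens at the protein level, blastx can detect distant homologs that a nucleotide search would miss. Frameshifts, introns, and sequencing errors typically split a match into several HSPs, often in different frames, rather than hiding it. Output includes alignments with bit scores, E-values, identities, positives, and gaps. The E-value is the number of hits of that quality expected by chance, so lower is better (e.g., <0.001).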

blastx suits large-scale genomic analysis, metagenomics, and functional annotation. It requires a pre-built protein database such as nr or swissprot. Searches are computationally intensive and benefit from multi-threading (-num_threads). Tabular output (-outfmt 6) is easy to parse, which makes blastx straightforward to integrate into pipelines.
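A typical pipeline-style invocation might look like the sketch below. The file names reads.fasta, dbname, and hits.tsv are placeholders chosen for illustration, and the thresholds are examples rather than recommendations. The guard lets the snippet run harmlessly on a machine where blastx or the query file is missing.

```shell
query=reads.fasta   # hypothetical nucleotide FASTA file
db=dbname           # hypothetical protein database built with makeblastdb
out=hits.tsv        # hypothetical output path

if command -v blastx >/dev/null 2>&1 && [ -f "$query" ]; then
    # Tabular output, stricter E-value, 4 threads, at most 10 targets per query.
    blastx -query "$query" -db "$db" -out "$out" \
        -evalue 1e-5 -outfmt 6 -num_threads 4 -max_target_seqs 10
    status=attempted
else
    echo "blastx or $query not available; skipping search" >&2
    status=skipped
fi
echo "search $status"
```

Raising -num_threads helps only up to the number of available cores, and a lower -max_target_seqs keeps the tabular output small.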

CAVEATS

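Requires an installed BLAST+ and pre-formatted databases (build them with makeblastdb). Large databases need significant RAM and disk space. Six-frame translation makes blastx slower than blastp on comparable input. The default low-complexity filter (-seg) can mask regions and hide genuine hits, so tune it with care. -num_alignments and -num_descriptions cannot be combined with -max_target_seqs.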

COMMON OUTPUT FORMATS

-outfmt 6: tabular (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore).
-outfmt 0: pairwise.
-outfmt 7: tabular with comment lines.
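As a small illustration of parsing the 12-column tabular layout, the sketch below picks the highest-bitscore hit for each query. The rows written to hits.tsv are made up for the example and are not real BLAST results; only the column order follows -outfmt 6.

```shell
# Made-up rows in the default -outfmt 6 column order (not real BLAST output).
printf 'read1\tprotB\t35.0\t60\t39\t0\t10\t189\t20\t79\t0.5\t30.1\n'  >  hits.tsv
printf 'read1\tprotA\t92.3\t130\t10\t0\t1\t390\t5\t134\t1e-80\t250\n' >> hits.tsv
printf 'read2\tprotC\t88.0\t100\t12\t0\t3\t302\t1\t100\t2e-50\t180\n' >> hits.tsv

tab=$(printf '\t')

# Sort by query (column 1), then by bitscore (column 12) descending,
# and keep the first row seen for each query.
best_hits=$(LC_ALL=C sort -t "$tab" -k1,1 -k12,12nr hits.tsv |
    awk -F '\t' '!seen[$1]++ { print $1, $2, $12 }')
echo "$best_hits"
```

With the sample rows this prints read1 protA 250 and read2 protC 180, one best hit per query.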

DATABASE PREP

Download a pre-formatted database from NCBI (e.g., nr), or build one from a protein FASTA file: makeblastdb -in proteins.fasta -dbtype prot -out dbname.
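The sketch below walks through the build step on a toy input. The two protein records are invented for illustration, and the names proteins.fasta and dbname simply mirror the command above. makeblastdb runs only when it is found on PATH.

```shell
# Two made-up protein records, for illustration only.
cat > proteins.fasta <<'EOF'
>protA hypothetical example protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV
>protB hypothetical example protein
MSDNELQKLAEEAGVSKRTLYRYFKNKEELLAAVLE
EOF

# Count the records that makeblastdb would index.
n_records=$(grep -c '^>' proteins.fasta)
echo "records: $n_records"

# Build the protein database only when BLAST+ is installed.
if command -v makeblastdb >/dev/null 2>&1; then
    makeblastdb -in proteins.fasta -dbtype prot -out dbname
else
    echo "makeblastdb not found; skipping database build" >&2
fi
```

The resulting database is then passed to blastx by name, as in -db dbname.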

E-VALUE INTERPRETATION

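The E-value is the number of hits with a score at least this good that are expected by chance in a database of this size. Values below 10^-5 are commonly treated as significant. Lower the -evalue threshold for a more stringent search.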
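A stricter cutoff can also be applied after the search by filtering column 11 of tabular output. The following is a minimal sketch: the rows in evalue_demo.tsv are made up for the example, and the 1e-5 cutoff is the rule of thumb mentioned above, not a universal threshold.

```shell
# Made-up rows in the default -outfmt 6 column order (not real BLAST output).
printf 'read1\tprotA\t92.3\t130\t10\t0\t1\t390\t5\t134\t1e-80\t250\n' >  evalue_demo.tsv
printf 'read1\tprotB\t35.0\t60\t39\t0\t10\t189\t20\t79\t0.5\t30.1\n'  >> evalue_demo.tsv
printf 'read2\tprotC\t88.0\t100\t12\t0\t3\t302\t1\t100\t2e-50\t180\n' >> evalue_demo.tsv

cutoff=1e-5

# Column 11 is the E-value; adding 0 forces a numeric comparison in awk.
kept=$(awk -F '\t' -v cutoff="$cutoff" \
    '($11 + 0) < (cutoff + 0) { print $1, $2, $11 }' evalue_demo.tsv)
echo "$kept"
```

Only the protA and protC rows pass; the protB hit with E-value 0.5 is dropped.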

HISTORY

BLAST was developed at NCBI and first described by Altschul et al. (1990); the gapped BLAST algorithm followed in 1997. BLAST+ 2.2.22+ (2010) replaced the legacy C version with a faster C++ implementation that offers better threading and supports modern formats. It has been widely used in genomics since.
