compseq
Calculate the composition of unique words (n-mers) in nucleotide or protein sequences
TLDR
Count observed frequencies of words in a FASTA file, supplying parameter values at the interactive prompts
Count observed frequencies of amino acid pairs from a FASTA file, save output to a text file
Count observed frequencies of hexanucleotides from a FASTA file, save output to a text file and ignore zero counts
Count observed frequencies of codons in a particular reading frame, ignoring any overlapping counts (i.e. move the window along by the word length, 3, each time)
Count observed frequencies of codons frame-shifted by 3 positions, ignoring any overlapping counts (should report all codons except the first one)
Count amino acid triplets in a FASTA file and compare to a previous run of compseq to calculate expected and normalised frequency values
Approximate the above command without a previously prepared file, by calculating expected frequencies using the single base/residue frequencies in the supplied input sequence(s)
Display help (use -help -verbose for more information on associated and general qualifiers)
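The difference between overlapping counts and single-frame counts in the entries above can be illustrated with a minimal Python sketch. This is not compseq's own code: `count_words`, `offset` and `step` are hypothetical names, and the sketch deliberately does not model how a particular `-frame` value maps to a starting position.

```python
from collections import Counter


def count_words(seq, word, offset=0, step=1):
    """Count words of length `word` in `seq`.

    `offset` is the 0-based start of the first window and `step` is how far
    the window moves each time (both are illustrative parameters, not
    compseq qualifiers).
    """
    seq = seq.upper()
    counts = Counter()
    for pos in range(offset, len(seq) - word + 1, step):
        counts[seq[pos:pos + word]] += 1
    return counts


seq = "ATGGCCATGAAA"

# Window moves one base at a time, so neighbouring words overlap.
overlapping = count_words(seq, 3)

# Window moves by the word length, so only one reading frame is counted.
in_frame = count_words(seq, 3, offset=0, step=3)

# Same frame, but starting three bases in: the first codon is skipped.
shifted = count_words(seq, 3, offset=3, step=3)
```

Here `overlapping` holds ten words, while `in_frame` holds only the four codons ATG, GCC, ATG and AAA.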
SYNOPSIS
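compseq [-sequence] <input_file> [-word <N>] [[-outfile] <output_file>] [-infile <compseq_file>] [-frame <N>] [-noignorebz] [-reverse] [-calcfreq] [-nozerocount] [-auto] [-stdout]
PARAMETERS
-sequence <input_file>
Required. Specifies the input sequence file (e.g., FASTA, GenBank) to be analyzed. The qualifier name can be omitted if the file is given as the first argument.
-word <N>
Size of the word (n-mer) to count, from 1 to 20; the default is 2. Use 3 to count codons in a nucleotide sequence. compseq prompts for this value if it is not given on the command line.
-outfile <output_file>
Results file. If not specified, compseq prompts for a name and offers a default ending in .compseq.
-infile <compseq_file>
A results file from a previous compseq run, used to set the expected word frequencies. It must have been produced with the same word size and from the same sequence type (nucleotide or protein) as the current run.
-frame <N>
With the default of 0, the window moves along by one position at a time and every overlapping word is counted. With a non-zero value, the window moves along by the word length each time, so only the words in that single frame are counted.
-noignorebz
Protein sequences only. By default, words containing the ambiguity codes B (Asn or Asp) and Z (Gln or Glu) are not counted individually and are reported under 'Other'; this option counts them.
-reverse
Nucleotide sequences only. Also counts words in the reverse complement of the sequence.
-calcfreq
Calculates expected word frequencies from the observed frequencies of single bases or residues in the input. It has no useful effect with a word size of 1.
-nozerocount
Omits words with a zero count from the results file, which can make the output much smaller.
-auto
Turns off interactive prompts and accepts default values, which is useful in scripts.
-stdout
Writes the results to standard output instead of a file.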
DESCRIPTION
compseq is a command-line utility from the EMBOSS (European Molecular Biology Open Software Suite) package, widely used in bioinformatics. It reads one or more biological sequences (DNA, RNA, or protein) and counts how often each word of a chosen length occurs, for example dinucleotides, codons, hexanucleotides, or amino acid pairs and triplets.
For each word the results file reports the observed count, the observed frequency, an expected frequency, and the ratio of observed to expected frequency. By default the expected frequency assumes that every word is equally likely. More realistic expectations can be taken from a previous compseq results file (-infile) or estimated from the single base/residue frequencies of the input itself (-calcfreq). Counting can cover every overlapping word or be restricted to a single frame with -frame.
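The TLDR entry on approximating expected frequencies from single base/residue frequencies can be illustrated with a short Python sketch. It assumes that the expected frequency of a word is the product of the observed frequencies of its individual residues (positions treated as independent); compseq's exact arithmetic may differ, and `word_table` is a hypothetical helper name.

```python
from collections import Counter
from math import prod


def word_table(seq, word):
    """Return {word: (count, observed_freq, expected_freq, obs_over_exp)}.

    Expected frequencies are an assumption-based estimate: the product of
    the single-residue frequencies observed in `seq`.
    """
    seq = seq.upper()
    if len(seq) < word:
        raise ValueError("sequence is shorter than the word size")
    n_words = len(seq) - word + 1
    observed = Counter(seq[i:i + word] for i in range(n_words))
    residue_freq = {r: c / len(seq) for r, c in Counter(seq).items()}
    table = {}
    for w, count in observed.items():
        obs = count / n_words
        exp = prod(residue_freq[r] for r in w)
        table[w] = (count, obs, exp, obs / exp)
    return table


table = word_table("AACC", 2)
```

For the toy sequence "AACC", A and C each have a single-residue frequency of 0.5, so every dinucleotide gets an expected frequency of 0.25 under this assumption.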
CAVEATS
compseq is part of the EMBOSS suite, which may not be pre-installed on all Linux distributions; it typically needs to be installed separately via a package manager.
The command is designed for biological sequences (DNA, RNA, protein); it is not a general-purpose word counter for arbitrary text, where tools like awk or grep are a better fit.
Unless -infile or -calcfreq is used, the expected frequencies assume that every word is equally likely, which is rarely true of real sequences, so the observed/expected ratios should be interpreted with care. A file passed to -infile must have been produced with the same word size and sequence type as the current run.
OVERLAPPING AND SINGLE-FRAME COUNTS
By default compseq moves its window along by one position at a time, so consecutive words overlap: with a word size of 2, the sequence ACGT contributes AC, CG and GT. When -frame is set to a non-zero value, the window instead moves along by the word length, so each base or residue in the chosen frame is counted in exactly one word. This is the mode to use for codon counts in a specific reading frame.
OUTPUT FILE NAMING WITH -AUTO
The -auto option suppresses prompts and accepts default values. If no output file is given, compseq therefore writes its results to the default output file, whose name is derived from the input sequence and ends in .compseq. Use -outfile to choose a different name, or -stdout to send the results to standard output.
HISTORY
compseq is part of EMBOSS (European Molecular Biology Open Software Suite), a freely available collection of bioinformatics applications for molecular biology research, first released in 2000. It has remained part of the suite as EMBOSS has evolved to support a wide range of sequence formats.