LinuxCommandLibrary

nextclade

Analyze viral sequences to track evolution

TLDR

Align sequences to a user-provided reference, outputting the alignment to a file
$ nextclade run [path/to/sequences.fa] [[-r|--input-ref]] [path/to/reference.fa] [[-o|--output-fasta]] [path/to/alignment.fa]

Create a TSV report, auto-downloading the latest dataset
$ nextclade run [path/to/fasta] [[-d|--dataset-name]] [dataset_name] [[-t|--output-tsv]] [path/to/report.tsv]

List all available datasets
$ nextclade dataset list

Download the latest SARS-CoV-2 dataset
$ nextclade dataset get [[-n|--name]] sars-cov-2 [[-o|--output-dir]] [path/to/directory]

Use a downloaded dataset, producing all outputs
$ nextclade run [[-D|--input-dataset]] [path/to/dataset_directory] [[-O|--output-all]] [path/to/output_directory] [path/to/sequences.fasta]

Run on multiple files
$ nextclade run [[-d|--dataset-name]] [dataset_name] [[-t|--output-tsv]] [path/to/output_tsv] -- [path/to/input_fasta_1 path/to/input_fasta_2 ...]

Retry with the reverse complement if a sequence does not align
$ nextclade run --retry-reverse-complement [[-d|--dataset-name]] [dataset_name] [[-t|--output-tsv]] [path/to/output_tsv] [path/to/input_fasta]
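
The multi-file form above writes a single report for all inputs. When one report per input file is preferred, a small loop can do it. The following is a minimal sketch and not part of nextclade itself: the function name run_each and the NEXTCLADE override variable are illustrative assumptions; only the run subcommand and the --input-dataset and --output-tsv options come from the examples above.

```shell
#!/bin/sh
# NEXTCLADE is a hypothetical override so the binary can be swapped or stubbed.
NEXTCLADE="${NEXTCLADE:-nextclade}"

# run_each DATASET_DIR OUT_DIR FILE...
# Runs nextclade once per FASTA file and writes OUT_DIR/<name>.tsv for each.
run_each() {
    dataset_dir=$1
    out_dir=$2
    shift 2
    mkdir -p "$out_dir" || return 1
    for fasta in "$@"; do
        # Derive the report name from the input file name, minus its last extension.
        base=$(basename "$fasta")
        base=${base%.*}
        "$NEXTCLADE" run \
            --input-dataset "$dataset_dir" \
            --output-tsv "$out_dir/$base.tsv" \
            "$fasta" || return 1
    done
}
```

For example, run_each data/sars-cov-2 reports samples/*.fasta would write one TSV per sample into the reports directory, stopping at the first failed run.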

SYNOPSIS

nextclade <SUBCOMMAND> [OPTIONS]

PARAMETERS

run
    Analyze input sequences (the main subcommand)

dataset
    List and download analysis datasets (list, get)

--help, -h
    Print help information

--version, -V
    Print version information

--dataset-name <NAME>
    Dataset name (e.g. 'sars-cov-2')

--dataset-url <URL>
    URL to zip dataset

<INPUT_FASTAS>...
    Input FASTA file(s), passed as positional arguments as in the examples above

--input-ref, -r <PATH>
    Custom reference sequence in FASTA format

--output-tsv <PATH>
    Tab-separated results

--output-csv <PATH>
    Comma-separated results

--output-json <PATH>
    JSON results

--output-fasta <PATH>
    Aligned FASTA output

--output-tree <PATH>
    Reference tree with the input sequences placed on it (Auspice JSON)

--jobs, -j <N>
    Number of CPU threads to use

--include-endpoint-mutations
    Include mutations outside gene regions

--output-basename <STR>
    Base name for output files (used together with --output-all)

DESCRIPTION

Nextclade is a fast command-line tool for analyzing viral genomes, most commonly SARS-CoV-2. It takes input sequences in FASTA format and performs:

Clade assignment (Nextstrain, WHO)
Mutation calling (nucleotide/aa substitutions)
Quality control (scoring missing data, divergences)
Alignment to reference genomes
Pango lineage inference

Using a predefined dataset (reference sequence, gene annotations, reference tree), it writes results as TSV, CSV, JSON, and aligned FASTA, and can output a phylogenetic tree with the input sequences placed on it. It supports multi-threading for high-throughput surveillance.

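Nextclade is part of the Nextstrain ecosystem and is widely used for COVID-19 genomic surveillance. When a dataset is requested by name, the latest version is downloaded automatically, so new variants are covered without manual updates. It can be installed via Conda, Cargo, or prebuilt binaries, and it handles thousands of sequences on standard hardware.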

CAVEATS

Fetching a dataset by name requires internet access; for offline use, download the dataset once and pass it with --input-dataset.
Large inputs may need significant RAM (>8GB recommended).
Primarily optimized for SARS-CoV-2; check dataset compatibility for other viruses.
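
If memory or run time becomes a problem with very large inputs, one possible workaround is to split the FASTA file into smaller chunks and run nextclade on each chunk separately. The sketch below uses only awk; split_fasta and its arguments are hypothetical names, and whether chunking actually helps depends on your nextclade version and data.

```shell
#!/bin/sh
# split_fasta INPUT RECORDS_PER_CHUNK PREFIX
# Writes PREFIX001.fasta, PREFIX002.fasta, ... with at most
# RECORDS_PER_CHUNK sequences each. A record starts at a ">" header line.
split_fasta() {
    input=$1
    per_chunk=$2
    prefix=$3
    # Refuse zero, negative, or non-numeric chunk sizes.
    [ "$per_chunk" -gt 0 ] 2>/dev/null || return 1
    awk -v n="$per_chunk" -v prefix="$prefix" '
        /^>/ {
            # Start a new chunk file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s%03d.fasta", prefix, count / n + 1)
            }
            count++
        }
        # Lines before the first header are skipped.
        out != "" { print > out }
    ' "$input"
}
```

For example, split_fasta all.fasta 5000 chunks/part_ would produce chunks/part_001.fasta and so on, provided the chunks directory already exists. Each chunk can then be passed to nextclade run on its own.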

DATASETS

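A dataset is a prebuilt bundle for one virus: a reference sequence, gene annotations, and a reference tree. Use nextclade dataset list to see what is available, and nextclade dataset get --name sars-cov-2 --output-dir [path/to/directory] to download one.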
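
For repeated or offline runs, it can help to download a dataset only when no local copy exists. This is a minimal sketch under stated assumptions: ensure_dataset and the NEXTCLADE override variable are illustrative names, and a non-empty directory is treated as "already downloaded" without checking that the dataset is complete or current.

```shell
#!/bin/sh
# NEXTCLADE is a hypothetical override so the binary can be swapped or stubbed.
NEXTCLADE="${NEXTCLADE:-nextclade}"

# ensure_dataset NAME DIR
# Downloads dataset NAME into DIR unless DIR already exists and is non-empty.
ensure_dataset() {
    name=$1
    dir=$2
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        # Reuse the existing copy; no network access is needed.
        return 0
    fi
    "$NEXTCLADE" dataset get --name "$name" --output-dir "$dir"
}
```

A typical call would be ensure_dataset sars-cov-2 data/sars-cov-2, followed by nextclade run --input-dataset data/sars-cov-2 with the desired output options. To refresh the dataset, delete the directory and call the function again.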

OUTPUTS

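Core TSV columns include seqName, clade, Nextclade_pango (SARS-CoV-2 datasets), qc.overallScore, qc.overallStatus, substitutions, and aaSubstitutions. Exact names can vary between versions and datasets, so check the header line of your report.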
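
Because the report has many columns, selecting them by header name rather than by position is usually more robust. The following sketch shows one way to do that with awk; tsv_cols is a hypothetical helper, and the column names you pass must match the header line of your own report exactly.

```shell
#!/bin/sh
# tsv_cols FILE COL[,COL...]
# Prints the named columns (header row included) from a tab-separated file.
# Exits with status 1 if a requested column is not in the header.
tsv_cols() {
    awk -F '\t' -v want="$2" '
        BEGIN { OFS = "\t"; n = split(want, names, ",") }
        NR == 1 {
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) idx[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in idx)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Emit the requested columns in the requested order.
            line = $(idx[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(idx[names[j]])
            print line
        }
    ' "$1"
}
```

For example, tsv_cols report.tsv seqName,clade would print just those two columns, assuming both names appear in the header of report.tsv.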

INSTALLATION

Via Conda: conda install -c bioconda nextclade; Cargo: cargo install nextclade; or prebuilt binaries.
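
After installing, it can be useful to confirm that the binary is on PATH before starting a pipeline. A minimal sketch: check_nextclade and the NEXTCLADE override variable are illustrative names, while --version is the option listed under PARAMETERS.

```shell
#!/bin/sh
# NEXTCLADE is a hypothetical override so the binary can be swapped or stubbed.
NEXTCLADE="${NEXTCLADE:-nextclade}"

# check_nextclade
# Prints the installed version, or reports an error and returns 1.
check_nextclade() {
    if command -v "$NEXTCLADE" >/dev/null 2>&1; then
        "$NEXTCLADE" --version
    else
        echo "$NEXTCLADE not found on PATH" >&2
        return 1
    fi
}
```

Calling check_nextclade || exit 1 at the top of a script stops early with a clear message instead of failing partway through an analysis.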

HISTORY

Developed by the Nextstrain team in 2020 for COVID-19 surveillance. The first release coincided with the pandemic, and the tool has evolved alongside the variants (Alpha through Omicron and later). Datasets have been updated weekly since 2021. Version 2.x introduced multi-virus support.

SEE ALSO

nextalign(1), pangolin(1), iqtree(1)
