nextclade
Analyze viral sequences to track evolution
TLDR
Align sequences to a user-provided reference, outputting the alignment to a file
Create a TSV report, auto-downloading the latest dataset
List all available datasets
Download the latest SARS-CoV-2 dataset
Use a downloaded dataset, producing all outputs
Run on multiple files
Try reverse complement if sequence does not align
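The tasks listed above appear without their commands. The following is a sketch of plausible invocations, assuming the Nextclade v3 command-line interface. The function names (nc_align and so on) and all paths are hypothetical placeholders, and every flag should be checked against nextclade run --help and nextclade dataset get --help for the installed version.

```shell
# Hypothetical wrappers, one per task in the list above (assumes Nextclade v3).

# Align to a user-provided reference and write the alignment to a file.
# $1 = reference FASTA, $2 = output alignment, $3 = input FASTA
nc_align() { nextclade run --input-ref "$1" --output-fasta "$2" "$3"; }

# Create a TSV report; the named dataset is downloaded automatically.
# $1 = dataset name, $2 = output TSV, $3 = input FASTA
nc_tsv() { nextclade run --dataset-name "$1" --output-tsv "$2" "$3"; }

# List all available datasets.
nc_list() { nextclade dataset list; }

# Download the latest SARS-CoV-2 dataset into a directory ($1).
nc_get_sc2() { nextclade dataset get --name sars-cov-2 --output-dir "$1"; }

# Use a downloaded dataset and write every output type into one directory.
# $1 = dataset directory, $2 = output directory, $3 = input FASTA
nc_all() { nextclade run --input-dataset "$1" --output-all "$2" "$3"; }

# Run on multiple FASTA files (everything after the first two arguments).
nc_multi() {
  dataset="$1"; out_tsv="$2"; shift 2
  nextclade run --dataset-name "$dataset" --output-tsv "$out_tsv" "$@"
}

# Retry with the reverse complement when a sequence fails to align.
nc_retry() {
  nextclade run --retry-reverse-complement --dataset-name "$1" --output-tsv "$2" "$3"
}
```

For example, nc_tsv sars-cov-2 results.tsv sequences.fasta would write one row per input sequence to results.tsv.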
SYNOPSIS
nextclade <command> [<options>]
Common commands:
nextclade run [<options>] <input_sequences.fasta>
nextclade dataset get --name <dataset_name> --output-dir <path> [<options>]
nextclade --help
nextclade <command> --help
PARAMETERS
<input_sequences.fasta>... (positional argument; --input-fasta <path> or -i <path> in Nextclade v1)
Specifies one or more input FASTA files containing sequences to be analyzed by nextclade run.
--input-dataset <path> or -D <path>
Path to a local Nextclade dataset directory (or ZIP archive) to be used for analysis. To have nextclade run fetch a dataset by name (e.g., sars-cov-2) instead, use --dataset-name <name> or -d <name>.
--output-json <path>
Specifies the path for the output JSON file containing detailed analysis results for each sequence.
--output-tsv <path>
Specifies the path for the output TSV file (tab-separated values) with tabular results.
--output-csv <path>
Specifies the path for the output CSV file (comma-separated values) with tabular results.
--output-fasta <path>
Specifies the path for the output FASTA file containing aligned sequences after processing.
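--output-tree <path>
Specifies the path for the output phylogenetic tree in Auspice JSON format, with the input sequences placed onto the dataset's reference tree. It is only produced when the dataset includes a tree. Nextclade v3 adds a separate --output-tree-nwk option for Newick output.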
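--output-all <path> or -O <path>
Specifies a directory where all generated output files will be saved. (Nextclade v1 used --output-dir for this purpose; in current versions --output-dir belongs to nextclade dataset get.)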
--jobs <number> or -j <number>
Specifies the number of parallel jobs (CPU cores) to utilize for processing, speeding up analysis for large inputs.
--verbose or -v
Enables verbose output, providing more detailed progress information during execution.
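--name <name> or -n <name>
(For nextclade dataset get) Specifies the name of the dataset to download (e.g., sars-cov-2). Valid names are shown by nextclade dataset list. In nextclade run, the corresponding option is --dataset-name <name> or -d <name>.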
--tag <tag> or -t <tag>
(For nextclade dataset get) Specifies a particular version tag for the dataset to download, allowing for reproducible analysis. Without it, the latest version is downloaded.
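To show how the output options above combine in a single run, here is a minimal sketch. The helper name run_with_reports, the output file suffixes, and all paths are hypothetical. It assumes a dataset directory that has already been downloaded.

```shell
# Hypothetical helper: run one analysis and write three report formats.
# $1 = dataset directory, $2 = output prefix, $3 = input FASTA
run_with_reports() {
  dataset_dir="$1"; prefix="$2"; input="$3"
  nextclade run \
    --input-dataset "$dataset_dir" \
    --output-tsv "${prefix}.tsv" \
    --output-json "${prefix}.json" \
    --output-fasta "${prefix}.aligned.fasta" \
    --verbose \
    "$input"
}
```

Calling run_with_reports data/sars-cov-2 out/run1 sequences.fasta would request out/run1.tsv, out/run1.json, and out/run1.aligned.fasta. Each output flag is independent, so any of them can be dropped.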
DESCRIPTION
nextclade is a bioinformatics tool designed for rapid, robust, and reproducible analysis of viral genome sequences. Developed as part of the Nextstrain project, it performs tasks such as phylogenetic placement, clade assignment, mutation calling, and quality control on input sequences.
It aligns sequences against the reference sequence of a user-specified dataset (which bundles a reference sequence, gene map, phylogenetic tree, and quality control configuration) and reports mutations, amino acid changes, deletions, insertions, and potential quality issues. These reports support tracking the evolution and spread of rapidly evolving pathogens in public health surveillance and research.
CAVEATS
Requires a compatible dataset: nextclade heavily relies on pre-built datasets (reference sequence, gene map, phylogenetic tree, QC parameters). These must be obtained (e.g., via nextclade dataset get) and specifically match the target pathogen for accurate analysis.
Performance: Can be memory and CPU intensive for very large input files or complex datasets, although it supports parallel processing with the --jobs option.
Data Specificity: Designed and optimized for viral sequences, because the analysis is reference-based and depends on pathogen-specific datasets. Bacterial or eukaryotic genomes are a poor fit unless a custom dataset is prepared.
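The performance caveat above can be handled by bounding the thread count explicitly. The following is a sketch only: clamp_jobs and run_bounded are hypothetical helper names, the default cap of 4 is an arbitrary assumption, and getconf _NPROCESSORS_ONLN is assumed to be available (it is on Linux and macOS).

```shell
# Hypothetical helper: return the smaller of a CPU count and a cap.
# $1 = detected CPU count, $2 = cap
clamp_jobs() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Run with a bounded number of threads so other work on the machine is not starved.
# $1 = dataset name, $2 = output TSV, $3 = input FASTA, $4 = cap (default 4)
run_bounded() {
  cpus="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  n_jobs="$(clamp_jobs "$cpus" "${4:-4}")"
  nextclade run --dataset-name "$1" --output-tsv "$2" --jobs "$n_jobs" "$3"
}
```

On a shared server, run_bounded sars-cov-2 out.tsv big.fasta 8 would use at most eight threads even if more cores are present.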
DATASET STRUCTURE
A Nextclade dataset packages everything needed to analyze one pathogen: a reference sequence, gene annotations, a phylogenetic tree, quality control rules, and (in some dataset versions) PCR primer definitions. Users usually download datasets with the nextclade dataset get command.
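As a sketch of obtaining a dataset reproducibly, the helper below downloads a named dataset and pins it to a version tag when one is given. The name fetch_dataset is hypothetical, and any tag value passed to it must be a real tag for that dataset. The file names inside the resulting directory differ between Nextclade major versions, so they are not assumed here.

```shell
# Hypothetical helper: download a dataset, optionally pinned to a version tag.
# $1 = dataset name, $2 = output directory, $3 = tag (optional)
fetch_dataset() {
  name="$1"; dir="$2"; tag="${3:-}"
  if [ -n "$tag" ]; then
    # Pinned download: the same tag yields the same dataset contents later.
    nextclade dataset get --name "$name" --tag "$tag" --output-dir "$dir"
  else
    # Unpinned download: fetches whatever version is latest today.
    nextclade dataset get --name "$name" --output-dir "$dir"
  fi
}
```

After fetch_dataset sars-cov-2 data/sars-cov-2, listing the directory with ls data/sars-cov-2 shows the packaged files, and the directory can be passed to nextclade run via --input-dataset.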
OUTPUT FORMATS
The tool produces a variety of output formats, including detailed JSON reports (ideal for programmatic access), TSV/CSV for spreadsheet analysis, and FASTA for aligned sequences, facilitating integration into diverse bioinformatics workflows.
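Because the TSV report is plain tab-separated text, it can be summarized with standard tools. The sketch below counts sequences per clade. It assumes the report has a header row containing a column named clade, which holds for datasets that define clades. The function name clade_counts is hypothetical.

```shell
# Hypothetical helper: count sequences per clade in a Nextclade TSV report.
# $1 = path to the TSV file
clade_counts() {
  awk -F '\t' '
    # Header row: find the position of the "clade" column by name.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "clade") col = i; next }
    # Data rows: tally the clade value, but only if the column was found.
    col { counts[$col]++ }
    END { for (c in counts) printf "%s\t%d\n", c, counts[c] }
  ' "$1" | LC_ALL=C sort
}
```

Looking the column up by name rather than by position keeps the helper working if other columns are added or reordered between versions.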
HISTORY
nextclade emerged from the Nextstrain project, an open-source initiative providing real-time tracking of pathogen evolution. It was developed to streamline the process of assigning viral sequences to clades, identifying mutations, and performing quality control, complementing the broader phylogenetic analysis capabilities of Nextstrain. Its development has been driven by the need for rapid and consistent genomic surveillance, particularly evident during public health crises like the COVID-19 pandemic.
SEE ALSO
nextstrain(1), minimap2(1), mafft(1), iqtree(1)