nextclade

Analyze viral sequences to track evolution

TLDR

Align sequences to a user-provided reference, outputting the alignment to a file
$ nextclade run [path/to/sequences.fa] [[-r|--input-ref]] [path/to/reference.fa] [[-o|--output-fasta]] [path/to/alignment.fa]

Create a TSV report, auto-downloading the latest dataset
$ nextclade run [path/to/fasta] [[-d|--dataset-name]] [dataset_name] [[-t|--output-tsv]] [path/to/report.tsv]

List all available datasets
$ nextclade dataset list

Download the latest SARS-CoV-2 dataset
$ nextclade dataset get [[-n|--name]] sars-cov-2 [[-o|--output-dir]] [path/to/directory]

Use a downloaded dataset, producing all outputs
$ nextclade run [[-D|--input-dataset]] [path/to/dataset_dir] [[-O|--output-all]] [path/to/output_dir] [path/to/sequences.fasta]

Run on multiple files
$ nextclade run [[-d|--dataset-name]] [dataset_name] [[-t|--output-tsv]] [path/to/output_tsv] -- [path/to/input_fasta_1 path/to/input_fasta_2 ...]

Retry with the reverse complement if a sequence does not align
$ nextclade run --retry-reverse-complement [[-d|--dataset-name]] [dataset_name] [[-t|--output-tsv]] [path/to/output_tsv] [path/to/input_fasta]

SYNOPSIS

nextclade <command> [<options>]

Common commands:
nextclade run [<options>] <input_sequences.fasta>
nextclade dataset get [<options>] --name <dataset_name> --output-dir <path>
nextclade --help
nextclade <command> --help

PARAMETERS

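<input_fasta>...
    One or more input FASTA files containing the sequences to be analyzed by nextclade run, given as positional arguments. Older 1.x releases used --input-fasta <path> instead.

--input-dataset <path> or -D <path>
    Path to a local Nextclade dataset directory, such as one downloaded with nextclade dataset get.

--dataset-name <name> or -d <name>
    Name of a dataset (e.g., sars-cov-2) to download automatically and use for the analysis.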

--output-json <path>
    Specifies the path for the output JSON file containing detailed analysis results for each sequence.

--output-tsv <path> or -t <path>
    Specifies the path for the output TSV file (tab-separated values) with tabular results.

--output-csv <path>
    Specifies the path for the output CSV file (comma-separated values) with tabular results.

--output-fasta <path> or -o <path>
    Specifies the path for the output FASTA file containing aligned sequences after processing.

--output-tree <path>
    Specifies the path for the output phylogenetic tree in Auspice JSON format, with the input sequences placed onto the dataset's reference tree.

--output-all <path> or -O <path>
    Specifies a directory where all generated output files will be saved.

--jobs <number> or -j <number>
    Specifies the number of parallel worker threads to use, which speeds up analysis of large inputs.

--verbose or -v
    Enables verbose output, providing more detailed progress information during execution.

--name <name> or -n <name>
    (For nextclade dataset get) Specifies the name of the dataset to download (e.g., sars-cov-2).

--tag <tag>
    (For nextclade dataset get) Specifies a particular version tag of the dataset to download, allowing for reproducible analysis.

--output-dir <path> or -o <path>
    (For nextclade dataset get) Specifies the directory where the downloaded dataset is saved.
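
As a rough sketch of how these options combine, the following script downloads a dataset once and then analyzes a FASTA file against the local copy. The run_cmd and analyze helpers, the DRY_RUN switch, and all paths are illustrative assumptions and not part of nextclade itself; only the flags are taken from the examples above.

```shell
#!/usr/bin/env bash
# Sketch: fetch a dataset, then run nextclade against the local copy.
# run_cmd, analyze, DRY_RUN and every path below are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# analyze <dataset_name> <dataset_dir> <sequences.fasta> <out_dir>
analyze() {
    local dataset_name="$1" dataset_dir="$2" sequences="$3" out_dir="$4"
    run_cmd nextclade dataset get --name "$dataset_name" --output-dir "$dataset_dir" &&
    run_cmd nextclade run \
        --input-dataset "$dataset_dir" \
        --output-tsv "$out_dir/report.tsv" \
        --output-fasta "$out_dir/aligned.fasta" \
        "$sequences"
}

# Preview the two commands without invoking nextclade.
DRY_RUN=1 analyze sars-cov-2 data/sars-cov-2 sequences.fasta results
```

Dropping DRY_RUN=1 from the last line would execute the commands for real, which assumes nextclade is installed and sequences.fasta exists.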

DESCRIPTION

nextclade is a bioinformatics tool designed for rapid, robust, and reproducible analysis of viral genome sequences. Developed as part of the Nextstrain project, it performs tasks such as phylogenetic placement, clade assignment, mutation calling, and quality control on input sequences.

It aligns sequences against a user-specified reference dataset (which includes a reference sequence, gene map, phylogenetic tree, and quality control configuration) and outputs detailed reports on mutations, amino acid changes, deletions, insertions, and potential quality issues. This tool is crucial for tracking the evolution and spread of rapidly evolving pathogens, providing insights for public health surveillance and research.

CAVEATS

Requires a compatible dataset: nextclade relies heavily on pre-built datasets (reference sequence, gene map, phylogenetic tree, QC parameters). These must be obtained (e.g., via nextclade dataset get, or automatically with --dataset-name) and must match the target pathogen for accurate analysis.

Performance: Can be memory and CPU intensive for very large input files or complex datasets, although it supports parallel processing with the --jobs option.
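
One way to choose a --jobs value is to use the number of online cores, capped at a limit so the machine stays responsive. This is only a sketch: the pick_jobs helper, the cap of 4, and the file names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: derive a --jobs value from the online core count, with an upper cap.
# pick_jobs is a hypothetical helper, not a nextclade feature.

# pick_jobs <limit>: print min(online cores, limit); fall back to 1 core.
pick_jobs() {
    local limit="$1" cores
    cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    if [ "$cores" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$cores"
    fi
}

jobs_n="$(pick_jobs 4)"
echo "using --jobs $jobs_n"

# Only run when nextclade and the (hypothetical) input file are present.
if command -v nextclade >/dev/null 2>&1 && [ -f sequences.fasta ]; then
    nextclade run --jobs "$jobs_n" --dataset-name sars-cov-2 \
        --output-tsv report.tsv sequences.fasta
fi
```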

Data specificity: Designed and optimized for viral sequences and reference-based analysis. It is less suitable for bacterial or eukaryotic genomes unless a custom dataset is prepared.

DATASET STRUCTURE

nextclade datasets are critical for its operation. They package a reference sequence, gene annotations, a phylogenetic tree, quality control rules, and primerset definitions, enabling specialized analysis for specific pathogens. Users often download these using the nextclade dataset get command.

OUTPUT FORMATS

The tool produces a variety of output formats, including detailed JSON reports (ideal for programmatic access), TSV/CSV for spreadsheet analysis, and FASTA for aligned sequences, facilitating integration into diverse bioinformatics workflows.
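
The TSV report lends itself to standard text tools. The sketch below counts sequences per clade with awk, locating the column by its header name so that column order does not matter. It assumes the report has a column named clade; the clade_counts helper and the sample rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: summarize a Nextclade TSV report by clade.
# Assumes a header row containing a "clade" column; clade_counts is hypothetical.

# clade_counts <report.tsv>: print "<clade><TAB><count>" lines, sorted by clade.
clade_counts() {
    awk -F '\t' '
        NR == 1 {
            # Find the clade column from the header row.
            for (i = 1; i <= NF; i++) if ($i == "clade") col = i
            if (!col) { print "no clade column found" > "/dev/stderr"; exit 1 }
            next
        }
        { counts[$col]++ }
        END { for (c in counts) printf "%s\t%d\n", c, counts[c] }
    ' "$1" | sort
}

# Demo on a tiny hand-written report (not real nextclade output).
demo="$(mktemp)"
printf 'seqName\tclade\tqc.overallStatus\n' > "$demo"
printf 'seq1\t21K\tgood\nseq2\t21L\tgood\nseq3\t21K\tbad\n' >> "$demo"
clade_counts "$demo"
rm -f "$demo"
```

On a real report, the same function would be called as clade_counts path/to/report.tsv.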

HISTORY

nextclade emerged from the Nextstrain project, an open-source initiative providing real-time tracking of pathogen evolution. It was developed to streamline the process of assigning viral sequences to clades, identifying mutations, and performing quality control, complementing the broader phylogenetic analysis capabilities of Nextstrain. Its development has been driven by the need for rapid and consistent genomic surveillance, particularly evident during public health crises like the COVID-19 pandemic.

SEE ALSO

nextstrain(1), minimap2(1), mafft(1), iqtree(1)
