pangolin

Assign SARS-CoV-2 genome sequences to Pango lineages

TLDR

Run pangolin on the specified FASTA file

$ pangolin [path/to/file.fa]

Use the specified analysis engine

$ pangolin --analysis-mode [accurate|fast|pangolearn|usher] [path/to/file.fa]

PARAMETERS

--help, -h
    Display a help message and exit.

--version, -v
    Show program's version number and exit.

--datadir <path>
    Specify the path to the Pangolin data directory. This directory contains the lineage assignment data, such as the phylogenetic tree and lineage definitions.

--threads <num>
    Set the number of CPU threads to use for parallel processing during analysis, optimizing performance for large datasets.

--outfile <file>
    Specify the name and path for the output CSV file, where lineage assignments and associated metadata will be written.

--usher
    Use UShER (Ultrafast Sample placement on Existing tRees) for lineage assignment, placing each sequence on a global phylogenetic tree. The same engine can be selected with --analysis-mode usher.

--analysis-mode <mode>
    Choose the analysis mode (accurate, fast, pangolearn, or usher), which selects the assignment engine and so affects the accuracy and speed of lineage assignment.

--max-ambig <val>
    Set the maximum allowed proportion of ambiguous bases (N's) in input sequences, as a value between 0 and 1; pangolin does not attempt assignment for sequences exceeding this threshold.

--min-length <len>
    Specify the minimum required genome length for sequences to be processed; shorter sequences will be ignored.

--verbose
    Enable verbose output, displaying more detailed information and progress messages during the execution of the command.
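
The effect of --min-length and --max-ambig can be approximated outside pangolin before a run. The following Python sketch is not pangolin's own implementation: the function names read_fasta and qc_status are hypothetical, the ambiguity threshold is assumed to be a fraction between 0 and 1, and the thresholds in the demo are chosen only to suit the toy sequences.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA file-like object."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def qc_status(seq: str, max_ambig: float, min_length: int) -> str:
    """Classify a sequence the way a length / N-content pre-filter might.

    Assumption: max_ambig is a fraction (0 to 1) of N characters.
    """
    if not seq or len(seq) < min_length:
        return "too_short"
    n_fraction = seq.upper().count("N") / len(seq)
    if n_fraction > max_ambig:
        return "too_many_n"
    return "pass"


# Toy input; real genomes are far longer, so thresholds here are illustrative.
demo = io.StringIO(">a\nACGT\nNNNN\n>b\nAC\n>c\nACGTACGT\n")
for record_name, sequence in read_fasta(demo):
    print(record_name, qc_status(sequence, max_ambig=0.3, min_length=4))
```

Running such a check first makes it easier to explain why some sequences receive no lineage call in the final report.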

DESCRIPTION

Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages) is a bioinformatics command-line tool used in public health genomic surveillance. It assigns SARS-CoV-2 genome sequences to Pango lineages, for example B.1.1.7 (Alpha), B.1.617.2 (Delta), and BA.1 (Omicron). Developed within the COG-UK consortium, Pangolin takes assembled genome sequences in FASTA format and compares them against a regularly updated phylogenetic tree and lineage database. Its primary output is a CSV file that lists the assigned lineage for each input sequence together with conflict and ambiguity scores, quality control status, and version information. Because lineages are inferred from characteristic mutations, researchers and public health officials can use the results to track the evolution, spread, and geographical distribution of SARS-CoV-2 variants, which informs vaccine development and public health interventions.
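
Because the report is plain CSV, it can be summarized with a few lines of Python. The sketch below assumes the report has a header row with a column named lineage (an assumption about the report layout, to be checked against your own output). count_lineages is a hypothetical helper, and the sample rows are made up for illustration, not real results.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_lineages(handle: TextIO, lineage_column: str = "lineage") -> Counter:
    """Count how many sequences were assigned to each lineage.

    Assumption: the CSV has a header row that includes `lineage_column`.
    """
    reader = csv.DictReader(handle)
    return Counter(row[lineage_column] for row in reader)


# Made-up sample rows standing in for a real report file.
sample_report = io.StringIO(
    "taxon,lineage\n"
    "sample1,B.1.1.7\n"
    "sample2,BA.1\n"
    "sample3,B.1.1.7\n"
)
for lineage, count in count_lineages(sample_report).most_common():
    print(lineage, count)
```

For a real run, open the file written by --outfile and pass the file object in place of sample_report.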

CAVEATS

Pangolin is a specialized bioinformatics tool and not part of the standard Linux core utilities. It is typically installed with conda (from the bioconda channel) or from its source repository, and it relies on a Python environment with specific bioinformatics dependencies. Its lineage data needs regular updates to stay accurate, which can consume significant disk space and bandwidth. Run time depends on the number of input sequences and on available system resources. Users should also record the versions of Pangolin and its data used for a run, as outdated versions may give less accurate or incomplete lineage assignments.

DATA UPDATES

Pangolin's accuracy is heavily reliant on an up-to-date database of SARS-CoV-2 lineages and their defining mutations. Users must periodically update this data to ensure the tool can accurately assign newly emerging variants. This is typically done using a dedicated command, such as pangolin --update-data, which downloads the latest lineage information from the developers' servers.
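
When pangolin is driven from a script, the update step can be chained before the analysis. The following Python sketch uses only flags listed on this page. build_commands and run_pangolin are hypothetical helper names, and the file paths are placeholders.

```python
import shutil
import subprocess
from typing import List, Optional


def build_commands(
    fasta: str,
    outfile: Optional[str] = None,
    threads: Optional[int] = None,
    update_first: bool = True,
) -> List[List[str]]:
    """Return the argv lists for an optional data update followed by a run."""
    commands: List[List[str]] = []
    if update_first:
        commands.append(["pangolin", "--update-data"])
    run = ["pangolin", fasta]
    if outfile is not None:
        run += ["--outfile", outfile]
    if threads is not None:
        run += ["--threads", str(threads)]
    commands.append(run)
    return commands


def run_pangolin(commands: List[List[str]]) -> None:
    """Execute the commands in order, stopping at the first failure."""
    if shutil.which("pangolin") is None:
        raise RuntimeError("pangolin was not found on PATH")
    for argv in commands:
        subprocess.run(argv, check=True)


# Show the planned commands without executing them.
for argv in build_commands("path/to/file.fa", outfile="report.csv", threads=4):
    print(" ".join(argv))
```

Keeping command construction separate from execution makes the plan easy to log or review before anything is downloaded or run.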

INSTALLATION AND ENVIRONMENT

As a Python-based application, Pangolin is commonly installed with conda from the bioconda channel, or from a clone of its GitHub repository by creating the provided conda environment and running pip install . inside it. A dedicated environment is recommended to manage its dependencies and avoid conflicts with other Python projects. Pangolin also depends on external tools and data packages, so all required dependencies must be present for it to run.

HISTORY

Pangolin was developed by Áine O’Toole and colleagues at the University of Edinburgh, as part of the COG-UK (COVID-19 Genomics UK Consortium) initiative. It was first released in early 2020 to provide rapid lineage assignments during the early stages of the SARS-CoV-2 pandemic, in response to the need for a standardized, high-throughput method to classify viral genomes and track the emergence and spread of new variants. Since then, Pangolin has been updated continuously, with its underlying data and algorithms refined to incorporate new lineages and improve accuracy, and it is widely used in COVID-19 genomic epidemiology.