
mashtree

Build a phylogenetic tree from genome sequence files using Mash distances

TLDR

Fastest method: create a tree from FASTQ and/or FASTA files using multiple threads, redirecting the output to a Newick file

$ mashtree --numcpus [12] [*.fastq.gz] [*.fasta] > [mashtree.dnd]

Most accurate method: create a tree from FASTQ and/or FASTA files using multiple threads, redirecting the output to a Newick file

$ mashtree --mindepth [0] --numcpus [12] [*.fastq.gz] [*.fasta] > [mashtree.dnd]

Most accurate method with confidence values: create a bootstrapped tree (options for mashtree itself must be placed to the right of the --)

$ mashtree_bootstrap.pl --reps [100] --numcpus [12] [*.fastq.gz] -- --min-depth [0] > [mashtree.bootstrap.dnd]

SYNOPSIS

mashtree [options] <*.fastq> <*.fasta> <*.gbk> <*.msh> > tree.dnd
mashtree_bootstrap.pl [options] <input files> -- [mashtree options] > tree.dnd

PARAMETERS

-h, --help
    Display help information and exit.

--version
    Display the version and exit.

--numcpus N
    Number of threads to use (default: 1).

--outtree FILE
    Write the resulting Newick tree to FILE instead of standard output.

--outmatrix FILE
    Also write the pairwise distance matrix to FILE in tab-delimited format.

--file-of-files
    Treat each input file as a list of filenames, one per line.

--tempdir DIR
    Use DIR for temporary files. A directory given here is kept after the run and can be reused as a cache.

--sort-order ORDER
    Order in which samples are passed to neighbor-joining: ABC (alphabetical), random, or input-order.

--genomesize N
    Estimated genome size passed to Mash (default: 5000000).

--mindepth N
    Minimum k-mer depth when sketching reads (default: 5). With 0, a suitable depth is chosen automatically, which is slower but more accurate.

--kmerlength N
    k-mer length used for sketching (default: 21).

--sketch-size N
    Number of hashes per sketch (default: 10000).

--save-sketches DIR
    Save the Mash sketches in DIR.

DESCRIPTION

mashtree is a standalone tool built on top of Mash. It constructs a phylogenetic tree from pairwise genomic distances estimated with the Mash MinHash algorithm. It accepts raw reads (FASTQ), assemblies (FASTA or GenBank), and existing Mash sketches (.msh) as input, and the sequence files may be gzipped. mashtree runs mash sketch and mash dist itself, then infers the tree by neighbor-joining using QuickTree. Because no multiple sequence alignment is needed, it remains fast on large genomic datasets. The output is a tree in Newick format, suitable for visualization with standard tree viewers. It is best suited to rapid initial exploration of evolutionary relationships among many genomes.

CAVEATS

mashtree builds its trees from Mash distances, which are MinHash-based approximations. Neighbor-joining on these distances is fast, but it can be less accurate than alignment-based methods, particularly with highly divergent sequences or under complex evolutionary models. For robust phylogenetic inference, alignment-based methods may be preferable. The `mash` and `quicktree` executables must be installed and available on the PATH.
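
Since mashtree calls external programs, it can help to confirm they are on the PATH before starting a long run. The following is a minimal sketch, assuming the executables are named `mashtree`, `mash`, and `quicktree` on your system; `missing_tools` is a hypothetical helper, not part of mashtree.

```shell
missing_tools() {
  # Print the name of each given program that cannot be found on PATH.
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Executable names are assumptions; adjust them to your installation.
missing=$(missing_tools mashtree mash quicktree)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing >&2
else
  echo "All required programs were found."
fi
```

The helper only reports whether a name resolves; it does not check versions.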

TYPICAL WORKFLOW

mashtree runs `mash sketch` and `mash dist` internally, so a typical workflow is a single command on the sequence files, optionally keeping the distance matrix for later inspection. For example:
mashtree --numcpus 4 --outmatrix distances.tsv genome*.fasta > tree.dnd
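
When the inputs are spread over subdirectories or are too numerous for a shell glob, the list of files can be written out first and passed with mashtree's `--file-of-files` option, which reads one filename per line. The sketch below only prepares such a list; the directory layout and the `make_input_list` helper are hypothetical, and the placeholder files are created just so the example runs on its own.

```shell
make_input_list() {
  # Write every *.fasta path under directory $1 to file $2, one per line, sorted.
  find "$1" -type f -name '*.fasta' | sort > "$2"
}

# Hypothetical layout with empty placeholder files, for illustration only.
workdir=$(mktemp -d)
mkdir -p "$workdir/genomes/batch1" "$workdir/genomes/batch2"
touch "$workdir/genomes/batch1/a.fasta" \
      "$workdir/genomes/batch2/b.fasta" \
      "$workdir/genomes/notes.txt"

make_input_list "$workdir/genomes" "$workdir/inputs.txt"
cat "$workdir/inputs.txt"

# With real sequence files, the list would then be used like this:
#   mashtree --numcpus 4 --file-of-files inputs.txt > tree.dnd
```

Sorting the list keeps the input order reproducible between runs.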

OUTPUT FORMAT

The tree is output in the standard Newick format, a widely adopted text-based format for representing phylogenetic trees. Newick files can be visualized and further manipulated using various dedicated phylogenetic tree viewers such as FigTree, iTOL, Dendroscope, or ETE Toolkit.
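
Before loading a tree into a viewer, a quick structural check can catch truncated output. The following is a rough sketch that assumes a single tree per file and labels without commas, parentheses, or quoting; the sample names and the `newick_summary` helper are hypothetical.

```shell
newick_summary() {
  # Report parenthesis counts, an estimated leaf count (commas + 1),
  # and whether the tree ends with the required semicolon.
  tree=$(tr -d '\n' < "$1")
  opens=$(printf '%s' "$tree" | tr -cd '(' | wc -c | tr -d ' ')
  closes=$(printf '%s' "$tree" | tr -cd ')' | wc -c | tr -d ' ')
  commas=$(printf '%s' "$tree" | tr -cd ',' | wc -c | tr -d ' ')
  case "$tree" in
    *';') terminated=yes ;;
    *)    terminated=no ;;
  esac
  echo "open=$opens close=$closes leaves=$((commas + 1)) terminated=$terminated"
}

# Hypothetical example tree; a real one would come from mashtree's output.
tmpdir=$(mktemp -d)
cat > "$tmpdir/example.dnd" <<'EOF'
((sampleA:0.001,sampleB:0.002):0.0005,sampleC:0.004,sampleD:0.003);
EOF

newick_summary "$tmpdir/example.dnd"
```

Unequal `open` and `close` counts or `terminated=no` suggest the file was cut off. This is not a full Newick parser.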

HISTORY

mashtree was developed by Lee Katz and colleagues and described in 2019 in the Journal of Open Source Software ("Mashtree: a rapid comparison of whole genome sequence files"). It builds on Mash, which was introduced in 2016 (Ondov et al., Genome Biology) to address rapid genomic distance estimation for massive sequence datasets. mashtree uses those distances to produce trees quickly, enabling exploration of phylogenetic structure without the computational cost of alignment-based methods.

SEE ALSO

mash(1), quicktree(1)
