mashtree
Create a tree from genome sequence files using Mash distances and neighbor joining
TLDR
Fastest method: create a tree from fastq and/or fasta files using multiple threads, redirecting the output to a Newick file
Most accurate method: create a tree from fastq and/or fasta files using multiple threads, redirecting the output to a Newick file
Most accurate method with confidence values: create a bootstrapped tree (note that any options for mashtree itself have to be on the right side of the --)
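The commands for the three entries above are missing from this page. The sketch below wraps one plausible invocation of each in a small shell function. The flags `--numcpus`, `--mindepth` and the wrapper `mashtree_bootstrap.pl --reps` follow mashtree's own usage text; the thread count (12), the replicate count (100), the function names and the output file names are assumptions chosen for illustration.

```shell
#!/bin/sh
# "$@" stands for your input files, e.g. *.fastq.gz *.fasta
# Thread count, replicate count and output names are illustrative assumptions.

# Fastest: default minimum k-mer depth.
mashtree_fast() {
    mashtree --numcpus 12 "$@" > mashtree.dnd
}

# Most accurate: --mindepth 0 lets mashtree pick the depth cutoff itself (slower).
mashtree_accurate() {
    mashtree --mindepth 0 --numcpus 12 "$@" > mashtree.dnd
}

# Most accurate with confidence values: options for mashtree itself
# go to the right of the "--" separator.
mashtree_with_support() {
    mashtree_bootstrap.pl --reps 100 --numcpus 12 "$@" -- --mindepth 0 > mashtree.bootstrap.dnd
}
```

For example, `mashtree_fast reads/*.fastq.gz` would write the tree to `mashtree.dnd` in the current directory.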
SYNOPSIS
mashtree [options] *.fastq *.fasta *.gbk *.msh > tree.dnd
mashtree_bootstrap.pl [options] <input files> -- [mashtree options] > tree.dnd
PARAMETERS
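--help
Display help information and exit.
--version
Display the version and exit.
--numcpus N
Number of threads to use (default: 1).
--outtree FILE
Write the resulting Newick tree to FILE instead of standard output.
--outmatrix FILE
Also write the pairwise distance matrix to FILE in tab-delimited format.
--tempdir DIR
Use DIR for temporary files. A directory given here is kept after the run and can be reused to cache results; otherwise a temporary directory is created and removed automatically.
--mindepth N
Minimum k-mer abundance when sketching raw reads (default: 5). With 0, mashtree chooses a cutoff for each sample itself, which is slower but discards low-abundance k-mers more carefully.
--genomesize N
Estimated genome size used when sketching (default: 5000000).
--kmerlength N
k-mer length used when sketching (default: 21).
--sketch-size N
Number of hashes per sketch (default: 10000).
--save-sketches DIR
Save the Mash sketch files (.msh) to DIR.
Bootstrap support values are not a mashtree option. They come from the wrapper script mashtree_bootstrap.pl, whose --reps N option sets the number of replicates.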
DESCRIPTION
The mashtree command is a standalone tool built on top of Mash. It constructs a tree from pairwise genomic distances estimated with the Mash MinHash algorithm. Input files can be raw reads (fastq), assemblies (fasta, GenBank, or EMBL), or existing Mash sketches (.msh), optionally gzipped. For each input, mashtree runs `mash sketch`, computes pairwise distances with `mash dist`, and then infers a neighbor-joining tree from the distance matrix using QuickTree. Because no multiple sequence alignment is needed, it scales to large collections of genomes where alignment-based methods would be too slow. The output is a tree in Newick format, suitable for visualization with standard tree viewers. It is best suited to rapid initial exploration of relationships among many genomes.
CAVEATS
mashtree relies on Mash distances, which are approximations based on MinHash sketches. Neighbor-joining trees built from these distances are fast to compute but can be less accurate than alignment-based methods, particularly under complex evolutionary models or with highly divergent sequences. For robust phylogenetic inference, prefer alignment-based methods. mashtree requires the `mash` and `quicktree` executables to be installed and available in the system PATH.
TYPICAL WORKFLOW
mashtree wraps the whole workflow in one command: it sketches each input with `mash sketch`, computes pairwise distances with `mash dist`, and builds a neighbor-joining tree from the resulting matrix. There is no need to run Mash by hand first. For example: `mashtree --numcpus 4 genome*.fna > tree.dnd`
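When both the tree and the underlying distances are needed, a run can keep them side by side. The following is a minimal sketch, assuming the `--outmatrix` and `--outtree` options from mashtree's usage text; the function name `run_mashtree`, the thread count and the output file names are hypothetical.

```shell
#!/bin/sh
# Usage: run_mashtree OUTDIR INPUT_FILE...
# Writes OUTDIR/tree.dnd (Newick) and OUTDIR/distances.tsv (distance matrix).
# Thread count and file names are illustrative assumptions.
run_mashtree() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    mashtree --numcpus 4 \
        --outmatrix "$outdir/distances.tsv" \
        --outtree "$outdir/tree.dnd" \
        "$@"
}
```

For example, `run_mashtree results genome*.fna` would place both files in `results/`.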
OUTPUT FORMAT
The tree is output in the standard Newick format, a widely adopted text-based format for representing phylogenetic trees. Newick files can be visualized and further manipulated using various dedicated phylogenetic tree viewers such as FigTree, iTOL, Dendroscope, or ETE Toolkit.
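Before loading a tree into a viewer, a cheap sanity check can catch truncated output. The helper below is a hypothetical sketch, not part of mashtree: it only checks that the text ends with a semicolon and that parentheses are balanced in number, so it does not validate nesting order or labels that contain quoted parentheses.

```shell
#!/bin/sh
# Return 0 if the string looks like a Newick tree:
# it ends with ';' and has as many '(' as ')'.
newick_looks_valid() {
    tree=$1
    case $tree in
        *";") ;;
        *) return 1 ;;
    esac
    # Arithmetic expansion strips any padding that wc adds on some systems.
    opens=$(( $(printf '%s' "$tree" | tr -cd '(' | wc -c) ))
    closes=$(( $(printf '%s' "$tree" | tr -cd ')' | wc -c) ))
    [ "$opens" -eq "$closes" ]
}
```

For example, `newick_looks_valid "$(cat tree.dnd)" || echo "tree.dnd looks truncated" >&2`.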
HISTORY
mashtree is a separate tool built on top of the Mash software package. Mash was first introduced in 2016 (Ondov et al., Genome Biology) to address rapid genomic distance estimation for massive sequence datasets. mashtree was subsequently developed by Lee Katz and colleagues at the Centers for Disease Control and Prevention and described in the Journal of Open Source Software in 2019 (Katz et al.). It turns Mash distances into trees quickly, allowing exploration of relationships among genomes without the computational burden of traditional alignment-based methods.
SEE ALSO
mash(1), quicktree(1)