Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Jun 20;17(1):132.
doi: 10.1186/s13059-016-0997-x.

Mash: fast genome and metagenome distance estimation using MinHash

Affiliations

Mash: fast genome and metagenome distance estimation using MinHash

Brian D Ondov et al. Genome Biol. .

Abstract

Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license ( https://github.com/marbl/mash ).

Keywords: Alignment; Comparative genomics; Genomic distance; Metagenomics; Nanopore; Sequencing.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
Overview of the MinHash bottom sketch strategy for estimating the Jaccard index. First, the sequences of two datasets are decomposed into their constituent k-mers (top, blue and red) and each k-mer is passed through a hash function h to obtain a 32- or 64-bit hash, depending on the input k-mer size. The resulting hash sets, A and B, contain |A| and |B| distinct hashes each (small circles). The Jaccard index is simply the fraction of shared hashes (purple) out of all distinct hashes in A and B. This can be approximated by considering a much smaller random sample from the union of A and B. MinHash sketches S(A) and S(B) of size s = 5 are shown for A and B, comprising the five smallest hash values for each (filled circles). Merging S(A) and S(B) to recover the five smallest hash values overall for AB (crossed circles) yields S(AB). Because S(AB) is a random sample of AB, the fraction of elements in S(AB) that are shared by both S(A) and S(B) is an unbiased estimate of J(A,B)
Fig. 2
Fig. 2
Scatterplots illustrating the relationship between ANI and Mash distance for a collection of Escherichia genomes. Each plot column shows a different sketch size s and each plot row a different k-mer size k. Gray lines show the model relationship D = 1–ANI and numbers in the bottom right of each plot give the root-mean-square error versus this perfect model. Blue lines show linear regression models. Increasing the sketch size improves the accuracy of the Mash distance, especially for more divergent sequences. However, there is a limit on how well the Mash distance can approximate ANI, especially for more divergent genomes (e.g. ANI considers only the core genome)
Fig. 3
Fig. 3
Comparison and de novo clustering of all RefSeq genomes using Mash. Each graph node represents a genome. Two genomes are connected by an edge if their Mash distance D ≤0.05 and P value ≤10–10. Graph layout was performed using Cytoscape [61] organic layout algorithm [62]. Individual nodes are colored by species and the top two rows of clusters have been annotated with the majority species label they contain. Only components containing microbial genomes are shown here (including viruses). Additional file 1: Figures S4 and S5 show eukaryotes, orphan plasmids, and organelles
Fig. 4
Fig. 4
Primate trees from the UCSC genome browser and Mash. a A primate phylogenetic tree model from the UCSC genome browser, with branch lengths derived from fourfold degenerate sites extracted from reference gene multiple alignments. b A comparable Mash-based tree generated from whole genomes using a sketch size of s = 1000 and k-mer size of k = 21. Additional file 1: Figure S6 includes this Mash tree with five additional mammals of increasing divergence
Fig. 5
Fig. 5
Metagenomic clustering of ocean and human metagenomes using Mash. a Comparison of Global Ocean Survey (GOS) clustering using Mash (top left) and COMMET (top right) using raw Sanger sequencing data. Heat maps illustrate the pairwise similarity between samples, scaled between 0 (white) and 100 (red) for comparison to COMMET. Sample groups are identified and colored using the same key as in Rusch et al. [35]. The Mash clustering identifies two large clusters of temperate and tropical water samples as well as subgroupings consistent with the original GOS study. b Human metagenomic samples combined from the HMP and MetaHIT projects clustered by Mash from 888 sequencing runs (bottom left) and 879 assemblies (bottom right). For both sequencing reads and assemblies, Mash successfully clusters samples by body site and appropriately clusters MetaHIT and HMP stool samples together, even though these samples are from different projects with different protocols

References

    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10. doi: 10.1016/S0022-2836(05)80360-2. - DOI - PubMed
    1. GenBank and WGS Statistics. http://www.ncbi.nlm.nih.gov/genbank/statistics. Accessed 31 May 2016.
    1. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13:e1002195. doi: 10.1371/journal.pbio.1002195. - DOI - PMC - PubMed
    1. Broder AZ. On the resemblance and containment of documents. Compression and Complexity of Sequences 1997 - Proceedings 1998:21–29.
    1. Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. Dallas, TX: ACM; 1998.

Publication types