J Open Source Softw. 2019 Dec 10;4(44):10.21105/joss.01762.
doi: 10.21105/joss.01762.

Mashtree: a rapid comparison of whole genome sequence files

Lee S Katz et al. J Open Source Softw.
No abstract available

Figures

Figure 1:
The Mashtree workflow. Step 1) Sketch genomes with Mash. In this schematic, a green circle represents each genome in the analysis. Filled-in brown circles indicate the presence of a kmer, and missing circles represent true absence. After hashing with a sketch size of six (after the arrow), some kmers are not represented in the Mash sketch, either because they are not present in the original genome or because only a finite number of kmers are sketched (six in this example). Henceforth, empty circles represent hashes that are either truly missing or not included in the Mash sketch. Step 2) Calculate distances with Mash dist. The values shown in the figure are Jaccard indices, calculated as the size of the intersection divided by the size of the union. In this example, the genome pairs have Jaccard indices of 5/9, 4/9, and 3/9. Mash internally transforms these values into Mash distances (Ondov et al., 2016). Step 3) Create a dendrogram with Quicktree using the Mash distance matrix.
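Step 2 can be illustrated with a minimal sketch. It assumes the intersection-over-union value described in the caption and the Mash distance formula from Ondov et al. (2016), D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard estimate and k is the kmer length. The function names, the toy hash sets, and k = 21 are illustrative choices, not taken from the Mashtree source code.

```python
import math


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def mash_distance(j, k):
    """Mash distance for a Jaccard estimate j and kmer length k.

    Uses D = -(1/k) * ln(2j / (1 + j)); when no hashes are shared
    the distance is capped at 1.
    """
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


# Toy hash sets (hypothetical): 5 shared hashes out of 9 in the union.
genome_1 = {1, 2, 3, 4, 5, 6, 7}
genome_2 = {1, 2, 3, 4, 5, 8, 9}

j = jaccard_index(genome_1, genome_2)
d = mash_distance(j, k=21)
print(f"Jaccard = {j:.4f}, Mash distance = {d:.5f}")
```

Real Mash estimates the Jaccard value from the smallest hashes of the two sketches rather than from complete kmer sets, so this example only shows the arithmetic of the transformation.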
Figure 2:
The Mashtree bootstrap workflow. Step 1) Generate a tree with the normal workflow as in Figure 1. This is the main tree. Step 2) Run the normal workflow once per replicate, each time with a different random seed. In this example, the top right replicate differs from the main tree. Together, these ten trees are the bootstrap tree replicates. Step 3) For each parent node in the main tree, quantify how many bootstrap tree replicates have the same node with the same children. Record that percentage next to each parent node. This percentage indicates how well supported each Mashtree cluster is, controlling for the random seed in the Mash program.
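The support calculation in Step 3 can be sketched as follows. This is a simplified illustration, not Mashtree's implementation: trees are written as nested tuples of leaf names, and a node is treated as "the same" when it has the same set of descendant leaves, which is an assumption made here for brevity. The tree shapes and genome names are hypothetical.

```python
def clades(tree):
    """Return one frozenset of leaf names per internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def support(main_tree, replicates):
    """Percentage of replicate trees that contain each clade of the main tree."""
    replicate_clades = [clades(t) for t in replicates]
    return {
        clade: 100.0 * sum(clade in rc for rc in replicate_clades) / len(replicates)
        for clade in clades(main_tree)
    }


# Hypothetical example: nine replicates match the main tree, one does not.
main = ((("A", "B"), "C"), "D")
replicates = [main] * 9 + [((("A", "C"), "B"), "D")]

for clade, pct in sorted(support(main, replicates).items(), key=lambda x: len(x[0])):
    print(sorted(clade), f"{pct:.0f}%")
```

The same counting applies to jackknife replicates; only the way the replicate trees are produced differs.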
Figure 3:
The Mashtree jackknife workflow. Step 1) Generate a tree with the normal workflow as in Figure 1. This is the main tree. Step 2) For each replicate, sample half of the hashes without replacement for each query genome. Recalculate the Mash distance between the query genome and all other genomes, reducing the denominator to one half, rounded up, to reflect the smaller pool of hashes. After every genome has served as the query genome, average the distances to create a new distance matrix. For brevity, only one detailed replicate is shown. Step 3) For each replicate, calculate a new tree from the new distance matrix. In this example, the top right replicate differs from the main tree. Together, these ten trees are the jackknife tree replicates. Step 4) For each parent node in the main tree, quantify how many jackknife tree replicates have the same node with the same children. Record that percentage next to each parent node. This percentage indicates how confident Mashtree is in each cluster, controlling for stochasticity in the hashes.
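One jackknife replicate from Step 2 might look roughly like the sketch below. It rests on several assumptions that are not confirmed by the caption: each sketch is a plain set of integer hashes, the shared-hash count is divided by the reduced sample size (half the sketch, rounded up), the Mash distance formula D = -(1/k) ln(2j / (1 + j)) is applied to that ratio, and the two directed distances for a pair are averaged. All function names and the toy sketches are hypothetical.

```python
import math
import random


def jackknife_distance(query, other, k, rng):
    """Distance from a query sketch to another sketch using half of the query hashes."""
    keep = math.ceil(len(query) / 2)          # reduced denominator, rounded up
    sampled = rng.sample(sorted(query), keep)  # sampling without replacement
    other = set(other)
    shared = sum(1 for h in sampled if h in other)
    j = shared / keep
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def jackknife_matrix(sketches, k, rng):
    """Average the two directed distances (each genome as query) for every pair."""
    names = sorted(sketches)
    one_way = {
        (q, o): jackknife_distance(sketches[q], sketches[o], k, rng)
        for q in names
        for o in names
        if q != o
    }
    return {
        (a, b): (one_way[(a, b)] + one_way[(b, a)]) / 2
        for a in names
        for b in names
        if a < b
    }


# Toy sketches with a sketch size of six (hypothetical hash values).
sketches = {
    "genome_1": {1, 2, 3, 4, 5, 6},
    "genome_2": {1, 2, 3, 4, 7, 8},
    "genome_3": {11, 12, 13, 14, 15, 16},
}
matrix = jackknife_matrix(sketches, k=21, rng=random.Random(42))
for pair, dist in matrix.items():
    print(pair, round(dist, 4))
```

Repeating this with a different seed per replicate and building a tree from each matrix would give the set of jackknife replicate trees described in Step 3.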

