Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2025 May 25:2025.05.20.654611.
doi: 10.1101/2025.05.20.654611.

Partitioned Multi-MUM finding for scalable pangenomics

Affiliations

Partitioned Multi-MUM finding for scalable pangenomics

Vikram S Shivakumar et al. bioRxiv. .

Update in

Abstract

Pangenome collections are growing to hundreds of high-quality genomes. This necessitates scalable methods for constructing pangenome alignments that can incorporate newly-sequenced assemblies. We previously developed Mumemto, which computes maximal unique matches (multi-MUMs) across pangenomes using compressed indexing. In this work, we extend Mumemto by introducing two new partitioning and merging strategies. Both strategies enable highly parallel, memory efficient, and updateable computation of multi-MUMs. One of the strategies, called string-based merging, is also capable of conducting the merges in a way that follows the shape of a phylogenetic tree, naturally yielding the multi-MUM for the tree's internal nodes as well as the root. With these strategies, Mumemto now scales to 474 human haplotypes, the only multi-MUM method able to do so. It also introduces a time-memory tradeoff that allows Mumemto to be tailored to more scenarios, including in resource-limited settings.

PubMed Disclaimer

Figures

Figure 1:
Figure 1:
(A) Anchor-based merging requires a common sequence (red) present in each partition. Multi-MUMs are merged by identifying overlaps between partition-specific matches in the anchor coordinate space, and a uniqueness threshold determines if a MUM is still unique in each partition after truncation. (B) String-based merging enables computation of multi-MUMs between partitions without a common sequence. An example tree (left) is shown, highlighting the use case where partial multi-MUMs specific to internal nodes (starred) can be computed by merging subclade-based partitions up a tree. (right) MUM overlaps are computed by running Mumemto on the MUM sequences, and the uniqueness threshold array ensures overlaps remain unique across the merged dataset. (C) An example Burrows-Wheeler Transform (BWT), matrix (BWM), and Longest Common Prefix (LCP) array, with sequence IDs for each suffix shown (ID). A non-maximal unique match (UM) is shown, and the uniqueness threshold for this match is found using the flanking LCP values. (D) A partial multi-MUM (in blue) is found in all-but-one sequence (excluded in red). Using two anchor sequences (red and orange), all-but-one partial MUMs can be computed using an augmented anchor-based merging method (section 2.6).
Figure 2:
Figure 2:
(A) Phylogeny of geographically diverse A. thaliana accessions (Lian et al. 2024), with broad geographical regions colored. Internal nodes are labeled with the coverage of partial multi-MUMs across the leaves of each node. Internal node partial MUMs are computed by merging subtree-based partitions progressively up the phylogeny. (B) Global multi-MUM synteny across the full dataset shown in blue (with inversions in green). Global MUMs are computed by merging all partitions together (representing the root node). Additionally, three geographically distinct subgroups are highlighted and partition-specific multi-MUMs (in purple, with inversions in pink) reveal local structural variation in centromeric regions. Two examples are shown in zoomed-in panels: (D) a novel inversion located in accessions from the Madeira islands, and (E) a complex inversion/rearrangement in the North American subgroup chr5 centromere. (C) Coverage of internal nodes in the phylogeny compared to number of leaves under node. Three highlighted subgroups show particularly high coverage, likely due to geographical isolation (Lian et al. 2024).

References

    1. Abouelhoda MI, Kurtz S Ohlebusch E 2004. “Replacing suffix trees with enhanced suffix arrays”. Journal of Discrete Algorithms 2: pp. 53–86
    1. Alonge M, Lebeigle L, Kirsche M, Jenike K, Ou S, Aganezov S, Wang X, Lippman ZB, Schatz MC Soyk S 2022. “Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing”. Genome Biology 23: p. 258. - PMC - PubMed
    1. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS Haussler D 2004. “Ultraconserved elements in the human genome”. Science 304: pp. 1321–1325 - PubMed
    1. Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G Mun T 2019. “Prefix-free parsing for building big BWTs”. Algorithms for Molecular Biology 14: pp. 1–15 - PMC - PubMed
    1. Brown NK, Shivakumar VS Langmead B (2025), “Improved Pangenomic Classification Accuracy with Chain Statistics”, in: Research in Computational Molecular Biology, ed. by Sankararaman S, Cham: Springer Nature Switzerland, pp. 190–208

Publication types

LinkOut - more resources