Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2024 Nov 2:2024.10.29.620953.
doi: 10.1101/2024.10.29.620953.

Improved pangenomic classification accuracy with chain statistics

Affiliations

Improved pangenomic classification accuracy with chain statistics

Nathaniel K Brown et al. bioRxiv. .

Abstract

Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity ("chain") statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be "tunneled," and mark these with the corresponding multi-MUM identifiers. This yields an O ( r + n / d ) -space index for a collection of d sequences having a length- n BWT consisting of r maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignment-based methods.

Keywords: Pangenomics; sequence classification; text indexing.

PubMed Disclaimer

Conflict of interest statement

Disclosure of Interests. Ben Langmead is the founder of InOrder Labs, LLC.

Figures

Fig. A1.
Fig. A1.
Explores the experiments of Section 4.1 by varying the number of bits allocated to store ids. The metrics are with respect to classifying co-linear peaks of 32 HPRC chromosome 19’s using 10kbp reads with SNV rate of 1% and significant peak value 20.
Fig.1.
Fig.1.
(a) A set of 4 sequences (“pangenome”) with multi-MUMs highlighted. Note that two multi-MUMs overlap. Also note that our method computes multi-MUMs from the sequences directly, and does not compute a multiple sequence alignment. (b) Matching statistic lengths (left) and their triangle-based schematic representation (right). (c) Schematic example of matching-statistics lengths derived from a query read and colored according to chain statistics. In this case, the chain statistics give a stronger basis for classifying the read as matching the reference index since matches are consistently from a single (orange) multi-MUM. (d) Similar example to (c) but for a read spanning adjacent (red and blue) multi-MUMs. (e) Example with no clear evidence of chaining, giving a stronger basis for classifying the read as not originating from the pangenome.
Fig. 2.
Fig. 2.
Forward stepping from multi-MUMs carves out sub-runs which can be marked with an id. We stop when FL steps are no longer contiguous, signifying the end of a tunnel, or truncated on overlap with another multi-MUM (not pictured). Tunnels were originally motivated as a strong predictor of BWT context [3], since backwards stepping anywhere within the range returns identical characters.
Fig.3.
Fig.3.
Left: Percentage of characters covered by a sub-run when considering only multi-MUM-tunnels (tnl) or all multi-MUM sub-runs (all) using HPRC collections. Right: The number of sub-runs (or runs, for BWT) in billions resulting from marking multi-MUMs of HPRC collections, where s is the sub-sampling parameter.
Fig.4.
Fig.4.
Results for splitting when considering only multi-MUM-tunnels (tnl) or all multi-MUM sub-runs (all). Left: The accuracy of identifying co-linear peaks using 32 HPRC chromosome 19 copies using 10kbp reads with SNV rate of 1%. Right: The accuracy across collections of HPRC chromosome 19 copies alongside coverage of id sub-runs using a significant peak value of 20.

References

    1. Ahmed O., Rossi M., Gagie T., Boucher C., Langmead B.: SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biology 24(1), 122 (2023) - PMC - PubMed
    1. Ahmed O., Rossi M., Kovaka S., Schatz M., Gagie T., Boucher C., Langmead B.: Pan-genomic matching statistics for targeted nanopore sequencing. Iscience 24(6) (2021) - PMC - PubMed
    1. Baier U.: On undetected redundancy in the burrows-wheeler transform. In: Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik; (2018)
    1. Baier U., Dede K.: Bwt tunnel planning is hard but manageable. 2019 Data Compression Conference (DCC) pp. 142–151 (2019), https://api.semanticscholar.org/CorpusID:155107876
    1. Baláž A., Gagie T., Goga A., Heumos S., Navarro G., Petescia A., Sirén J.: Wheeler maps. In: Latin American Symposium on Theoretical Informatics. pp. 178–192. Springer; (2024)

Publication types

LinkOut - more resources