This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Nov 2:2024.10.29.620953.

doi: 10.1101/2024.10.29.620953.

Improved pangenomic classification accuracy with chain statistics

Nathaniel K Brown¹, Vikram S Shivakumar¹, Ben Langmead¹

Affiliations

PMID: 39554056
PMCID: PMC11565826
DOI: 10.1101/2024.10.29.620953

Improved pangenomic classification accuracy with chain statistics

Nathaniel K Brown et al. bioRxiv. 2024.

[Preprint]. 2024 Nov 2:2024.10.29.620953.

doi: 10.1101/2024.10.29.620953.

Authors

Nathaniel K Brown¹, Vikram S Shivakumar¹, Ben Langmead¹

Affiliation

¹ Department of Computer Science, Johns Hopkins University, Baltimore MD 21218.

PMID: 39554056
PMCID: PMC11565826
DOI: 10.1101/2024.10.29.620953

Abstract

Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity ("chain") statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be "tunneled," and mark these with the corresponding multi-MUM identifiers. This yields an $O (r + n / d)$ -space index for a collection of $d$ sequences having a length- $n$ BWT consisting of $r$ maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignment-based methods.

Keywords: Pangenomics; sequence classification; text indexing.

PubMed Disclaimer

Conflict of interest statement

Disclosure of Interests. Ben Langmead is the founder of InOrder Labs, LLC.

Figures

**Fig. A1.**
Explores the experiments of Section 4.1 by varying the number of bits allocated to store $i d' s$ . The metrics are with respect to classifying co-linear peaks of 32 HPRC chromosome 19’s using 10kbp reads with SNV rate of 1% and significant peak value 20.

**Fig.1.**
**(a)** A set of 4 sequences (“pangenome”) with multi-MUMs highlighted. Note that two multi-MUMs overlap. Also note that our method computes multi-MUMs from the sequences directly, and does not compute a multiple sequence alignment. **(b)** Matching statistic lengths (left) and their triangle-based schematic representation (right). **(c)** Schematic example of matching-statistics lengths derived from a query read and colored according to chain statistics. In this case, the chain statistics give a stronger basis for classifying the read as matching the reference index since matches are consistently from a single (orange) multi-MUM. **(d)** Similar example to (c) but for a read spanning adjacent (red and blue) multi-MUMs. **(e)** Example with no clear evidence of chaining, giving a stronger basis for classifying the read as not originating from the pangenome.

**Fig. 2.**
Forward stepping from multi-MUMs carves out sub-runs which can be marked with an $i d$ . We stop when FL steps are no longer contiguous, signifying the end of a tunnel, or truncated on overlap with another multi-MUM (not pictured). Tunnels were originally motivated as a strong predictor of BWT context [3], since backwards stepping anywhere within the range returns identical characters.

**Fig.3.**
**Left:** Percentage of characters covered by a sub-run when considering only multi-MUM-tunnels (tnl) or all multi-MUM sub-runs (all) using HPRC collections. **Right:** The number of sub-runs (or runs, for BWT) in billions resulting from marking multi-MUMs of HPRC collections, where $s$ is the sub-sampling parameter.

**Fig.4.**
Results for splitting when considering only multi-MUM-tunnels (tnl) or all multi-MUM sub-runs (all). **Left:** The accuracy of identifying co-linear peaks using 32 HPRC chromosome 19 copies using 10kbp reads with SNV rate of 1%. **Right:** The accuracy across collections of HPRC chromosome 19 copies alongside coverage of $i d$ sub-runs using a significant peak value of 20.

See this image and copyright information in PMC

References

1. Ahmed O., Rossi M., Gagie T., Boucher C., Langmead B.: SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biology 24(1), 122 (2023) - PMC - PubMed
1. Ahmed O., Rossi M., Kovaka S., Schatz M., Gagie T., Boucher C., Langmead B.: Pan-genomic matching statistics for targeted nanopore sequencing. Iscience 24(6) (2021) - PMC - PubMed
1. Baier U.: On undetected redundancy in the burrows-wheeler transform. In: Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik; (2018)
1. Baier U., Dede K.: Bwt tunnel planning is hard but manageable. 2019 Data Compression Conference (DCC) pp. 142–151 (2019), https://api.semanticscholar.org/CorpusID:155107876
1. Baláž A., Gagie T., Goga A., Heumos S., Navarro G., Petescia A., Sirén J.: Wheeler maps. In: Latin American Symposium on Theoretical Informatics. pp. 178–192. Springer; (2024)

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

This is a preprint.

Improved pangenomic classification accuracy with chain statistics

Affiliation

Improved pangenomic classification accuracy with chain statistics

Authors

Affiliation

Abstract

Conflict of interest statement

Figures

References

Publication types

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous