Sigmoni: classification of nanopore signal with a compressed pangenome index

Vikram S Shivakumar¹, Omar Y Ahmed¹, Sam Kovaka¹, Mohsen Zakeri¹, Ben Langmead¹

PMID: 38940135
PMCID: PMC11211819
DOI: 10.1093/bioinformatics/btae213

Vikram S Shivakumar et al. Bioinformatics. 2024 Jun 28;40(Suppl 1):i287-i296. doi: 10.1093/bioinformatics/btae213.

Affiliation

¹ Department of Computer Science, Johns Hopkins University, 3400 North Charles St., Baltimore, MD 21218, United States.

Abstract

Summary: Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling, but past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni, a rapid, multiclass classification method based on the r-index that scales to references of hundreds of gigabase pairs (Gbp). Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics, all in linear query time without the need for seed-chain-extend. Sigmoni is 10–100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes. Sigmoni is the first signal-based tool to scale to a complete human genome and pangenome while remaining fast enough for adaptive sampling applications.
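
The quantization step mentioned in the summary can be illustrated with a minimal sketch. It is not Sigmoni's implementation: the function name `quantize`, the 60-120 pA range, the six equal-width bins, and the letter alphabet are all assumptions chosen for illustration, and the summary does not specify how Sigmoni chooses its bins.

```python
from bisect import bisect_right
import string


def quantize(signal_pa, lo=60.0, hi=120.0, n_bins=6):
    """Map each picoamp value to a letter naming one of n_bins ranges.

    Hypothetical helper: equal-width bins over [lo, hi] are an assumption.
    Values below lo fall into the first bin, values above hi into the last.
    """
    width = (hi - lo) / n_bins
    # Interior boundaries between consecutive bins (n_bins - 1 of them).
    edges = [lo + width * i for i in range(1, n_bins)]
    alphabet = string.ascii_uppercase[:n_bins]
    # bisect_right returns how many boundaries are <= x, i.e. the bin index.
    return "".join(alphabet[bisect_right(edges, x)] for x in signal_pa)


read_as_text = quantize([65.0, 75.0, 95.0, 130.0, 50.0])
```

Once the signal is a string over a small alphabet, exact-match text indexes can be applied to it, which is the property the summary relies on.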

Availability and implementation: Sigmoni is implemented in Python, and is available open-source at https://github.com/vshiv18/sigmoni.

Conflict of interest statement

S.K. has received travel funding from Oxford Nanopore Technologies Limited.

Figures

**Figure 1.**
Overview of Sigmoni mapping procedure: (top) Query is discretized into “bins,” which are further converted into arbitrary characters from a small alphabet for exact matching. The reference is digested into k-mers and converted to the same alphabet based on the expected current level. (bottom) The matching length profile (left) defines the exact match length at each position along the query with respect to the reference. Using a “shredded” sampled document array, matches are mapped back to reference regions to identify a cluster of matches. Here, a read maps to Ref 5, which is the predicted reference.
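
To make the matching length profile in this caption concrete, here is a toy sketch under stated assumptions. It uses naive substring search rather than the r-index and sampled document array described in the paper, so it runs in quadratic rather than linear time. The names `matching_lengths` and `predict` are hypothetical, and scoring each reference by its mean matching length is a simplification, not the classification rule Sigmoni uses.

```python
def matching_lengths(query, ref):
    """For each query position, the length of the longest substring
    starting there that also occurs somewhere in ref (naive search)."""
    profile = []
    for i in range(len(query)):
        n = 0
        # Extend the match one character at a time while it still occurs.
        while i + n < len(query) and query[i:i + n + 1] in ref:
            n += 1
        profile.append(n)
    return profile


def predict(query, refs):
    """Return the reference name with the largest mean matching length,
    plus all scores. The mean is an assumed toy score."""
    denom = max(len(query), 1)  # avoid dividing by zero on an empty query
    scores = {
        name: sum(matching_lengths(query, seq)) / denom
        for name, seq in refs.items()
    }
    return max(scores, key=scores.get), scores


best, scores = predict("ABCD", {"ref1": "XABCY", "ref2": "DDDD"})
```

In this toy input, `"ABCD"` shares the run `"ABC"` with `ref1` but only a single character with `ref2`, so `ref1` receives the higher score.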

**Figure 2.**
(A) Comparison of mock community multi-class classification confusion matrices for each signal-based method. The diagonal (TP) is omitted, with proportion of reads provided instead to highlight off-diagonal (misclassified) reads. (B) Confusion matrix of human chromosome-level classification of NA12878 reads against CHM13. As the donor individual is female, ChrY was omitted from the reference. NC, not classified.

**Figure 3.**
(A–C) Binary classification of yeast-origin reads from a mock community on “chunks” of signal. Each chunk represents 1 s of sequencing, ∼ 420 bp; (A) F1 score (unclassified reads are considered bacterial in origin), (B) classification speed for increasing length signal chunks, (C) proportion of reads classified by each method. (D–F) Binary classification on a hybrid dataset of human-origin (NA12878) reads and Zymo mock community reads; (D) F1 score (unclassified reads are considered human-origin, as in the case of a “host depletion” experiment), (E) classification speed, (F) read classification rate.
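
The convention of folding unclassified reads into one class before scoring can be sketched as follows. The function name `f1_with_default`, the use of `None` for an unclassified read, and the choice of which class counts as positive are assumptions for illustration; the caption states only which class unclassified reads are assigned to.

```python
def f1_with_default(truth, predicted, positive, default):
    """F1 score for the `positive` class, treating unclassified reads
    (represented here as None) as belonging to `default`."""
    calls = [default if p is None else p for p in predicted]
    pairs = list(zip(truth, calls))
    tp = sum(t == positive and c == positive for t, c in pairs)
    fp = sum(t != positive and c == positive for t, c in pairs)
    fn = sum(t == positive and c != positive for t, c in pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


truth = ["yeast", "yeast", "bacteria", "bacteria"]
predicted = ["yeast", None, "yeast", "bacteria"]
score = f1_with_default(truth, predicted, positive="yeast", default="bacteria")
```

Here the unclassified yeast read is counted as a bacterial call, so it becomes a false negative and lowers the score.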

Update of

Shivakumar VS, Ahmed OY, Kovaka S, Zakeri M, Langmead B. Sigmoni: classification of nanopore signal with a compressed pangenome index. bioRxiv [Preprint]. 2023 Aug 30:2023.08.15.553308. doi: 10.1101/2023.08.15.553308. Update in: Bioinformatics. 2024 Jun 28;40(Suppl 1):i287-i296. doi: 10.1093/bioinformatics/btae213. PMID: 37645873.
