This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Aug 30:2023.08.15.553308.

doi: 10.1101/2023.08.15.553308.

Sigmoni: classification of nanopore signal with a compressed pangenome index

Vikram S Shivakumar¹, Omar Y Ahmed¹, Sam Kovaka¹, Mohsen Zakeri¹, Ben Langmead¹

PMID: 37645873
PMCID: PMC10462034
DOI: 10.1101/2023.08.15.553308

Vikram S Shivakumar et al. bioRxiv. 2023.

Affiliation

¹ Department of Computer Science, Johns Hopkins University.

Update in

Sigmoni: classification of nanopore signal with a compressed pangenome index.
Shivakumar VS, Ahmed OY, Kovaka S, Zakeri M, Langmead B. Bioinformatics. 2024 Jun 28;40(Suppl 1):i287-i296. doi: 10.1093/bioinformatics/btae213. PMID: 38940135.

Abstract

Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. However, past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of gigabase pairs (Gbp). Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes.
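The two core ideas in the abstract, quantizing signal into picoamp ranges and computing matching statistics, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bin boundaries (`lo`, `hi`, `n_bins`) are arbitrary assumptions, and the matching statistics are computed by brute-force substring search rather than with an r-index, so it only shows what the quantities mean, not how Sigmoni computes them at scale.

```python
import string


def quantize(signal_pa, lo=60.0, hi=120.0, n_bins=6):
    """Map each current value (pA) to a letter naming its bin.

    lo, hi and n_bins are illustrative assumptions; values outside
    [lo, hi) are clamped to the first or last bin.
    """
    width = (hi - lo) / n_bins
    letters = []
    for x in signal_pa:
        b = int((x - lo) // width)
        b = min(max(b, 0), n_bins - 1)
        letters.append(string.ascii_uppercase[b])
    return "".join(letters)


def matching_statistics(query, reference):
    """MS[i] = length of the longest prefix of query[i:] found in reference.

    Brute force for clarity; a compressed index would be used in practice.
    """
    ms = []
    for i in range(len(query)):
        length = 0
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(quantize([60, 75, 95, 119.9, 130, 10]))
    print(matching_statistics("ABCAB", "XABCY"))
```

Long matching statistics along a read suggest that the read shares long exact stretches with the reference in the quantized alphabet, which is the signal a classifier can build on.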

Conflict of interest statement

SK has received travel funding from Oxford Nanopore Technologies Limited.

Figures

**Figure 1:**
Overview of Sigmoni mapping procedure: (top) Query is discretized into “bins”, which are further converted into arbitrary characters from a small alphabet for exact matching. The reference is digested into k-mers and converted to the same alphabet based on the expected current level. (bottom) The matching length profile (left) defines the exact match length at each position along the query with respect to the reference. Using a “shredded” sampled document array, matches are mapped back to reference regions to identify a cluster of matches. Here, a read maps to Ref 5, which is the predicted reference (in red).
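A toy Python sketch of the procedure in Figure 1 follows. Everything named here is an assumption for illustration: `toy_expected_current` is a made-up stand-in for a real pore model (which would be a lookup table of measured levels per k-mer), the bin boundaries are arbitrary, and the per-reference voting in `classify` is a simplified substitute for the sampled document array and match clustering described in the caption.

```python
from collections import Counter


def toy_expected_current(kmer):
    """HYPOTHETICAL pore model: mean of made-up per-base levels (pA)."""
    base_level = {"A": 70.0, "C": 85.0, "G": 100.0, "T": 115.0}
    return sum(base_level[b] for b in kmer) / len(kmer)


def to_bin(level_pa, lo=60.0, hi=120.0, n_bins=4):
    """Convert a current level to a bin letter (boundaries are assumptions)."""
    width = (hi - lo) / n_bins
    b = int((level_pa - lo) // width)
    return "ABCDEFGH"[min(max(b, 0), n_bins - 1)]


def reference_to_bins(seq, k=2):
    """Digest seq into k-mers and emit one bin letter per k-mer."""
    return "".join(
        to_bin(toy_expected_current(seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


def longest_match_doc(query, i, docs):
    """Return (length, name) of the longest match of query[i:] in any doc."""
    best_len, best_doc = 0, None
    for name, text in docs.items():
        length = 0
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        if length > best_len:
            best_len, best_doc = length, name
    return best_len, best_doc


def classify(query, docs, min_len=3):
    """Vote for the reference holding each sufficiently long match."""
    votes = Counter()
    for i in range(len(query)):
        length, doc = longest_match_doc(query, i, docs)
        if length >= min_len:
            votes[doc] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    docs = {
        "ref1": reference_to_bins("ACGTACGT"),
        "ref2": reference_to_bins("AAAACCCC"),
    }
    print(docs)
    print(classify(reference_to_bins("GTACG"), docs))
```

Because the query and the reference end up in the same small alphabet, ordinary exact string matching can stand in for approximate signal comparison, which is the point the figure makes.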

**Figure 2:**
**(A)** Comparison of mock community multi-class classification confusion matrices for each signal-based method. The diagonal (TP) is omitted, with proportion of reads provided instead to highlight off-diagonal (misclassified) reads. **(B)** Confusion matrix of human chromosome-level classification of NA12878 reads against CHM13. As the donor individual is female, ChrY was omitted from the reference. NC = not classified.

**Figure 3:**
**(A-C)** Binary classification of yeast-origin reads from a mock community on “chunks” of signal. Each chunk represents 1 second of sequencing, ~420 bp; **(A)** F1 score (unclassified reads are considered bacterial in origin), **(B)** classification speed for increasing length signal chunks, **(C)** proportion of reads classified by each method. **(D-F)** Binary classification on a hybrid dataset of human-origin (NA12878) reads and Zymo mock community reads; **(D)** F1 score (unclassified reads are considered human-origin, as in the case of a “host depletion” experiment), **(E)** classification speed, **(F)** read classification rate.
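The F1 convention in panels (A) and (D), where unclassified reads are assigned to a default class before scoring, can be sketched as follows. This is an illustrative computation only: the function name, the choice of `None` to mark an unclassified read, and the choice of which class counts as positive are assumptions, not details taken from the paper.

```python
def f1_with_default(truth, predicted, positive, default):
    """F1 for the positive class, treating None predictions as `default`."""
    resolved = [default if p is None else p for p in predicted]
    tp = sum(t == positive and p == positive for t, p in zip(truth, resolved))
    fp = sum(t != positive and p == positive for t, p in zip(truth, resolved))
    fn = sum(t == positive and p != positive for t, p in zip(truth, resolved))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = ["yeast", "yeast", "bacteria", "bacteria", "yeast"]
    predicted = ["yeast", None, "yeast", "bacteria", "yeast"]
    # The unclassified yeast read is counted as bacterial, i.e. a false negative.
    print(f1_with_default(truth, predicted, positive="yeast", default="bacteria"))
```

Under this convention a method that leaves many reads unclassified is penalized through recall on the non-default class, so the F1 panels and the classification-rate panels should be read together.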
