[Preprint]. 2023 Aug 30:2023.08.15.553308.
doi: 10.1101/2023.08.15.553308.

Sigmoni: classification of nanopore signal with a compressed pangenome index

Vikram S Shivakumar et al. bioRxiv.

Abstract

Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of gigabase pairs (Gbp). Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes.
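The quantization step described in the abstract can be illustrated with a minimal sketch. It is not Sigmoni's implementation: the function name `quantize_signal`, the equal-width binning, the bin count, and the picoamp bounds `lo_pa` and `hi_pa` are all assumptions chosen for illustration.

```python
def quantize_signal(signal_pa, n_bins=6, lo_pa=60.0, hi_pa=120.0):
    """Map picoamp samples to characters from a small alphabet.

    Assumptions (not taken from the paper): bins are equal-width over
    [lo_pa, hi_pa], and samples outside that range are clipped to it.
    """
    if not 1 <= n_bins <= 26:
        raise ValueError("n_bins must be between 1 and 26")
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:n_bins]
    width = (hi_pa - lo_pa) / n_bins
    chars = []
    for x in signal_pa:
        clipped = min(max(x, lo_pa), hi_pa)
        # The top edge of the range falls into the last bin.
        idx = min(int((clipped - lo_pa) / width), n_bins - 1)
        chars.append(alphabet[idx])
    return "".join(chars)


# Hypothetical current samples, in picoamps.
binned = quantize_signal([60, 75, 90, 105, 119.9, 200], n_bins=4)
```

Once the signal is a string over a small alphabet, it can be queried with exact-match text indexes, which is what makes an r-index-based approach applicable.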

Conflict of interest statement

Competing interests: SK has received travel funding from Oxford Nanopore Technologies Limited.

Figures

Figure 1:
Overview of Sigmoni mapping procedure: (top) Query is discretized into “bins”, which are further converted into arbitrary characters from a small alphabet for exact matching. The reference is digested into k-mers and converted to the same alphabet based on the expected current level. (bottom) The matching length profile (left) defines the exact match length at each position along the query with respect to the reference. Using a “shredded” sampled document array, matches are mapped back to reference regions to identify a cluster of matches. Here, a read maps to Ref 5, which is the predicted reference (in red).
Figure 2:
(A) Comparison of mock community multi-class classification confusion matrices for each signal-based method. The diagonal (TP) is omitted, with proportion of reads provided instead to highlight off-diagonal (misclassified) reads. (B) Confusion matrix of human chromosome-level classification of NA12878 reads against CHM13. As the donor individual is female, ChrY was omitted from the reference. NC = not classified.
Figure 3:
(A-C) Binary classification of yeast-origin reads from a mock community on “chunks” of signal. Each chunk represents 1 second of sequencing, ~420 bp; (A) F1 score (unclassified reads are considered bacterial in origin), (B) classification speed for increasing length signal chunks, (C) proportion of reads classified by each method. (D-F) Binary classification on a hybrid dataset of human-origin (NA12878) reads and Zymo mock community reads; (D) F1 score (unclassified reads are considered human-origin, as in the case of a “host depletion” experiment), (E) classification speed, (F) read classification rate.
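The matching length profile described in the Figure 1 caption can be illustrated with a toy sketch. It uses a naive substring search in place of the r-index and whole reference names in place of the shredded sampled document array, and the vote-based prediction rule with a `min_len` cutoff is an assumption, not the classifier used by Sigmoni. The function names and example strings are hypothetical.

```python
from collections import Counter


def matching_lengths(query, refs):
    """For each query position, return (length, reference name) of the
    longest exact match starting at that position in any reference.

    Naive quadratic search, for illustration only.
    """
    profile = []
    for i in range(len(query)):
        best_len, best_ref = 0, None
        for name, text in refs.items():
            length = 0
            # Extend the match while the substring still occurs in this reference.
            while i + length < len(query) and query[i:i + length + 1] in text:
                length += 1
            if length > best_len:
                best_len, best_ref = length, name
        profile.append((best_len, best_ref))
    return profile


def predict_reference(profile, min_len=3):
    """Assumed rule: each position with a match of at least min_len
    votes for its reference; the reference with the most votes wins."""
    votes = Counter(ref for length, ref in profile
                    if ref is not None and length >= min_len)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical references and query over a binned-signal alphabet.
refs = {"ref1": "ABCABCDD", "ref2": "DDCCBBAA"}
profile = matching_lengths("CCBBA", refs)
predicted = predict_reference(profile)
```

Long matches that cluster on one reference drive the prediction, while short matches that occur by chance are filtered out by the length cutoff.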
