Genome Biol. 2023 May 18;24(1):122.
doi: 10.1186/s13059-023-02958-1.

SPUMONI 2: improved classification using a pangenome index of minimizer digests

Omar Y Ahmed et al. Genome Biol. 2023.

Abstract

Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2's index is 65 times smaller than minimap2's for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.

Keywords: Classification; Indexing; Minimizer; Pangenome.
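
As a rough illustration of the minimizer digest that the abstract refers to, the Python sketch below slides a large window of length w over a sequence, picks the smallest k-mer in each window, and concatenates the picks. It is not SPUMONI 2's implementation: the function name `minimizer_digest`, the lexicographic ordering of k-mers, the leftmost tie-break, the treatment of the small window as the k-mer length, and the collapsing of repeated picks at the same position are all assumptions made for illustration, and reverse complements are left out.

```python
def minimizer_digest(seq: str, k: int = 4, w: int = 12) -> str:
    """Concatenate the minimizer k-mer chosen in each length-w window of seq.

    Assumptions (illustrative only): k-mers are ordered lexicographically,
    ties go to the leftmost k-mer, and a minimizer selected at the same
    position by consecutive windows is emitted only once.
    """
    if not 0 < k <= w:
        raise ValueError("need 0 < k <= w")

    pieces = []
    last_pos = -1  # start position of the most recently emitted minimizer
    for start in range(len(seq) - w + 1):
        # Start positions of every k-mer that lies fully inside this window.
        candidates = range(start, start + w - k + 1)
        best = min(candidates, key=lambda i: (seq[i:i + k], i))
        if best != last_pos:
            pieces.append(seq[best:best + k])
            last_pos = best
    return "".join(pieces)


if __name__ == "__main__":
    print(minimizer_digest("GATTACA", k=2, w=4))
```

Because neighboring windows often select the same k-mer, the digest is usually shorter than the input, which is what allows an index built over the digest to be smaller than one built over the full sequence.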

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
a Shows the procedure used by SPUMONI 2 to digest a reference into a smaller reference of concatenated minimizers. In practice, SPUMONI 2 would also include the reverse complement of each sequence prior to applying the minimizer scheme. b After generating this minimizer digest, which is typically smaller than the original reference, SPUMONI 2 builds an r-index over the minimizer digest, which in turn leads to a smaller index
Fig. 2
a Shows the relative size of the minimizer-based SPUMONI indexes across a range of large window sizes compared to the index size when indexing the full FASTA file. The dataset indexed is a set of 500 Escherichia coli genomes, and the small window size was kept at 4. b Shows the speed-up achieved when using the minimizer-based indexes to query 1 million short E. coli reads [18] against our index compared to querying against an index over the full FASTA file
Fig. 3
Shows SPUMONI’s binary classification accuracy for indexes of different sizes using different minimizer types. The index was built over 500 Escherichia coli genomes and used minimizer schemes in which the large window size took every integer value from 8 to 24. The read set consisted of simulated ONT reads (mean length = 9000 bp, 95% accuracy) [19] and Illumina reads (150 bp, 99% accuracy) [18] from E. coli and human. The goal was to classify whether each read was from E. coli or human
Fig. 4
a Scatter plot of the 25th percentile of the PML distribution for each contig in a human assembly [21] with respect to a SPUMONI 2 index of contaminants. Contigs whose 25th-percentile PML is 2 or greater are labeled “Suspicious”; contigs where it is less than 2 are labeled “Normal.” b Dot plots showing high-scoring local alignments found with minimap2 [5] for the four suspicious contigs versus sequences in the contaminant pangenome. The suspicious contigs were reported and removed from the assembly in December 2021
Fig. 5
Average ratio of class labels found at the read level when matching Illumina reads (150 bp) simulated from eight microbial species. The SPUMONI 2 index consisted of over 6000 reference genomes for the eight microbial species
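
The contig labeling rule stated in the Fig. 4 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_contig` and the input format (a list of per-position PML values for one contig) are hypothetical, and the percentile is computed with the standard library's inclusive linear-interpolation method, which may differ from the method the authors used.

```python
from statistics import quantiles


def label_contig(pmls, threshold=2.0):
    """Label a contig from its PML values using the Fig. 4 rule.

    Returns "Suspicious" when the 25th percentile of the PML values is at
    least `threshold`, otherwise "Normal". Expects at least two values.
    The percentile method (inclusive, linear interpolation) is an assumption.
    """
    first_quartile = quantiles(pmls, n=4, method="inclusive")[0]
    return "Suspicious" if first_quartile >= threshold else "Normal"


if __name__ == "__main__":
    # Hypothetical PML values, not taken from the paper.
    print(label_contig([0, 0, 1, 1, 0, 1]))
    print(label_contig([5, 6, 7, 8, 3, 4]))
```

Using a low percentile rather than the mean means a contig is flagged only when most of its positions match the contaminant index, so a short contaminant-like stretch in an otherwise clean contig does not trigger the label.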

References

    1. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):1–13. doi: 10.1186/s13059-019-1891-0.
    2. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26(12):1721–1729. doi: 10.1101/gr.210641.116.
    3. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7(1):1–9. doi: 10.1038/ncomms11257.
    4. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–359. doi: 10.1038/nmeth.1923.
    5. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–3100. doi: 10.1093/bioinformatics/bty191.
