PLoS One. 2021 Dec 22;16(12):e0261548.
doi: 10.1371/journal.pone.0261548. eCollection 2021.

NGS read classification using AI

Benjamin Voigt et al. PLoS One. 2021.

Erratum in

  • Correction: NGS read classification using AI.
    Voigt B, Fischer O, Krumnow C, Herta C, Dabrowski PW. PLoS One. 2024 Apr 1;19(4):e0301793. doi: 10.1371/journal.pone.0301793. eCollection 2024. PMID: 38557766. Free PMC article.

Abstract

Clinical metagenomics is a powerful diagnostic tool because it offers an open view of all DNA in a patient's sample. This allows the detection of pathogens that would be missed by classical, target-specific assays. However, because metagenomic sequencing is untargeted, it generates a large amount of non-specific data, and diagnosis takes place only at the data analysis stage, where the relevant sequences are identified. Typically, this is done by comparison to reference databases. While this approach has been optimized over the past years and works well for pathogens that are represented in the databases used, a common challenge arises when no pathogen sequences are found in a metagenomic patient sample: is there truly no evidence of a pathogen in the data, or is the pathogen's genome simply absent from the database, so that its sequences could not be classified? Here, we present a novel approach to detecting novel pathogens in metagenomic datasets by classifying the proteins, or protein segments, encoded by the sequences in those datasets. We train a neural network on coding sequences labeled by taxonomic domain and use it to predict the taxonomic classification of sequences that cannot be classified by comparison to a reference database, thus facilitating the detection of potential novel pathogens.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of the complete neural network classification pipeline.
The pipeline consists of four major blocks: (1) preprocessing of NGS reads, (2) frame classification of NGS reads, (3) frame correction and translation of NGS reads, and (4) taxonomic classification of the resulting amino acid sequences. The dotted arrow shows an optional loop back through the frame classification, used to check the frame correction block, as shown in Fig 2.
Fig 2. Prediction results of the frame classification model on the test dataset.
Predictions of the most probable class k̂ on the frame test data [41] are shown as an error matrix. The classes are as follows: on-frame (0), offset by one base (1), offset by two bases (2), reverse-complementary (3), reverse-complementary and offset by one base (4), reverse-complementary and offset by two bases (5). (A) Initial classification results of the reads. (B) Re-evaluation of the reads after applying the frame correction, to validate that the reads were correctly shifted to be on-frame, i.e., k = 0.
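The six frame classes listed in the Fig 2 caption suggest how a frame correction and translation step could work. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it assumes that a read in class k is brought on-frame by reverse-complementing it when k >= 3 and then dropping k mod 3 leading bases. The order of these two operations and the direction of the offset are assumptions. Translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def correct_frame(read, frame_class):
    """Shift a read to be on-frame (hypothetical helper).

    Assumption: classes 3-5 are reverse-complemented first, then
    frame_class % 3 leading bases are dropped.
    """
    if not 0 <= frame_class <= 5:
        raise ValueError("frame_class must be in 0..5")
    if frame_class >= 3:
        read = reverse_complement(read)
    return read[frame_class % 3:]


def translate(seq):
    """Translate complete codons; codons with unknown bases map to 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


# Usage: a read offset by one base (class 1) is trimmed, then translated.
on_frame = correct_frame("CATGGCC", 1)  # "ATGGCC"
protein = translate(on_frame)           # "MA"
```

Under these assumptions, passing a corrected read back through a frame classifier should yield class 0, which is the check that panel (B) describes.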
Fig 3. Prediction results of the taxonomic classification model on the test dataset.
Error matrix (A) and ROC curve (B) on the taxonomic test dataset [39] are shown. The classes are as follows: 0—viral, 1—bacterial, 2—mammalian.
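For readers unfamiliar with the two plot types named in this caption, the sketch below shows one common way to compute them for a three-class problem. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, the error matrix is assumed to have true classes as rows and predicted classes as columns, and the ROC curve is computed one-vs-rest for a single class without special handling of tied scores.

```python
def error_matrix(true_labels, predicted_labels, n_classes=3):
    """Count predictions per (true class, predicted class) pair.

    Assumption: rows index the true class, columns the predicted class.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def one_vs_rest_roc(true_labels, class_scores, positive_class):
    """Return (false positive rate, true positive rate) points for one class.

    class_scores holds the model's score for positive_class on each sample.
    Samples are visited from highest to lowest score; ties are not merged.
    """
    positives = sum(1 for label in true_labels if label == positive_class)
    negatives = len(true_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    ranked = sorted(zip(class_scores, true_labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, label in ranked:
        if label == positive_class:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Usage with the caption's class coding: 0 = viral, 1 = bacterial, 2 = mammalian.
truth = [0, 1, 2, 0]
predicted = [0, 1, 1, 0]
viral_scores = [0.9, 0.2, 0.1, 0.8]  # made-up scores for illustration only

matrix = error_matrix(truth, predicted)
roc = one_vs_rest_roc(truth, viral_scores, positive_class=0)
```

Repeating the one-vs-rest computation for each of the three classes gives one curve per class.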
Fig 4. Prediction results of the complete classification pipeline on data from real metagenomic sequencing studies.
ROC curves for taxonomic classification in swine feces metagenome (A) and human skin metagenome (B) datasets. The classes are as follows: 0—viral, 1—bacterial, 2—mammalian.
