PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data

Carlus Deneke¹, Robert Rentzsch¹, Bernhard Y Renard¹

PMID: 28051068
PMCID: PMC5209729
DOI: 10.1038/srep39194

Carlus Deneke et al. Sci Rep. 2017 Jan 4:7:39194. doi: 10.1038/srep39194.

Affiliation

¹ Research Group Bioinformatics (NG4), Robert Koch Institute, 13353, Berlin, Germany.

Abstract

The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotype. To that end we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.

Figures

**Figure 1. Overview of PaPrBaG workflow.**
Reads are simulated from genomes in both the training (left) and prediction (center) workflows, from which features are extracted. The training sequence features together with the associated phenotype labels compose the training database, on which the random forest algorithm trains a pathogenicity classifier. This classifier predicts the pathogenic potential for each read in the test set. From these raw results, the prediction profile, the genome aggregate prediction and a combined prediction can be generated (right).
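The first two workflow steps (read simulation and feature extraction) can be illustrated with a minimal sketch. The caption does not say which features are extracted or how reads are simulated, so everything here is an assumption made for illustration: reads are error-free substrings drawn uniformly at random, the features are relative k-mer frequencies, and the names `simulate_reads` and `kmer_features` are hypothetical. The random forest training step is omitted.

```python
import random
from collections import Counter
from itertools import product


def simulate_reads(genome, read_length, n_reads, seed=0):
    """Draw fixed-length, error-free substrings uniformly at random (assumed read model)."""
    rng = random.Random(seed)
    last_start = len(genome) - read_length
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_length] for s in starts]


def kmer_features(read, k=2):
    """Relative frequency of every DNA k-mer, in fixed alphabetical order (assumed features)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    n_windows = max(len(read) - k + 1, 1)
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    return [counts[kmer] / n_windows for kmer in kmers]


# Toy genome; a real workflow would read genomes from sequence files.
genome = "ACGTTGCAAGCTTAGGCTAACGTTAGC" * 4
reads = simulate_reads(genome, read_length=20, n_reads=5)
features = [kmer_features(read) for read in reads]
```

Each read becomes a fixed-length numeric vector, which is the form a classifier such as a random forest would take as input together with the phenotype label of the source genome.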

**Figure 2. Read predictions for all HPs (top) and non-HPs (bottom).**
Each bar shows the number of reads predicted to be of HP (red), non-HP (blue) or unknown (gray) origin. The left-most bars show the ground truth. Strikingly, Bowtie2, Pathoscope2 and Kraken fail to classify the majority of the reads. Kraken-16 and BLAST still miss a considerable fraction of reads whereas the machine learning based approaches always return a prediction. All methods show true and false predictions to a varying extent. While PaPrBaG shows similar errors for both HPs and non-HPs, all other methods suffer from a substantial bias. Few reads from HPs are falsely classified as non-HPs. Conversely, for non-HPs, the number of falsely classified reads is similar to or even exceeds the number of correctly classified reads.

**Figure 3. Performance effects of taxonomic complexity.**
This figure re-analyses the method performance presented in Table 1, taking into account the complexity of the respective taxonomic environment. The first set of columns represents those test genomes with one or more training genomes from the same genus that all show a matching phenotype (291 cases), the second set those with at least one training genome for each phenotype from the same genus (51 cases), and the rightmost set those with no other member of the same genus in the training set (80 cases). PaPrBaG maintains stable accuracy in all three settings, clearly outperforming the other approaches when closely related species have different phenotypes or no closely related species exist at all.

**Figure 4. Classification performance for different genome coverages.**
As coverage decreases, so does the performance of Bowtie2, Pathoscope2 and Kraken. Conversely, BLAST and PaPrBaG still deliver sound results at coverages as low as 0.001. The triangles show results for the consensus filter when combining PaPrBaG and Kraken. The consensus filter achieves high performance at all coverage levels, but at the cost of filtering out more and more data.
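The trade-off of a consensus filter can be sketched as follows. The caption does not define the filter, so this is an assumption: a label is kept only when both methods return the same label, and anything unclassified by either method (`None`) is discarded. The function name and the toy labels are hypothetical.

```python
def consensus_filter(pred_a, pred_b):
    """Keep a label only where both methods agree; otherwise return None (filtered out)."""
    return [a if a is not None and a == b else None
            for a, b in zip(pred_a, pred_b)]


# Toy labels for five items; None stands for "no prediction".
paprbag_labels = ["HP", "HP", "non-HP", "non-HP", "HP"]
kraken_labels = ["HP", None, "non-HP", "HP", None]

consensus = consensus_filter(paprbag_labels, kraken_labels)
retained_fraction = sum(label is not None for label in consensus) / len(consensus)
```

In this toy case only two of five items survive the filter, which mirrors the cost described in the caption: agreement tends to raise reliability while shrinking the amount of data that receives a prediction.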

**Figure 5. Fidelity of prediction certainty.**
Each prediction is associated with uncertainty. Here, we pooled predictions within each certainty interval and measured the prediction performance (MCC). PaPrBaG, Kraken and BLAST show a steady increase in performance with increasing certainty. PaPrBaG achieves the highest MCC among all methods compared.
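Pooling predictions by certainty and scoring each pool can be sketched with the standard definition of the Matthews correlation coefficient (MCC). How certainty is computed and which intervals were used are not stated in the caption, so the `(certainty, predicted_hp, true_hp)` record layout, the bin edges and the function names are assumptions for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def mcc_by_certainty(records, edges):
    """Pool (certainty, predicted_hp, true_hp) records into bins and score each bin.

    Bins are half-open [lo, hi), except the last one, which includes its upper edge.
    """
    scores = {}
    for lo, hi in zip(edges, edges[1:]):
        tp = tn = fp = fn = 0
        for certainty, predicted_hp, true_hp in records:
            in_bin = lo <= certainty < hi or (hi == edges[-1] and certainty == hi)
            if not in_bin:
                continue
            if predicted_hp and true_hp:
                tp += 1
            elif not predicted_hp and not true_hp:
                tn += 1
            elif predicted_hp:
                fp += 1
            else:
                fn += 1
        scores[(lo, hi)] = mcc(tp, tn, fp, fn)
    return scores


# Toy records: low-certainty predictions are mixed, high-certainty ones are correct.
records = [
    (0.55, True, False), (0.60, True, True), (0.60, False, False), (0.60, False, True),
    (0.95, True, True), (0.90, False, False),
]
scores = mcc_by_certainty(records, edges=[0.5, 0.75, 1.0])
```

A method whose certainty is informative should show MCC rising from the low-certainty bin to the high-certainty bin, as in this toy data.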

**Figure 6. Classification with minimum detection threshold.**
Predictions are only made for genomes where the read evidence supporting a phenotype exceeds the detection threshold (given relative to the total number of reads). Initially, most approaches show high informedness, which is a joint measure of sensitivity and specificity defined as I = TPR + TNR − 1. As the detection threshold is increased above 0.1, Bowtie2, Pathoscope2 and Kraken yield insufficient numbers of reads with phenotype evidence and they are no longer informative. Only PaPrBaG and BLAST show an informedness above 0.5. For most values of the detection threshold, PaPrBaG exhibits the highest informedness.
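The informedness formula I = TPR + TNR − 1 and the idea of a detection threshold can be sketched as below. The formula is taken from the caption. The genome-level decision rule is an assumption: a call is made for the phenotype with more supporting reads, and only if that support, relative to all reads, exceeds the threshold. The function names are hypothetical.

```python
def informedness(tp, tn, fp, fn):
    """I = TPR + TNR - 1, with a rate set to 0.0 when its denominator is zero."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr + tnr - 1


def call_genome(n_hp_reads, n_nonhp_reads, n_total_reads, threshold):
    """Return 'HP', 'non-HP', or None when neither phenotype has enough read support.

    Assumed rule: the better-supported phenotype wins, provided its share of
    all reads exceeds the detection threshold.
    """
    hp_fraction = n_hp_reads / n_total_reads
    nonhp_fraction = n_nonhp_reads / n_total_reads
    if max(hp_fraction, nonhp_fraction) <= threshold:
        return None
    return "HP" if hp_fraction >= nonhp_fraction else "non-HP"


# A method that classifies only 4% of reads makes no call at threshold 0.1,
# while a method that classifies most reads still does.
sparse_call = call_genome(30, 10, 1000, threshold=0.1)
dense_call = call_genome(600, 100, 1000, threshold=0.1)
example_score = informedness(tp=8, tn=9, fp=1, fn=2)
```

Informedness ranges from −1 to 1, with 0 corresponding to uninformative predictions, so methods that leave most reads unclassified lose the ability to make calls as the threshold rises.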
