PLoS Genet. 2018 May 8;14(5):e1007333.
doi: 10.1371/journal.pgen.1007333. eCollection 2018 May.

Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella enterica

Nicole E Wheeler et al. PLoS Genet. 2018.

Abstract

Emerging pathogens are a major threat to public health; however, understanding how pathogens adapt to new niches remains a challenge. New methods are urgently required to provide functional insights into pathogens from the massive genomic data sets now being generated by routine pathogen surveillance for epidemiological purposes. Here, we measure the burden of atypical mutations in protein-coding genes across independently evolved Salmonella enterica lineages, and use these measurements as input to train a random forest classifier to identify strains associated with extraintestinal disease. Members of the species fall along a continuum, from pathovars that cause gastrointestinal infection and low mortality, associated with a broad host range, to those that cause invasive infection and high mortality, associated with a narrowed host range. Our random forest classifier learned to perfectly discriminate long-established gastrointestinal and invasive serovars of Salmonella. Additionally, it was able to discriminate recently emerged Salmonella Enteritidis and Typhimurium lineages associated with invasive disease in immunocompromised populations in sub-Saharan Africa, and to detect within-host adaptation to invasive infection. We dissect the architecture of the model to identify the genes that were most informative of phenotype, revealing a common theme of degradation of metabolic pathways in extraintestinal lineages. This approach accurately identifies patterns of gene degradation and diversifying selection specific to invasive serovars that have been captured by more labour-intensive investigations, but it can be readily scaled to larger analyses.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of the approach employed in this study.
Fig 2. A subset of Salmonella genes are strongly indicative of invasive potential.
A: Out-of-bag votes for the phenotype of each serovar cast by each model. Model 1 was built using all predictor variables; each successive model was built by sparsity pruning of the previous model’s predictor variables. Model 5 is the final model, with 100% accuracy. Out-of-bag votes include only those votes cast by trees that were not trained on a given sample. The dashed grey line indicates the voting threshold for classifying an isolate as invasive. Invasive serovars are coloured in red and gastrointestinal serovars are coloured in blue. B: Of all genes used in the original training dataset, a small minority are given high importance in identifying invasive strains. Variable importance is shown for the top 1000 genes used in the original training set. Variable importance was measured as the average decrease in Gini index in a random forest model trained on all orthologous groups that met the inclusion criteria (N = 6,438). C: Functional categories associated with the top predictive genes. D: Mutations in mrcB (penicillin-binding protein 1b), one of the top three predictors. Mutations in different strains are colour-coded, with bars in red indicating a mutation in an extraintestinal strain and bars in blue indicating a mutation in a gastrointestinal strain. An estimate of the effect of the mutation on protein function (DeltaBS) is shown on the y-axis, with positive values indicating a higher chance that a mutation impacts protein function. The x-axis represents position along the length of the protein.
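The out-of-bag voting rule described in panel A can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy vote matrix, the in-bag mask, and the 0.5 threshold are all assumptions made for illustration (the caption mentions a voting threshold but does not give its value).

```python
def oob_vote_fraction(tree_votes, in_bag):
    """Return, per sample, the fraction of out-of-bag trees voting 'invasive'.

    tree_votes[t][i] is 1 if tree t votes invasive for sample i, else 0.
    in_bag[t][i] is True if sample i was in the bootstrap sample of tree t.
    A sample that was in-bag for every tree gets None (no OOB votes exist).
    """
    n_trees = len(tree_votes)
    n_samples = len(tree_votes[0])
    fractions = []
    for i in range(n_samples):
        # Keep only votes from trees that never saw sample i during training.
        votes = [tree_votes[t][i] for t in range(n_trees) if not in_bag[t][i]]
        fractions.append(sum(votes) / len(votes) if votes else None)
    return fractions


def classify(fractions, threshold=0.5):
    """Label each sample; the 0.5 threshold is an assumed value."""
    labels = []
    for f in fractions:
        if f is None:
            labels.append(None)
        else:
            labels.append("invasive" if f > threshold else "gastrointestinal")
    return labels


# Toy data (hypothetical): 4 trees, 3 samples.
tree_votes = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 1, 1]]
in_bag = [
    [True, False, False],
    [False, False, True],
    [False, True, False],
    [False, False, True],
]
fractions = oob_vote_fraction(tree_votes, in_bag)
labels = classify(fractions)
```

Because each sample is scored only by trees that did not train on it, the resulting vote fractions behave like a built-in held-out estimate and need no separate test split.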
Fig 3. Voting of the model on African iNTS and global gastrointestinal isolates.
A: Maximum likelihood phylogeny of all S. Enteritidis isolates included in the study, annotated with invasiveness ranking and clade (note: Outlier refers to the distinct sister clade of the global epidemic strains identified by [48], while Other refers to strains that do not belong to a named clade). B: Invasiveness indices for African and non-African clades of Salmonella. Lower and upper boundaries of the boxplots correspond to the 25th and 75th quantiles. C: The proportion of isolates from each tested dataset carrying a hypothetically attenuated coding sequence (HAC, defined by a DeltaBS > 3 relative to the reference serovar). Genes are ordered by the amount of degradation observed in African clades. African strains are shown on the positive y-axis in darker grey; global strains are shown on the negative y-axis in lighter grey.
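The HAC proportions plotted in panel C can be sketched as follows. Only the DeltaBS > 3 cutoff comes from the caption; the function name, the isolate and gene labels, the scores, and the choice to count only isolates that have a score for a given gene are hypothetical.

```python
HAC_THRESHOLD = 3.0  # from the caption: DeltaBS > 3 relative to the reference serovar


def hac_proportions(delta_bs_by_isolate):
    """Map each gene to the proportion of isolates carrying a HAC.

    delta_bs_by_isolate: {isolate_name: {gene_name: DeltaBS score}}.
    Assumption: the denominator for a gene is the number of isolates
    that have a score for that gene.
    """
    hac_counts = {}
    totals = {}
    for scores in delta_bs_by_isolate.values():
        for gene, dbs in scores.items():
            totals[gene] = totals.get(gene, 0) + 1
            if dbs > HAC_THRESHOLD:
                hac_counts[gene] = hac_counts.get(gene, 0) + 1
    return {gene: hac_counts.get(gene, 0) / totals[gene] for gene in totals}


# Hypothetical toy scores, not data from the study.
african_isolates = {
    "iso1": {"geneA": 5.2, "geneB": 0.1},
    "iso2": {"geneA": 4.0, "geneB": 3.5},
}
global_isolates = {
    "iso3": {"geneA": 0.0, "geneB": 0.4},
    "iso4": {"geneA": 3.1, "geneB": -1.0},
}

african = hac_proportions(african_isolates)
global_props = hac_proportions(global_isolates)

# Order genes by degradation in the African set, as the caption describes.
gene_order = sorted(african, key=lambda g: african[g], reverse=True)
```

Plotting `african` upward and `global_props` downward in `gene_order` would give the mirrored bar layout the caption describes.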
Fig 4. Invasiveness indices and DeltaBS (DBS) values for isolates collected during long-term invasive infection of an immunocompromised patient provide evidence for parallel adaptation.
Black points show the increase in the invasiveness index over time. Boxplots show a significant shift in DBS distribution over the duration of carriage for genes selected by our model, which was built from well-characterised invasive serovars, as compared to the rest of the proteome. Isolates are from [10]. DBS distributions for 2001 have been pooled, but are representative of all three isolates individually. The y-axis for DBS values has been truncated for better visualisation.

References

    1. Frank SA, Schmid-Hempel P. Mechanisms of pathogenesis and the evolution of parasite virulence. J Evol Biol. 2008;21: 396–404. doi: 10.1111/j.1420-9101.2007.01480.x
    2. Fauci AS, Morens DM. The perpetual challenge of infectious diseases. N Engl J Med. 2012;366: 454–461. doi: 10.1056/NEJMra1108296
    3. Pallen MJ, Wren BW. Bacterial pathogenomics. Nature. 2007;449: 835–842. doi: 10.1038/nature06248
    4. Loman NJ, Pallen MJ. Twenty years of bacterial genome sequencing. Nat Rev Microbiol. 2015;13: 787–794. doi: 10.1038/nrmicro3565
    5. McNally A, Thomson NR, Reuter S, Wren BW. “Add, stir and reduce”: Yersinia spp. as model bacteria for pathogen evolution. Nat Rev Microbiol. 2016;14: 177–190. doi: 10.1038/nrmicro.2015.29
