mBio. 2022 Feb 22;13(1):e0379621.
doi: 10.1128/mbio.03796-21. Epub 2022 Feb 15.

Genome Informatics and Machine Learning-Based Identification of Antimicrobial Resistance-Encoding Features and Virulence Attributes in Escherichia coli Genomes Representing Globally Prevalent Lineages, Including High-Risk Clonal Complexes

Sabiha Shaik et al. mBio.

Abstract

Escherichia coli, a ubiquitous commensal/pathogenic member of the Enterobacteriaceae family, accounts for a high infection burden, morbidity, and mortality throughout the world. With multidrug resistance (MDR) emerging on a massive scale, E. coli has been listed as one of the Global Antimicrobial Resistance and Use Surveillance System (GLASS) priority pathogens. Understanding the resistance mechanisms and underlying genomic features appears to be of utmost importance for tackling the further spread of these multidrug-resistant superbugs. While a few of the globally prevalent sequence types (STs) of E. coli, such as ST131, ST69, ST405, and ST648, have previously been reported to be highly virulent and to harbor MDR, it remains unclear whether certain ST lineages have a greater propensity to acquire MDR. In this study, large-scale comparative genomics of a total of 5,653 E. coli genomes from 19 ST lineages revealed ST-wide prevalence patterns of genomic features, such as antimicrobial resistance (AMR)-encoding genes/mutations, virulence genes, integrons, and transposons. Interpretation of the importance of these features using a Random Forest Classifier trained with 11,988 genomic features from whole-genome sequence data identified ST-specific or phylogroup-specific signature proteins mostly belonging to different protein superfamilies, including the toxin-antitoxin systems. Our study provides a comprehensive understanding of a myriad of genomic features, ST-specific proteins, and resistance mechanisms associated with different lineages of E. coli at the level of genomes; this could be of significant downstream importance in understanding the mechanisms of AMR, in clinical discovery, in epidemiology, and in devising control strategies.

IMPORTANCE With the leap in whole-genome data being generated, the application of relevant methods to mine biologically significant information from microbial genomes is of utmost importance to public health genomics. Machine-learning methods have been used not only to mine, curate, or classify the data but also to identify the relevant features that could be linked to a particular class/target. This is perhaps one of the pioneering studies to attempt to classify a large repertoire of E. coli genome data sets (5,653 genomes) belonging to 19 different STs (including well-studied as well as understudied STs) using machine-learning approaches. Important features identified by these approaches have revealed ST-specific signature proteins, which could be further studied to predict possible associations with the phenotypic profiles, thereby providing a better understanding of virulence and the resistance mechanisms among different clonal lineages of E. coli.
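
The idea of ranking genomic features by their importance to an ST classifier can be illustrated with a minimal sketch. This is not the authors' pipeline: the ST labels, feature names, and presence/absence matrix below are hypothetical toy data, and scikit-learn's impurity-based `feature_importances_` is assumed here as one possible importance measure (the abstract does not state which was used).

```python
import random

from sklearn.ensemble import RandomForestClassifier

random.seed(0)

# Hypothetical toy setup: 3 STs, each with one "signature" protein
# present only in that ST, plus 3 uninformative noise features.
sts = ["ST_A", "ST_B", "ST_C"]
feature_names = ["sig_A", "sig_B", "sig_C", "noise_1", "noise_2", "noise_3"]

X, y = [], []
for st_index, st in enumerate(sts):
    for _ in range(20):  # 20 toy genomes per ST
        signature = [1 if i == st_index else 0 for i in range(3)]
        noise = [random.randint(0, 1) for _ in range(3)]
        X.append(signature + noise)  # one presence/absence row per genome
        y.append(st)

# Train a Random Forest to predict the ST from presence/absence features.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Rank features by impurity-based importance (highest first).
importance = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importance, key=importance.get, reverse=True)

for name in ranked:
    print(f"{name}: {importance[name]:.3f}")
```

In this toy setting the signature features should dominate the ranking, which mirrors how highly ranked features in a real model might point to ST-specific proteins worth validating.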

Keywords: AMR surveillance; Escherichia coli; bacterial evolution; bioinformatics; genomics; machine learning; molecular epidemiology; sequence types; virulence.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

FIG 1
Heat map depicting the resistome profiles of the 5,653 genomes from 19 different STs. Gene names are represented on the y axis and the ST lineage on the x axis. The % presence of each of these genes at the ST level was calculated using the formula (presence in no. of genomes of ST/total no. of genomes in ST) × 100 and plotted using the matplotlib module. The color key represents % presence.
FIG 2
Heat map depicting the virulome profile of the 5,653 genomes from 19 different STs. Gene names are given on the y axis and the ST lineage on the x axis. The % presence of each of these genes at the ST level was calculated using the formula (presence in no. of genomes of ST/total no. of genomes in ST) × 100. The color key represents % presence.
FIG 3
Heat map depicting the prevalence of genes linked with secretion systems in the 5,653 genomes from 19 different STs. Gene names are mentioned on the y axis and the ST lineage on the x axis. The % presence of each of these genes at the ST level was calculated using the formula (presence in no. of genomes of ST/total no. of genomes in ST) × 100. The color key represents % presence.
FIG 4
(A) Bar plot depicting the number of genomes harboring class 1 integrons among the 19 STs considered under this study. (B) Sunburst plot depicting the prevalence of transposons among the 19 STs.
FIG 5
Principal coordinate analysis (PCoA) plot (with the top 2 PCoA axes) showing the clustering of genomes from different STs. A nested scree plot within the image depicts the proportion of variance explained by the top 10 principal components (PCs).
FIG 6
Cluster map depicting the prevalence of validated important features from the RF model among the 19 STs considered under this study. The % presence of each of the proteins at the ST level was calculated using the formula (presence in no. of genomes of ST/total no. of genomes in ST) × 100. Feature names are represented on the y axis and the ST lineages on the x axis. The color key represents % presence.
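
The % presence formula used in the captions above, (presence in no. of genomes of ST/total no. of genomes in ST) × 100, can be sketched in a few lines of Python. The function name, input layout, and the toy genomes and gene assignments below are assumptions made for illustration; they are not data or code from the study.

```python
from collections import defaultdict


def percent_presence(genome_to_st, genome_to_genes):
    """Return {gene: {st: % of genomes in that ST carrying the gene}}."""
    st_totals = defaultdict(int)                    # genomes per ST
    counts = defaultdict(lambda: defaultdict(int))  # gene -> ST -> genomes with gene
    for genome, st in genome_to_st.items():
        st_totals[st] += 1
        # set() so a gene listed twice for one genome is counted once
        for gene in set(genome_to_genes.get(genome, ())):
            counts[gene][st] += 1
    return {
        gene: {
            st: 100.0 * per_st.get(st, 0) / total
            for st, total in st_totals.items()
        }
        for gene, per_st in counts.items()
    }


# Hypothetical toy input: six genomes assigned to two STs.
genome_to_st = {
    "g1": "ST131", "g2": "ST131", "g3": "ST131", "g4": "ST131",
    "g5": "ST69", "g6": "ST69",
}
genome_to_genes = {
    "g1": ["blaCTX-M-15", "tetA"],
    "g2": ["blaCTX-M-15"],
    "g3": ["blaCTX-M-15"],
    "g4": [],
    "g5": ["tetA"],
    "g6": [],
}

table = percent_presence(genome_to_st, genome_to_genes)
for gene, row in sorted(table.items()):
    print(gene, row)
```

The resulting gene-by-ST table has genes as rows and STs as columns, the same layout the captions describe for the heat maps, so it could be handed to a plotting library such as matplotlib.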
