mSystems. 2024 Jan 23;9(1):e0002623.
doi: 10.1128/msystems.00026-23. Epub 2023 Dec 11.

A bacterial sensor taxonomy across earth ecosystems for machine learning applications

Helen Park et al. mSystems.

Abstract

Microbial communities have evolved to colonize all ecosystems of the planet, from the deep sea to the human gut. Microbes survive by sensing, responding, and adapting to immediate environmental cues. This process is driven by signal transduction proteins such as histidine kinases, which use their sensing domains to bind or otherwise detect environmental cues and "transduce" signals to adjust internal processes. We hypothesized that an ecosystem's unique stimuli leave a sensor "fingerprint" that can identify the ecosystem and shed light on its conditions. To test this, we collected 20,712 publicly available metagenomes from Host-associated, Environmental, and Engineered ecosystems across the globe. We extracted and clustered the collection's nearly 18M unique sensory domains into 113,712 similar groupings with MMseqs2. We built gradient-boosted decision tree machine learning models and found we could classify the ecosystem type (accuracy: 87%) and predict the levels of different physical parameters (R² score: 83%) using the sensor cluster abundance as features. Feature importance enables identification of the sensors most predictive of differences between ecosystems, which can lead to mechanistic interpretations if the sensor domains are well annotated. To demonstrate this, a machine learning model was trained to predict patients' disease states and used to identify domains related to oxygen sensing that are present in a healthy gut but missing in patients with abnormal conditions. Moreover, since 98.7% of identified sensor domains are uncharacterized, importance ranking can be used to prioritize sensors for determining what ecosystem function they may be sensing. Furthermore, these new predictive sensors can function as targets for novel sensor engineering with applications in biotechnology, ecosystem maintenance, and medicine.

IMPORTANCE

Microbes infect, colonize, and proliferate due to their ability to sense and respond quickly to their surroundings. In this research, we extract the sensory proteins from a diverse range of environmental, engineered, and host-associated metagenomes. We trained machine learning classifiers using sensors as features such that it is possible to predict the ecosystem for a metagenome from its sensor profile. We use the optimized model's feature importance to identify the most impactful and predictive sensors in different environments. We next use the sensor profile from human gut metagenomes to classify their disease states and explore which sensors can explain differences between diseases. The sensors most predictive of environmental labels here, most of which correspond to uncharacterized proteins, are a useful starting point for the discovery of important environmental signals and the development of possible diagnostic interventions.

Keywords: feature importance; histidine kinase; human microbiome; machine learning; metagenomics; sensory transduction processes.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig 1
Overview of the data collection, sensory protein extraction, and curation of the matrix for machine learning. Experimental design. We first surveyed 20,712 metagenomes spanning over 75 ecosystems. We identified each metagenome’s HK proteins using two Pfam conserved domains (“CD”) and then extracted just the sensory domain from each protein using Pfam sensor domain annotations. We clustered 21,984,304 total proteins using MMseqs2 with 17,703,911 unique sensory domain sequences, leading to 113,186 clusters that may each respond to unique stimuli. We focused on clusters that were present in at least 100 metagenomes, dropping any rare features, leading to 14,990 final clusters for analysis for our ML models. These sequence clusters were used for machine learning, hierarchical clustering, and ecosystem sensor taxonomy. In the text, “full-length” proteins contain all sensory domains and conserved domains, while “sensor” references just the sensor domain of the HK protein. Both full-length proteins and isolated sensors used for clustering are protein-coding genes.
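
The prevalence filter described in this caption (keeping only clusters present in at least 100 metagenomes) could be sketched roughly as below. This is an illustration under stated assumptions, not the authors' code: the dict-of-dicts abundance layout, the function name `filter_by_prevalence`, and the toy threshold of 2 are all hypothetical.

```python
from collections import Counter


def filter_by_prevalence(abundance, min_metagenomes=100):
    """Keep clusters observed (count > 0) in at least `min_metagenomes` samples.

    `abundance` maps metagenome ID -> {cluster ID: count}. This layout is a
    hypothetical stand-in for the sensor profile matrix, not the paper's format.
    """
    # Count in how many metagenomes each cluster appears at least once.
    prevalence = Counter()
    for counts in abundance.values():
        for cluster, count in counts.items():
            if count > 0:
                prevalence[cluster] += 1

    kept = {c for c, n in prevalence.items() if n >= min_metagenomes}

    # Drop the rare clusters from every metagenome's profile.
    return {
        sample: {c: n for c, n in counts.items() if c in kept}
        for sample, counts in abundance.items()
    }


# Toy data, filtered with a threshold of 2 instead of 100.
toy = {
    "mg1": {"c1": 3, "c2": 1},
    "mg2": {"c1": 5, "c3": 2},
    "mg3": {"c1": 1, "c2": 0},
}
filtered = filter_by_prevalence(toy, min_metagenomes=2)
```

In the toy data only `c1` occurs in two or more metagenomes, so it is the only cluster retained; a zero count is treated as absence.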
Fig 2
Similar sensory-domain clusters can predictably organize ecosystem types by increasing the complexity over Pfam identifiers. (a) The count of sensory clusters associated with each Pfam sensor domain. Many Pfams have a large number of clusters associated with them; for example, pfam13426 (PAS domain) is present in almost 30,000 clusters. The bottom 20 Pfam domains are summed in the final bar of the figure. (b) The number of HK proteins that have a given number of sensory domains; 43% of proteins have one sensory domain. The graph is split into two ranges to visualize all bars. (c) Heatmap (log2 scale) built using hierarchical clustering. Similar ecosystems group together in the heatmap, indicated with black bars, confirming that the information in the sensor profile matrix lends a predictable structure to the data set. For example, the Environmental:Aquatic metagenomes group together, as do the Environmental:Terrestrial and Host-associated:Human metagenomes. The Y-axis is ecosystems, and the X-axis is MMseqs2 sensory domain clusters, using the same abundance matrix used in ML training in later sections.
Fig 3
Relative Ecosystem Richness, Ecosystem Typical Sample Richness, and Sensor Fraction using MMseqs2 clusters. (a) Scatter plot of the ecosystem sensor Relative Ecosystem Richness (the proportion of total gene cluster diversity found within that ecosystem) and Ecosystem Typical Sample Richness (the proportion of the ecosystem’s gene cluster diversity typically found in a single sample). The size of each point is proportional to the number of samples taken from that ecosystem, indicating the extent of sampling. Generally, host-associated ecosystems display lower Relative Ecosystem Richness, with the exception of Plant:Rhizosphere; however, this ecosystem is more similar to environmental ecosystems, as it represents the interface between roots and soil rather than an ecosystem contained within a host. More details can be found in Fig. S2. (b) Sensory fraction, the fraction of all proteins in a metagenome that are sensors. The fraction is shown as a percentage in the figure. Broadly speaking, we observe that individual samples can approach a ceiling, indicating that scaling sensing by individual sensors has some natural limit. Microbial populations may only energetically commit to making a certain fraction of sensory proteins from their protein pool. There is also a large range between ecosystems, from a mean of 0.05 (Human:Oral Cavity) to 0.5 (Human:Large Intestine).
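
The two richness measures defined in panel (a) could be computed along the following lines. This is a minimal sketch, not the authors' method: it assumes each sample is represented as a set of cluster IDs and reads "typically found in a single sample" as the median across samples, which is an assumption. The function name `richness_metrics` is hypothetical.

```python
from statistics import median


def richness_metrics(samples_by_ecosystem):
    """Return {ecosystem: (relative_ecosystem_richness, typical_sample_richness)}.

    `samples_by_ecosystem` maps ecosystem -> list of sets of cluster IDs,
    one set per sample (a hypothetical layout chosen for illustration).
    """
    # Total cluster diversity across every ecosystem.
    all_clusters = set()
    for samples in samples_by_ecosystem.values():
        for sample in samples:
            all_clusters |= sample

    result = {}
    for ecosystem, samples in samples_by_ecosystem.items():
        ecosystem_clusters = set().union(*samples)
        if not ecosystem_clusters:
            # No clusters observed: both proportions are defined as zero here.
            result[ecosystem] = (0.0, 0.0)
            continue
        # Share of all known clusters that occur in this ecosystem.
        relative = len(ecosystem_clusters) / len(all_clusters)
        # Share of the ecosystem's clusters seen in a typical (median) sample.
        typical = median(len(s) / len(ecosystem_clusters) for s in samples)
        result[ecosystem] = (relative, typical)
    return result


toy = {
    "gut": [{"a", "b"}, {"a"}],
    "soil": [{"a", "b", "c", "d"}, {"c", "d"}, {"d"}],
}
metrics = richness_metrics(toy)
```

In the toy data, `gut` holds two of the four known clusters, while `soil` holds all four, so `soil` has the higher Relative Ecosystem Richness.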
Fig 4
Feature importance of sensory clusters explains differences between gut and mouth. (a) t-SNE plot for all metagenomes in Host-associated:Human ecosystems using the sensor profile, colored by tissue. (b) Hierarchical clustering and heatmap for large intestine and oral cavity ecosystems reveal clear groupings and a few discriminatory features in Large Intestine. (c–e) Feature importance for Oral Cavity and Large Intestine. For feature importance, the rank of a feature indicates how impactful that feature is in the ecosystem classification. This method sorts the features by the sum of importance values across all metagenomes to understand the impact a feature has on the output class. The bar chart (c) sums the absolute value for both ecosystems, while the scatter plots are for Large Intestine (d) and Oral Cavity (e). In the SHAP force plots, the Y-axis is the feature’s importance rank, the X-axis is the feature importance value, and each dot represents one metagenome. A positive importance value (X-axis) will lead the model to select for the class, and a negative value to select against the class; the color spectrum represents the value of the feature compared to other classes (red: high abundance, blue: low abundance).
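
The ranking step described in this caption, sorting features by their summed absolute importance across metagenomes, could be sketched as follows. This is a sketch under assumptions: the per-metagenome importance values are taken as already computed (for example by a SHAP explainer), and the function name `rank_features` and the cluster labels are hypothetical.

```python
def rank_features(importance_rows, feature_names):
    """Rank features by the sum of absolute importance values across samples.

    `importance_rows` holds one row per metagenome and one signed importance
    value per feature; how those values were produced is outside this sketch.
    """
    totals = [0.0] * len(feature_names)
    for row in importance_rows:
        for j, value in enumerate(row):
            # Absolute values, so positive and negative pushes both count.
            totals[j] += abs(value)

    order = sorted(range(len(feature_names)), key=lambda j: totals[j], reverse=True)
    return [(feature_names[j], totals[j]) for j in order]


# Three toy metagenomes, three hypothetical sensor clusters.
rows = [
    [0.5, -0.25, 0.0],
    [-1.0, 0.25, 0.125],
    [0.5, -0.5, 0.0],
]
ranking = rank_features(rows, ["cluster_A", "cluster_B", "cluster_C"])
```

Summing absolute values keeps a feature that pushes strongly in both directions from cancelling itself out, which a plain signed sum would allow.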
Fig 5
Disease classes can be predicted from the sensor profile to make meaningful insights. (a) Hierarchical clustering across normal and disease conditions in Human:Large Intestine metagenomes using sensory profiles. This dendrogram, based on the similarity of the HK sensor profiles, shows that samples labeled as Adenoma (non-cancerous tumors) and Cancer form an outlying group and thus represent the most different samples and ecosystems in this set. Conversely, certain other ecosystems exhibited higher similarity, such as normal and weight loss, and type II diabetes and obesity (online supplementary file 7c). (b) A confusion matrix from the CatBoost classifier, with disease class as predicted label class. (c–e) Top features (clusters of sensory domains) in the Adenoma, Infant, and Normal gut classes. Features discussed in the text are indicated with red stars. (f) Cluster heatmap with X-axis as all gut metagenomes, Y-axis as all clusters that are correlated and anticorrelated to the sensor QseC. Acronyms in (a, b) correspond to disease classes: R Arthritis: Dysbiosis in Rheumatoid Arthritis; ETEC Chall: ETEC H10407 challenge study; Diabetes (II): type II diabetes; V. cholera: V. cholerae challenge study; Obese Dis: microbial dysbiosis in young adults with obesity; Weight loss: obese patients following a weight-loss intervention; Hadza: Hadza hunter-gatherer gut microbiota; U. colitis: ulcerative colitis fecal transplant; Sym ath: symptomatic atherosclerosis; Dendrogram: Distance between sensor profiles.
