Commun Biol. 2024 Dec 29;7(1):1709.
doi: 10.1038/s42003-024-07436-3.

Accurately predicting optimal conditions for microorganism proteins through geometric graph learning and language model

Mingming Zhu et al. Commun Biol. 2024.

Abstract

Proteins derived from microorganisms that survive in the harshest environments on Earth remain stably active under extreme conditions, providing rich resources for industrial applications and enzyme engineering. Because experimental determination is time-consuming, computational models are needed for fast and accurate prediction of protein optimal conditions. Previous studies were limited by data scarcity and by their neglect of protein structures. To address these problems, we constructed an up-to-date dataset of 175,905 non-redundant proteins and proposed a new model, GeoPoc, based on geometric graph learning, for predicting protein optimal temperature, pH, and salt concentration. GeoPoc leverages protein structures and sequence embeddings extracted from a pre-trained language model, and employs a geometric graph transformer network to capture sequence and spatial information. We first assessed robustness through in-house validation of optimal temperature prediction, achieving a Pearson correlation coefficient (PCC) of 0.78. The model was further validated on an independent test set, where GeoPoc surpassed the state-of-the-art method by 2.3% in AUC. Additionally, GeoPoc was extended to pH and salt concentration prediction, obtaining AUC scores of 0.78 and 0.77, respectively. Through further interpretability analysis, GeoPoc elucidates the critical physicochemical properties that contribute to enhancing protein thermostability.
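The two headline metrics in the abstract, PCC for regression and AUC for classification, can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code: the function names and the toy values are hypothetical, and the binary labels are assumed to come from thresholding the continuous optimum at some cut-off that the abstract does not state.

```python
import math

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative; ties count as half a win (O(P*N), fine for small examples)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): binary class labels and model scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise form of AUC is used here because it makes the definition explicit; a rank-based implementation would be preferable for datasets of the size reported above.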

Conflict of interest statement

Competing interests: Y.Y. is an Editorial Board Member for Communications Biology, but was not involved in the editorial review of, nor the decision to publish this article. All the other authors declare no competing interests.

Figures

Fig. 1. Overview of dataset preparation, and GeoPoc model architecture.
a The data collection and dataset preparation process. b The overall architecture of the GeoPoc model. ESM2.0 is used to extract the sequence embedding from the sequence, and the protein structure is taken from the AlphaFold2 database. After these are featurized as a protein graph, the graph is input to the GeoFormer module to obtain hidden embeddings. Finally, the hidden embeddings are pooled by the self-attention pooling layer, and the pooled representation is passed to the output MLP to predict the temperature, pH, and salt concentration. Note: SaltConc denotes salt concentration.
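The last two stages in panel b, self-attention pooling followed by an output MLP, can be sketched in pure Python as follows. This is only one common form of attention pooling (a learned scoring vector, softmax over residues, weighted sum); the caption does not specify the exact formulation GeoPoc uses, and every function name, weight, and dimension below is a hypothetical placeholder.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(node_embeddings, score_weights):
    """Collapse per-residue embeddings (L x D) into a single D-dim vector.

    Each residue gets a scalar score (dot product with score_weights);
    the softmax of the scores weights the sum over residues.
    """
    scores = [sum(w * h for w, h in zip(score_weights, emb))
              for emb in node_embeddings]
    alphas = softmax(scores)
    dim = len(node_embeddings[0])
    pooled = [sum(a * emb[d] for a, emb in zip(alphas, node_embeddings))
              for d in range(dim)]
    return pooled, alphas

def mlp_head(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a scalar linear output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Toy protein with two residues and 2-dim hidden embeddings (hypothetical).
toy_nodes = [[1.0, 2.0], [3.0, 4.0]]
toy_pooled, toy_alphas = attention_pool(toy_nodes, score_weights=[0.0, 0.0])
toy_output = mlp_head(toy_pooled,
                      w1=[[1.0, 0.0], [0.0, -1.0]], b1=[0.0, 0.0],
                      w2=[1.0, 1.0], b2=0.5)
```

With an all-zero scoring vector the attention weights are uniform, so the pooled vector reduces to the mean of the residue embeddings; trained weights would instead emphasise some residues over others.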
Fig. 2. Distribution of the optimal growth conditions for species and proteins across different condition ranges in three data sets.
a Optimal temperature distribution for species and proteins, with a temperature range from 4 °C to 105 °C. b Optimal pH distribution, ranging from 1.1 to 12. c Optimal salt concentration distribution, ranging from 0% to 37%. In each figure, blue bars represent species, and yellow bars represent proteins. Source data are provided as a Source Data file.
Fig. 3. The performance of GeoPoc and other models on protein optimal temperature prediction.
The regression correlation between predicted and ground truth values of optimal temperature for all proteins in 5-fold CV (a) and on the test set (b). The color bar in (a) and (b) uses a density-based color scale computed via Gaussian KDE, with a square root transformation applied to enhance visual clarity in regions with high data density. The MAE for the 5-fold CV is 6.402 ± 0.116 (a), while the MAE for the test set is 6.083 (b). c Performance comparison between GeoPoc and ablation methods on the test set using PCC. d Receiver Operating Characteristic curves of GeoPoc and comparison methods on the independent test set. e Comparison of GeoPoc and comparison methods on threshold-dependent metrics. f The regression correlation between ground truth and predicted values of optimal temperature (°C) for all species in the test set. Note: w/o denotes without. Source data are provided as a Source Data file.
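The density coloring and the MAE described for panels a and b can be approximated with the pure-Python sketch below. It assumes an isotropic Gaussian kernel with a fixed, hand-picked bandwidth, which is a simplification: the caption does not say how the bandwidth was chosen, and library KDE routines typically estimate it from the data. All names and toy values are hypothetical.

```python
import math

def kde_density(points, bandwidth):
    """2D Gaussian kernel density estimate evaluated at each input point.

    points: list of (ground_truth, prediction) pairs.
    bandwidth: isotropic kernel width (an assumption, see the text above).
    """
    n = len(points)
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * n)
    densities = []
    for (x, y) in points:
        total = 0.0
        for (u, v) in points:
            sq_dist = (x - u) ** 2 + (y - v) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(norm * total)
    return densities

def sqrt_color_values(densities):
    """Square-root transform that compresses the high-density range."""
    return [math.sqrt(d) for d in densities]

def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy (truth, prediction) pairs in degrees Celsius: three clustered, one isolated.
toy_points = [(30.0, 31.0), (31.0, 30.0), (30.5, 30.5), (80.0, 70.0)]
toy_colors = sqrt_color_values(kde_density(toy_points, bandwidth=2.0))
toy_mae = mean_absolute_error([30.0, 80.0], [32.0, 76.0])
```

The resulting values would be passed as the per-point color argument of a scatter plot; the isolated point receives the lowest value and the point at the center of the cluster the highest.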
Fig. 4. The performance of GeoPoc and baselines on protein optimal pH and salt concentration prediction.
Receiver Operating Characteristic curves of GeoPoc and the ablation methods of GeoPoc on the pH (a) and salt concentration (b) test sets. Sankey diagrams visualizing the flow between ground truth and prediction for GeoPoc (c, e) and GeoPoc (w/o geometric features) (d, f) on the pH and salt concentration test sets. Source data are provided as a Source Data file.
Fig. 5. Visualization of residue importance and conservation and aligned protein families.
a Protein secondary structure of one example (UID: M0HVG5), color-coded by residue importance: red represents higher importance and blue lower importance. b Amino acid frequency and average importance in the thermophilic proteins. c Residue conservation score and importance at each position of the A0A1I7KCI7 protein in the aligned protein families (DNA_helicase_UvrDPEP).
