2025 May 1;26(3):bbaf212. doi: 10.1093/bib/bbaf212

NNKcat: deep neural network to predict catalytic constants (Kcat) by integrating protein sequence and substrate structure with enhanced data imbalance handling


Jingchen Zhai et al. Brief Bioinform.

Abstract

The catalytic constant (Kcat) describes the efficiency of a catalyzed reaction. The Kcat value of an enzyme-substrate pair indicates the rate at which the enzyme converts substrate into product under saturating conditions. However, it is challenging to construct robust prediction models for this important property. Most existing models, including one recently published in Nature Catalysis (Li et al.), suffer from overfitting. In this study, we proposed a novel protocol for constructing Kcat prediction models, introducing an intermediate step in which substrate and protein processors are developed separately. The substrate processor analyzes Simplified Molecular Input Line Entry System (SMILES) strings with a graph neural network model, Attentive FP, while the protein processor abstracts protein sequence information using a long short-term memory (LSTM) architecture. This protocol not only mitigates the impact of data imbalance in the original dataset but also provides greater flexibility for customizing the general-purpose Kcat prediction model to enhance prediction accuracy for specific enzyme classes. Our general-purpose Kcat prediction model demonstrates significantly enhanced stability and slightly better accuracy (R2 of 0.54 versus 0.50) than Li et al.'s model on the same dataset. Additionally, our modeling protocol enables fine-tuning of the general-purpose Kcat model for specific enzyme categories through focused learning. Using Cytochrome P450 (CYP450) enzymes as a case study, we achieved a best R2 of 0.64 for the focused model. The high-quality performance and expandability of the model support its broad application in enzyme engineering and drug research and development.
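The two-processor protocol described above can be sketched as follows. This is a minimal, hypothetical illustration in plain NumPy: in the paper, the substrate processor is an Attentive FP graph neural network over SMILES and the protein processor is an LSTM over sequences. The stand-in functions, hash-seeded embeddings, and embedding dimensions here are assumptions for illustration only; only the concatenation ("feature augmentation") step mirrors the protocol.

```python
import numpy as np

def substrate_processor(smiles: str, dim: int = 64) -> np.ndarray:
    """Stand-in for the Attentive FP substrate processor: maps a SMILES
    string to a fixed-length embedding (a deterministic hash-seeded
    random vector, purely for illustration)."""
    seed = abs(hash(smiles)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def protein_processor(sequence: str, dim: int = 128) -> np.ndarray:
    """Stand-in for the LSTM protein processor."""
    seed = abs(hash(sequence)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def fuse(smiles: str, sequence: str) -> np.ndarray:
    # Feature augmentation: concatenate the substrate and protein
    # embeddings to form the input for the downstream Kcat regressor.
    return np.concatenate([substrate_processor(smiles),
                           protein_processor(sequence)])

x = fuse("CCO", "MKTAYIAK")  # toy substrate (ethanol) + toy sequence
print(x.shape)  # (192,)
```

In the actual protocol, the two processors are trained separately on Dataset #1 before their embeddings are combined, which is what allows the general-purpose model to be later fine-tuned for a specific enzyme class.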

Keywords: Kcat; data imbalance; deep neural network; enzyme turnover number; focused learning; machine learning.


Figures

Figure 1
A flowchart highlighting the key components of Kcat model development in this work. We first constructed the substrate processor and protein processor separately using Dataset #1 and then performed feature augmentation with both processors. Next, all generated feature embeddings were combined to train the general-purpose Kcat prediction models. Three parallel experiments were conducted during the training and testing of the general-purpose Kcat models. Last, Dataset #2A was applied to objectively evaluate the general-purpose Kcat models.
Figure 2
The distributions of Dataset #1 and Dataset #2A. Left: distributions of substrate molecular weights in the two datasets. Right: distributions of amino acid sequence lengths of the proteins. The scatter and half-violin plots display the frequency distribution for each data group. Black dots in the half-violin plots represent group means. The box plots illustrate the central tendency of the data, highlighting the medians and quartiles for both groups.
Figure 3
The performance of the substrate processor model. (A) RMSE for the training and test sets over the course of model training. The marked numbers are RMSE values for the best model (epoch 10). (B) Performance of the best model (epoch 10) on the training set. (C) Performance of the best model (epoch 10) on the test set.
Figure 4
The performance of the protein processor. (A) RMSE for the training and test sets over the course of model training. The marked numbers are RMSE values for the best model (epoch 27). (B) Performance of the best model (epoch 27) on the training set. (C) Performance of the best model (epoch 27) on the test set.
Figure 5
Performance of the top machine learning models, measured by the squared correlation coefficient (R2, left panel) and root-mean-square error (RMSE, right panel). Different random numbers were applied to divide Dataset #1 into training and test sets. A higher R2 indicates better correlation between predicted and experimental log2(Kcat) values in Dataset #1; a lower RMSE indicates lower prediction error of log2(Kcat). GPR: Gaussian process regression; NN: neural network; SVM: support vector machine; tree: decision tree.
Figure 6
Test set performance of the models generated under three random splits of Dataset #1. For comparison, the performance of Li et al.'s model constructed using the same dataset is as follows: random number 1357, R2 = 0.203; random number 1234, R2 = 0.516; random number 0103, R2 = 0.543. Note that we reproduced Li et al.'s model using the code they provided on GitHub.
Figure 7
Similarity between substrates and protein sequences from Dataset #2A and Dataset #1. Sequence similarity was calculated with the MUSCLE software.
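For context on what a pairwise sequence-identity figure means, here is a naive percent-identity sketch over two already-aligned sequences. This helper is hypothetical and illustrative only; the paper itself used MUSCLE for alignment and similarity calculation, which handles alignment and scoring far more rigorously.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity between two aligned sequences of equal length.
    Gap characters ('-') count as mismatches. Naive illustration only;
    the study used MUSCLE for its similarity calculations."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b and a != "-" for a, b in zip(aln_a, aln_b))
    return 100.0 * matches / len(aln_a)

print(round(percent_identity("MKT-LV", "MKTALV"), 1))  # 83.3
```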
Figure 8
Illustration of sequence difference between Dataset #1 and Dataset #2A in three different scenarios. Top: Scenario 1; bottom left: Scenario 2; bottom right: Scenario 3.
Figure 9
The performance of the three models on Dataset #2A for external validation. Sequence lengths, ranging from 174 to 1074 amino acids, are represented in different colors according to the figure legend. The dashed line represents the Y = X trendline. A total of 122 records from groups A and B, which include new sequence inputs, are represented by circles (●). Records from group C, where all sequences have exact matches in Dataset #1, are depicted as triangles (▲).
Figure 10
Application of the three general-purpose models to Dataset #2A records in groups A and B. The sequence identity between each sequence in Dataset #2A and its most similar sequence in Dataset #1 is colored according to the figure legend; identities range from 34.6% to 100%.
Figure 11
The performance of focused learning models from three parallel experiments. Each column represents an individual experiment. Panels A, B, and C display the distributions of Kcat values for the sublibrary, which was randomly split into eight groups. Panels D, E, and F show model performance under different data splitting conditions, with RMSE on the left axis and R2 on the right axis. Each individual RMSE and R2 value reflects model performance when a specific group is used as the validation set in the leave-one-group-out approach. The average RMSE and R2 values summarize performance as each group is sequentially used as the validation set.
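The leave-one-group-out scheme described in this caption can be sketched as below. The mean-predictor baseline standing in for `fit` is an assumption for illustration; the actual focused models are the fine-tuned neural networks described in the paper. Only the split-into-eight-groups / hold-one-out loop mirrors the evaluation procedure.

```python
import random
import statistics

def fit(train):
    # Placeholder "model": predict the mean target of the training records.
    # In the study, this step is fine-tuning the general-purpose Kcat model.
    return statistics.mean(y for _, y in train)

def rmse(prediction, valid):
    # Root-mean-square error of a constant prediction on the held-out group.
    return statistics.mean((prediction - y) ** 2 for _, y in valid) ** 0.5

def leave_one_group_out(records, n_groups=8, seed=0):
    # Shuffle once, deal records into n_groups round-robin, then hold out
    # each group in turn as the validation set.
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    scores = []
    for held in range(n_groups):
        train = [r for g, grp in enumerate(groups) if g != held for r in grp]
        scores.append(rmse(fit(train), groups[held]))
    return scores, statistics.mean(scores)

# Toy sublibrary: (enzyme-substrate pair id, target value) records.
records = [(f"pair{i}", float(i % 10)) for i in range(80)]
per_group, avg_rmse = leave_one_group_out(records)
print(len(per_group))  # 8
```

The per-group scores correspond to the individual RMSE bars in panels D-F, and `avg_rmse` to the averaged summary statistic.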

