JACS Au. 2024 Jul 17;4(9):3451-3465.
doi: 10.1021/jacsau.4c00271. eCollection 2024 Sep 23.

Bridging Machine Learning and Thermodynamics for Accurate pKa Prediction

Weiliang Luo et al. JACS Au.

Abstract

Integrating scientific principles into machine learning models to enhance their predictive performance and generalizability is a central challenge in the development of AI for Science. Herein, we introduce Uni-pKa, a novel framework that successfully incorporates thermodynamic principles into machine learning modeling, achieving high-precision predictions of acid dissociation constants (pKa), a crucial task in the rational design of drugs and catalysts, as well as a modeling challenge in computational physical chemistry for small organic molecules. Uni-pKa utilizes a comprehensive free energy model to represent molecular protonation equilibria accurately. It features a structure enumerator that reconstructs molecular configurations from pKa data, coupled with a neural network that functions as a free energy predictor, ensuring high-throughput, data-driven prediction while preserving thermodynamic consistency. Employing a pretraining-finetuning strategy with both predicted and experimental pKa data, Uni-pKa not only achieves state-of-the-art accuracy in chemoinformatics but also shows comparable precision to quantum mechanics-based methods.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Scheme 1. Schematic Overview of Uni-pKa Framework
(A) Data preparation workflow. We implement a microstate enumerator to systematically build the protonation ensemble from a single structure. (B) Pretraining workflow. Our pretraining strategy combines one weakly supervised task (pKa prediction) with three self-supervised tasks (masked atom prediction, masked charge prediction, and 3D position recovery) to make the most of the chemical information in 3 million microstate structures. In the pKa prediction task, we introduce a free energy-to-pKa (FE2pKa) module to establish the relationship between the model-predicted free energy and pKa. This module also enables us to predict pKa from free energies. (C) Finetuning workflow. In this phase, we also employ the FE2pKa module, training the model on experimental pKa values to enhance its accuracy in predicting pKa. (D) Inference workflow. After pretraining and finetuning, the trained Uni-pKa framework can handle three inference tasks: macro-pKa prediction, micro-pKa prediction, and distribution fraction prediction.
Scheme 2. Inference Stage of Uni-pKa
(A) Structures of microstates in the protonation ensemble of one reference molecule are reconstructed by the microstate generator. (B) The atom types, atomic charges, and geometry information of the microstates are fed into the Uni-Mol backbone, and a free energy is predicted for each microstate. (C) If the acid and base macrostates are specified by the user input, the macro-pKa-free-energy formula is used to transform the free energy predictions into a macro-pKa prediction. If the microstates are further specified, the micro-pKa-free-energy formula is used as a special case of the macro-pKa prediction in which each macrostate contains only one microstate. (D) If pH is given by the user input, the distribution-free-energy formula is used to calculate the fractions of all the microstates in the protonation ensemble.
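
The three inference formulas named in Scheme 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each microstate free energy is given in units of k_B T and is referenced to a pH 0 standard state for the proton, so that a macrostate's statistical weight is the sum of exp(-G) over its microstates. The function names (macro_pka, micro_pka, microstate_fractions) and the proton_counts argument are hypothetical and chosen only for illustration.

```python
import math

LN10 = math.log(10.0)


def _logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def macro_pka(acid_free_energies, base_free_energies):
    """Macro-pKa between an acid macrostate and its conjugate base macrostate.

    Assumption: free energies are in units of k_B T at a pH 0 reference, so
    pKa = log10(Z_acid / Z_base) with Z = sum(exp(-G)) over microstates.
    """
    ln_z_acid = _logsumexp([-g for g in acid_free_energies])
    ln_z_base = _logsumexp([-g for g in base_free_energies])
    return (ln_z_acid - ln_z_base) / LN10


def micro_pka(acid_free_energy, base_free_energy):
    """Micro-pKa: the macro-pKa formula with one microstate on each side."""
    return macro_pka([acid_free_energy], [base_free_energy])


def microstate_fractions(free_energies, proton_counts, ph):
    """Fraction of every microstate in the ensemble at a given pH.

    proton_counts[i] is the number of dissociable protons carried by
    microstate i (only differences between microstates matter). Each
    additional proton multiplies the weight by 10**(-pH).
    """
    log_weights = [
        -g - n * LN10 * ph for g, n in zip(free_energies, proton_counts)
    ]
    log_total = _logsumexp(log_weights)
    return [math.exp(w - log_total) for w in log_weights]
```

Under these assumptions, a single acid/base pair whose free energies differ by 4.5 ln 10 has a micro-pKa of 4.5, and at pH 4.5 the two microstates are equally populated. Working in log space keeps the sums stable when free energies differ by many k_B T.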
Figure 1. Uni-pKa’s attention to detailed acid–base equilibria. (A) Example of 2-hydroxybenzoic acid, where one of the dissociations is dominant. (B) Example of 2-((dimethylamino)methyl)phenol, where both reactions are dominant. (C) Uni-pKa results on SAMPL6 micro-pKa data sets involving tautomerism. (D) Thermodynamic cycle of glycine. pKi is the dissociation equilibrium constant. The green and orange arrows indicate different protonation routes.
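
The thermodynamic consistency behind a cycle like the one in panel (D) can be written out as a short derivation. This is a sketch under stated assumptions, not the paper's notation: free energies G are in units of k_B T at a pH 0 reference, and the microstate labels (cat for the cation, zw for the zwitterion, neu for the neutral form, an for the anion) and the step labels pK_1, pK_2, pK_1', pK_2' are hypothetical.

```latex
% Micro-pKa of a single deprotonation step X -> Y + H^+ (assumed convention)
\mathrm{p}K_{X \to Y} = \frac{G_Y - G_X}{\ln 10}

% Route via the zwitterion
\mathrm{p}K_1 + \mathrm{p}K_2
  = \frac{(G_{\mathrm{zw}} - G_{\mathrm{cat}}) + (G_{\mathrm{an}} - G_{\mathrm{zw}})}{\ln 10}
  = \frac{G_{\mathrm{an}} - G_{\mathrm{cat}}}{\ln 10}

% Route via the neutral form
\mathrm{p}K_1' + \mathrm{p}K_2'
  = \frac{(G_{\mathrm{neu}} - G_{\mathrm{cat}}) + (G_{\mathrm{an}} - G_{\mathrm{neu}})}{\ln 10}
  = \frac{G_{\mathrm{an}} - G_{\mathrm{cat}}}{\ln 10}

% Hence the cycle closes for any assignment of microstate free energies
\mathrm{p}K_1 + \mathrm{p}K_2 = \mathrm{p}K_1' + \mathrm{p}K_2'
```

Because every micro-pKa is a difference of per-microstate free energies, the intermediate terms cancel and both routes give the same sum. A model that predicts each micro-pKa independently would not satisfy this identity automatically.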
