Improved Machine Learning Predictions of EC50s Using Uncertainty Estimation from Dose-Response Data

Hugo Bellamy¹, Joachim Dickhaut², Ross D King¹

Affiliations

¹ Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB2 1TN, United Kingdom of Great Britain and Northern Ireland.
² BASF, Ludwigshafen 67056, Germany.

PMID: 40384077
PMCID: PMC12152940
DOI: 10.1021/acs.jcim.5c00249

Hugo Bellamy et al. J Chem Inf Model. 2025 Jun 9;65(11):5623-5634. doi: 10.1021/acs.jcim.5c00249. Epub 2025 May 19.

Abstract

In early-stage drug design, machine learning models often rely on compressed representations of data, where raw experimental results are distilled into a single metric per molecule through curve fitting. This process discards valuable information about the quality of the curve fit. In this study, we incorporated a fit-quality metric into machine learning models to capture the reliability of metrics for individual molecules. Using 40 data sets from PubChem (public) and BASF (private), we demonstrated that including this quality metric can significantly improve predictive performance without additional experiments. Four methods were tested: random forests with parametric bootstrap, weighted random forests, variable output smearing random forests, and weighted support vector regression. When using fit-quality metrics, at least one of these methods led to a statistically significant improvement on 31 of the 40 data sets. In the best case, these methods led to a 22% reduction in the root-mean-squared error of the models. Overall, our results demonstrate that by adapting data processing to account for curve fit quality, we can improve predictive performance across a range of different data sets.
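
As a rough illustration of how fit-quality information might enter a model, the sketch below evaluates a four-parameter Hill curve, turns per-molecule standard errors into sample weights, and draws parametric-bootstrap label sets. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the inverse-variance weighting scheme, the Gaussian noise model, and the example values are all hypothetical and are not taken from the paper.

```python
import random


def hill(conc, bottom, top, ec50, slope):
    """Four-parameter Hill curve; conc and ec50 must be positive."""
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** slope)


def inverse_variance_weights(std_errors, floor=1e-6):
    """Assumed weighting scheme: 1 / SE^2, rescaled so the weights average to 1."""
    raw = [1.0 / max(se, floor) ** 2 for se in std_errors]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def parametric_bootstrap_labels(estimates, std_errors, n_replicates, seed=0):
    """Draw replicate label sets, each label sampled from Normal(estimate, SE)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(mu, se) for mu, se in zip(estimates, std_errors)]
        for _ in range(n_replicates)
    ]


# Hypothetical pEC50 estimates and curve-fit standard errors for three molecules.
pec50 = [6.2, 5.1, 7.4]
pec50_se = [0.1, 0.8, 0.2]

# Weights that could be passed to a learner accepting per-sample weights.
weights = inverse_variance_weights(pec50_se)

# Label sets that could each train one member of an ensemble.
replicates = parametric_bootstrap_labels(pec50, pec50_se, n_replicates=200)
```

In this sketch a poorly determined EC50 (large standard error) receives a small weight and widely scattered bootstrap labels, so it influences a downstream model less than a well-determined one.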

Figures

Figure 1. (a) Reliable vs (b) unreliable fit of the Hill equation to experimental data points.

Figure 2. Graphical examples of regression and Bayesian fitting procedures.

Figure 3. Schematic comparing the standard approach to our modified testing procedure. The difference is in which values are tested: in this study we test how well we can predict experimental values rather than estimated EC50 values. The point where arrows pointing in opposite directions meet is where the evaluation metric is calculated.

Figure 4. Number of times uncertainty information caused model performance to be better, significantly better, the same, or worse than the equivalent model that did not use this information on the PubChem data sets. PB-RF, random forest with parametric bootstrap; W-RF, weighted random forest; VOS, random forest with variable output smearing; SVR, support vector regression; WSVR, weighted support vector regression.

Figure 5. Number of times uncertainty information caused model performance to be better, significantly better, the same, or worse than the equivalent model that did not use this information on the BASF data sets. PB-RF, random forest with parametric bootstrap; W-RF, weighted random forest; VOS, random forest with variable output smearing; SVR, support vector regression; WSVR, weighted support vector regression.

Figure 6. Change in root-mean-squared error as α is changed on data set AID 449756.
