Review

Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening

Qurrat Ul Ain et al. Wiley Interdiscip Rev Comput Mol Sci. 2015 Nov-Dec;5(6):405-424. doi: 10.1002/wcms.1225. Epub 2015 Aug 28.

Abstract

Docking tools that predict whether and how a small molecule binds to a target can be applied if a structural model of that target is available. The reliability of docking depends, however, on the accuracy of the adopted scoring function (SF). Despite intense research over the years, improving the accuracy of SFs for structure-based binding affinity prediction or virtual screening has proven to be a challenging task for any class of method. New SFs based on modern machine-learning regression models, which do not impose a predetermined functional form and are thus able to exploit much larger amounts of experimental data effectively, have recently been introduced. These machine-learning SFs have been shown to outperform a wide range of classical SFs at both binding affinity prediction and virtual screening. The emerging picture from these studies is that the classical approach of using linear regression with a small number of expert-selected structural features can be strongly improved by a machine-learning approach based on nonlinear regression allied with comprehensive data-driven feature selection. Furthermore, the performance of classical SFs does not grow with larger training datasets, and hence this performance gap is expected to widen as more training data become available in the future. Other topics covered in this review include predicting the reliability of an SF on a particular target class, generating synthetic data to improve predictive performance, and modeling guidelines for SF development. WIREs Comput Mol Sci 2015, 5:405-424. doi: 10.1002/wcms.1225. For further resources related to this article, please visit the WIREs website.
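The contrast the abstract draws between a predetermined (additive, linear) functional form and a nonparametric model that infers the form from the data can be sketched with a toy example. The descriptors and "affinities" below are synthetic, and k-nearest-neighbour regression merely stands in for the more powerful learners (e.g. random forest) used in actual machine-learning SFs:

```python
# Illustrative sketch (not from the paper): a classical SF's fixed linear
# functional form versus a nonparametric learner on synthetic data.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (fixed, parametric form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression: no assumed functional form."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model over a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic, strongly nonlinear "affinity" surface: y = x^2.
train_x = [i / 10 for i in range(-20, 21)]
train_y = [x * x for x in train_x]

linear = fit_linear(train_x, train_y)
knn = fit_knn(train_x, train_y)

# The nonparametric model captures curvature the linear form cannot.
print(mse(linear, train_x, train_y) > mse(knn, train_x, train_y))  # → True
```

On this deliberately nonlinear surface the least-squares line cannot improve no matter how much data it sees, while the nonparametric model tracks the curvature, mirroring the performance-gap argument made in the abstract.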


Figures

Figure 1
Examples of force‐field, knowledge‐based, empirical, and machine‐learning scoring functions (SFs). The first three types, collectively termed classical SFs, are distinguished by the type of structural descriptors employed. However, from a mathematical perspective, all classical SFs assume an additive functional form. By contrast, nonparametric machine‐learning SFs make no assumptions about the functional form. Instead, the functional form is inferred from the training data in an unbiased manner. As a result, classical and machine‐learning SFs behave very differently in practice [20].
Figure 2
Criteria to select data to build and validate scoring functions (SFs). Protein‐ligand complexes can be selected by their quality and protein‐family membership, as well as by the type of structural and binding data, depending on the intended docking application and modeling strategy. Classical SFs typically employ a few hundred X‐ray crystal structures of the highest quality, along with their binding constants, to score complexes with proteins from any family. In contrast, data selection for machine‐learning SFs is much more varied, with the largest training data volumes leading to the best performance.
Figure 3
Workflow to train and validate a scoring function (SF). Feature selection (FS) can be data‐driven or expert‐based (for simplicity, we do not represent embedded FS, which would take place at the model training stage). A range of machine‐learning regression or classification models can be used for training, whereas classical SFs have used linear regression. Model selection has ranged from taking the best model on the training set to selecting the one with the best cross‐validated performance. The metrics for model selection and performance evaluation depend on the application.
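The cross-validated model-selection step described in this caption can be illustrated with a minimal, stdlib-only sketch. The two candidate "SFs" and the perfectly linear synthetic data are invented for illustration:

```python
# Hypothetical sketch of model selection by cross-validated performance:
# split the data into k folds, score each candidate model out-of-fold,
# and keep the candidate with the lowest average error.

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test

def cross_val_mse(fit, xs, ys, k=5):
    """Mean squared error averaged over k held-out folds."""
    errs = []
    for tr, te in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in tr], [ys[i] for i in tr])
        errs.append(sum((model(xs[i]) - ys[i]) ** 2 for i in te) / len(te))
    return sum(errs) / k

# Two made-up candidate "SFs": a constant baseline and a slope-through-origin fit.
def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_slope(xs, ys):
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: a * x

xs = list(range(1, 21))
ys = [2.0 * x for x in xs]  # perfectly linear synthetic data

scores = {name: cross_val_mse(f, xs, ys)
          for name, f in [("mean", fit_mean), ("slope", fit_slope)]}
best = min(scores, key=scores.get)
print(best)  # → slope
```

Selecting by cross-validated error, rather than by fit on the training set, is what guards against an over-flexible model winning merely by memorizing its training data.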
Figure 4
Blind test showing how test set performance (Rp) grows with more training data when using random forest (models 3 and 4), but stagnates with multiple linear regression (model 2). Model 1 is AutoDock Vina, which acts as a performance baseline.
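Rp in this figure denotes the Pearson correlation coefficient between predicted and measured binding affinities on the test set. A minimal stdlib implementation, with made-up numbers, would be:

```python
from math import sqrt

def pearson_r(pred, meas):
    """Pearson correlation (Rp) between predicted and measured affinities."""
    n = len(pred)
    mp, mm = sum(pred) / n, sum(meas) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(pred, meas))
    sp = sqrt(sum((p - mp) ** 2 for p in pred))
    sm = sqrt(sum((m - mm) ** 2 for m in meas))
    return cov / (sp * sm)

# Perfectly correlated toy predictions give Rp = 1 (up to rounding).
print(round(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # → 1.0
```

Because Rp is invariant to linear rescaling of the predictions, it measures ranking-plus-linearity of the predicted affinities rather than their absolute values, which is why it is a common benchmark metric for scoring functions.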

References

    1. Schneider G. Virtual screening: an endless staircase? Nat Rev Drug Discov 2010, 9:273–276. - PubMed
    2. Vasudevan SR, Churchill GC. Mining free compound databases to identify candidates selected by virtual screening. Expert Opin Drug Discov 2009, 4:901–906. - PubMed
    3. Villoutreix BO, Renault N, Lagorce D, Sperandio O, Montes M, Miteva MA. Free resources to assist structure‐based virtual ligand screening experiments. Curr Protein Pept Sci 2007, 8:381–411. - PubMed
    4. Xing L, McDonald JJ, Kolodziej SA, Kurumbail RG, Williams JM, Warren CJ, O'Neal JM, Skepner JE, Roberds SL. Discovery of potent inhibitors of soluble epoxide hydrolase by combinatorial library design and structure‐based virtual screening. J Med Chem 2011, 54:1211–1222. - PubMed
    5. Hermann JC, Marti‐Arbona R, Fedorov AA, Fedorov E, Almo SC, Shoichet BK, Raushel FM. Structure‐based activity prediction for an enzyme of unknown function. Nature 2007, 448:775–779. - PMC - PubMed