Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Review
. 2020 May 21;21(3):791-802.
doi: 10.1093/bib/bbz026.

Validation strategies for target prediction methods

Affiliations
Review

Validation strategies for target prediction methods

Neann Mathai et al. Brief Bioinform. .

Abstract

Computational methods for target prediction, based on molecular similarity and network-based approaches, machine learning, docking and others, have evolved as valuable and powerful tools to aid the challenging task of mode of action identification for bioactive small molecules such as drugs and drug-like compounds. Critical to discerning the scope and limitations of a target prediction method is understanding how its performance was evaluated and reported. Ideally, large-scale prospective experiments are conducted to validate the performance of a model; however, this expensive and time-consuming endeavor is often not feasible. Therefore, to estimate the predictive power of a method, statistical validation based on retrospective knowledge is commonly used. There are multiple statistical validation techniques that vary in rigor. In this review we discuss the validation strategies employed, highlighting the usefulness and constraints of the validation schemes and metrics that are employed to measure and describe performance. We address the limitations of measuring only generalized performance, given that the underlying bioactivity and structural data are biased towards certain small-molecule scaffolds and target families, and suggest additional aspects of performance to consider in order to produce more detailed and realistic estimates of predictive power. Finally, we describe the validation strategies that were employed by some of the most thoroughly validated and accessible target prediction methods.

Keywords: classification; data bias; model validation; performance metrics; polypharmacology; target prediction.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Illustrations of example data partitioning schemes: (A) a single train–test split, (B) a single train–test split of chronological data, (C) a 5-fold CV scheme, (D) a single train–test split into construction and validation sets for internal validation and an external testing set for external validation, (E) a 4-fold CV scheme used for internal validation with a testing set reserved for external validation and (F) a nested CV scheme with a 2-fold loop for internal validation and a 3-fold loop for external validation.
Figure 2
Figure 2
Examples of CV-testing folds designed to have (A) all data points involving specific queries within 1-fold (points inside the purple box), (B) all data points involving specific targets within 1-fold (points inside the purple box) and (C) all data points involving the components of query compounds–target pairs within one testing fold (points inside the purple boxes). The data points covered by the blue boxes are omitted from both training and testing data during the CV round involving the purple boxed data as the testing set, and the remaining data points are used as the training set. Interacting pairs are shown in green while (putative) non-interacting pairs are shown in white (adapted from Pahikkala et al. [25]).
Figure 3
Figure 3
(A) A binary classification confusion matrix with the four categories of prediction (FPs may include putative false positives); (B) ROC curves: the closer the curves are to the top left-hand corner, the better. AUC values alone may be deceptive as a lack of correct early predictions may be offset by an increased number of correct predictions later, leading to high AUC values. This scenario is shown by the green and purple curves. (C) Precision-recall curve: the closer the curve is to the top right corner, the better the model’s performance.
Figure 4
Figure 4
Success rates for a target prediction model (e.g. percentage of compounds for which at least one known target was ranked among the top 1, top 3 and top 5 positions) versus the maximum similarity between the individual query compounds and their closest related compounds in the reference data. Such plots are powerful tools to visualize a method’s capacity for inter- and extrapolation and help with the definition of the applicability domain.

Similar articles

Cited by

References

    1. Moffat JG, Vincent F, Lee JA, et al. . Opportunities and challenges in phenotypic drug discovery: an industry perspective. Nat Rev Drug Discov 2017;16:531–543. - PubMed
    1. Chaudhari R, Tan Z, Huang B, et al. . Computational polypharmacology: a new paradigm for drug discovery. Expert Opin Drug Discov 2017;12:279–291. - PMC - PubMed
    1. Reddy AS, Zhang S. Polypharmacology: drug discovery for the future. Expert Rev Clin Pharmacol 2013;6:41–47. - PMC - PubMed
    1. Anighoro A, Bajorath J, Rastelli G. Polypharmacology: challenges and opportunities in drug discovery. J Med Chem 2014;57:7874–7887. - PubMed
    1. Proschak E, Stark H, Merk D. Polypharmacology by design: a medicinal chemist’s perspective on multitargeting compounds. J Med Chem 2019;62:420–444. - PubMed

Publication types

MeSH terms

Substances