J Comput Aided Mol Des. 2002 May-Jun;16(5-6):357-69.
doi: 10.1023/a:1020869118689.

Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Alexander Golbraikh et al. J Comput Aided Mol Des. 2002 May-Jun.

Abstract

One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power, defined as the ability of a model to accurately predict the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
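
The abstract does not spell out the three sphere-exclusion variants, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes Euclidean distance in descriptor space, a fixed sphere radius, and an arbitrary rule for picking the next sphere center (the lowest remaining index). The names `sphere_exclusion_split`, `radius`, and `start` are hypothetical. Each sphere center goes to the training set, and every other compound that falls inside the sphere goes to the test set, so each test compound lies within `radius` of some training compound, in the spirit of criterion (i).

```python
import math
from typing import List, Sequence, Tuple


def sphere_exclusion_split(
    points: Sequence[Sequence[float]],
    radius: float,
    start: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split compound indices into (training, test) by sphere exclusion.

    Assumptions (not taken from the paper): Euclidean distance, a single
    fixed radius, and the next center chosen as the lowest remaining index.
    """
    remaining = list(range(len(points)))
    train: List[int] = []
    test: List[int] = []
    center = start
    while remaining:
        # The sphere center becomes a training compound.
        train.append(center)
        remaining.remove(center)
        # Every other remaining compound inside the sphere becomes a test compound.
        inside = [i for i in remaining
                  if math.dist(points[i], points[center]) <= radius]
        test.extend(inside)
        # Exclude the whole sphere from further consideration.
        remaining = [i for i in remaining if i not in inside]
        if remaining:
            center = remaining[0]  # arbitrary, hypothetical selection rule
    return train, test


# Toy two-descriptor dataset: two tight pairs and one isolated compound.
descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
train_idx, test_idx = sphere_exclusion_split(descriptors, radius=0.5)
print(train_idx, test_idx)  # prints [0, 2, 4] [1, 3]
```

In this sketch the radius controls the trade-off: a larger radius yields fewer, more widely spaced training compounds and a larger test set, while an isolated compound (index 4 above) always ends up in the training set because no sphere other than its own covers it.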
