J Chem Inf Comput Sci. 2003 May-Jun;43(3):964-9.
doi: 10.1021/ci020377j.

VSMP: a novel variable selection and modeling method based on the prediction

Shu-Shen Liu et al. J Chem Inf Comput Sci. 2003 May-Jun.

Abstract

The use of numerous descriptors indicative of molecular structure and topology is becoming more common in quantitative structure-activity relationship (QSAR) studies. Choosing adequate descriptors for a QSAR study is important but difficult because no absolute rules govern the choice. A variety of variable selection techniques, including stepwise regression, partial least squares/principal component analysis (PLS/PCA), neural networks, and evolutionary algorithms such as the genetic algorithm, have been applied to this common problem. All-subsets regression (ASR) is capable of finding the best variable subset in a large pool. In this paper, a novel variable selection and modeling method based on prediction, VSMP for short, is developed. Two controllable parameters are introduced into ASR to improve its performance: the interrelation coefficient between pairs of independent variables (r_int) and the correlation coefficient obtained using leave-one-out (LOO) cross-validation (q^2). The technique differs from other ASR-related variable selection procedures in two main features: (1) The search for optimal subsets is controlled by the statistic q^2 or the root-mean-square error of prediction (RMSEP) from the LOO cross-validation step rather than by the correlation coefficient obtained in the modeling step (r^2). (2) The search over all optimal subsets is sped up by using the statistic r_int together with q^2. A comparison of the results of VSMP applied to the Selwood data set (n = 31 compounds, m = 53 descriptors) with those obtained from alternative algorithms shows the good performance of the technique.
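The two controls described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `loo_q2` and `select_subset`, the ordinary least-squares model, the pairwise-correlation threshold `r_int_max = 0.75`, and the synthetic data are all assumptions made for illustration. The sketch enumerates every subset of a fixed size, skips subsets containing a pair of descriptors whose absolute correlation exceeds the threshold, and ranks the remaining subsets by LOO q^2 instead of the fitted r^2.

```python
from itertools import combinations

import numpy as np


def loo_q2(X, y):
    """Leave-one-out q^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    # Hat matrix: for OLS, the LOO residual equals residual / (1 - leverage),
    # so the model does not have to be refitted n times.
    H = A @ np.linalg.pinv(A)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    press = np.sum(loo_residuals ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def select_subset(X, y, size, r_int_max=0.75):
    """Return (best_subset, best_q2) over all subsets of the given size.

    r_int_max is a hypothetical threshold on the absolute pairwise
    correlation between descriptors; subsets violating it are skipped.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    best_subset, best_q2 = None, -np.inf
    for subset in combinations(range(X.shape[1]), size):
        # Inter-correlation filter: prunes subsets before any model is fitted.
        if any(corr[i, j] > r_int_max for i, j in combinations(subset, 2)):
            continue
        q2 = loo_q2(X[:, list(subset)], y)
        if q2 > best_q2:
            best_subset, best_q2 = subset, q2
    return best_subset, best_q2


# Synthetic stand-in data (not the Selwood set): 30 samples, 6 descriptors,
# with the response depending only on descriptors 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=30)

subset, q2 = select_subset(X, y, size=2)
print(subset, round(q2, 3))
```

Because the correlation filter runs before any regression, each rejected subset costs only a few table lookups, which is one plausible reading of how r_int can speed up the search. Exhaustive enumeration still grows combinatorially with the number of descriptors, so this sketch is practical only for small subset sizes.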
