BMC Bioinformatics. 2008 Sep 2;9:360.

doi: 10.1186/1471-2105-9-360.

The C1C2: a framework for simultaneous model selection and assessment

Martin Eklund¹, Ola Spjuth, Jarl Es Wikberg

PMID: 18761753
PMCID: PMC2556350
DOI: 10.1186/1471-2105-9-360

Martin Eklund et al. BMC Bioinformatics. 2008.

Affiliation

¹ Department of Pharmaceutical Biosciences, Uppsala University, Box 591, BMC, SE-751 24 Uppsala, Sweden. martin.eklund@farmbio.uu.se

Abstract

Background: There has been recent concern regarding the inability of predictive modeling approaches to generalize to new data. Some of the problems can be attributed to improper methods for model selection and assessment. Here, we have addressed this issue by introducing a novel and general framework, the C1C2, for simultaneous model selection and assessment. The framework relies on a partitioning of the data that separates model choice from model assessment in terms of the data used for each. Since the number of conceivable models is in general vast, it was also of interest to investigate the use of two automatic search methods, a genetic algorithm and a brute-force method, for model choice. As a demonstration, the C1C2 was applied to simulated and real-world datasets. A penalized linear model was assumed to reasonably approximate the true relation between the dependent and independent variables, thus reducing the model choice problem to a matter of variable selection and choice of penalizing parameter. We also studied the impact of assuming prior knowledge about the number of relevant variables on model choice and generalization error estimates. The results obtained with the C1C2 were compared to those obtained by employing repeated K-fold cross-validation for choosing and assessing a model.
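The partitioning idea described in the Background can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' C1C2 algorithm: the penalty is assumed to be a ridge penalty, the search is a brute-force pass over small variable subsets, and all function names, the `assess_frac` split, and the simulated data are hypothetical.

```python
import itertools
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (no intercept): solve (X'X + lam*I) b = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_mse(X, y, lam, folds):
    """Mean squared prediction error of ridge over the given CV folds."""
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        beta = ridge_fit(X[train], y[train], lam)
        sq_err += np.sum((y[test_idx] - X[test_idx] @ beta) ** 2)
    return sq_err / len(y)

def choose_model(X, y, lambdas, max_vars, k, rng):
    """Brute-force search over variable subsets and penalties (choice data only)."""
    folds = np.array_split(rng.permutation(len(y)), k)  # same folds for all candidates
    best = (np.inf, None, None)
    for size in range(1, max_vars + 1):
        for subset in itertools.combinations(range(X.shape[1]), size):
            cols = list(subset)
            for lam in lambdas:
                err = kfold_mse(X[:, cols], y, lam, folds)
                if err < best[0]:
                    best = (err, cols, lam)
    return best

def select_and_assess(X, y, lambdas, max_vars=2, k=5, assess_frac=0.3, seed=0):
    """Choose a model on one partition, then assess it on data never used for the choice."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_assess = int(round(assess_frac * len(y)))
    assess, choice = idx[:n_assess], idx[n_assess:]
    cv_err, cols, lam = choose_model(X[choice], y[choice], lambdas, max_vars, k, rng)
    beta = ridge_fit(X[choice][:, cols], y[choice], lam)
    resid = y[assess] - X[assess][:, cols] @ beta
    return {"variables": cols, "lambda": lam, "cv_error": float(cv_err),
            "assessment_error": float(np.mean(resid ** 2))}

# Simulated example (hypothetical): 6 variables, only variables 0 and 3 are relevant.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(120, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + data_rng.normal(scale=0.5, size=120)
print(select_and_assess(X, y, lambdas=[0.01, 1.0, 100.0]))
```

The brute-force loop grows combinatorially with the number of variables, which is presumably why a heuristic search such as a genetic algorithm becomes attractive for larger problems.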

Results: The C1C2 framework performed well at finding the true model in terms of choosing the correct variable subset and producing reasonable choices for the penalizing parameter, even in situations when the independent variables were highly correlated and when the number of observations was less than the number of variables. The C1C2 framework was also found to give accurate estimates of the generalization error. Prior information about the number of important independent variables improved the variable subset choice but reduced the accuracy of generalization error estimates. Using the genetic algorithm worsened the model choice but not the generalization error estimates, compared to using the brute-force method. The results obtained with repeated K-fold cross-validation were similar to those produced by the C1C2 in terms of model choice; however, the generalization error estimates were less accurate.

Conclusion: The C1C2 framework was demonstrated to work well for finding the true model within a penalized linear model class and for accurately assessing its generalization error, even for datasets with many highly correlated independent variables, a low observation-to-variable ratio, and model assumption deviations. A complete separation of the model choice and the model assessment in terms of data used for each task improves the estimates of the generalization error.
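Why separating the data matters can be illustrated with a small simulation. The sketch below is not taken from the paper; it assumes pure-noise data, an ordinary least-squares fit on one variable at a time, and hypothetical names throughout. The error reported on the data that drove the choice is the minimum over all candidates, so it tends to be optimistic, whereas the error on a separate partition is not affected by the search.

```python
import numpy as np

def holdout_mse(X_fit, y_fit, X_new, y_new, cols):
    """Least-squares fit on one part of the data, squared error on another."""
    beta, *_ = np.linalg.lstsq(X_fit[:, cols], y_fit, rcond=None)
    return float(np.mean((y_new - X_new[:, cols] @ beta) ** 2))

def optimism_demo(n=60, p=200, seed=0):
    """Compare the error on data reused for model choice with the error on fresh data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(3 * n, p))
    y = rng.normal(size=3 * n)  # pure noise: no variable is truly relevant
    fit, sel, fresh = np.split(np.arange(3 * n), 3)
    # Choose the single variable with the lowest error on the selection part.
    errs = [holdout_mse(X[fit], y[fit], X[sel], y[sel], [j]) for j in range(p)]
    best = int(np.argmin(errs))
    reused = errs[best]  # reported on the same data that drove the choice
    separate = holdout_mse(X[fit], y[fit], X[fresh], y[fresh], [best])
    return reused, float(np.mean(errs)), separate

reused, average, separate = optimism_demo()
print(f"reused-data error: {reused:.3f}")
print(f"mean candidate error: {average:.3f}")
print(f"separate-data error: {separate:.3f}")
```

By construction the reused-data figure can never exceed the mean candidate error; how far the separate-data figure sits above it depends on the seed and on the number of candidates searched.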

Figures

**Figure 1**
**The C¹C² framework**. The data partitioning in step (a) in the pseudocode separates the model choice from its assessment, which is highlighted in purple in the figure. The left side of the figure relates to steps (b) to (d) in the pseudocode, and the right side to step (e); i.e. the left side relates to choosing the model and saving the parameter estimates, and the right side to assessing the model and saving the assessment results.

**Figure 3**
**Cluster dendrogram of the 14 selected variables from the Selwood dataset using repeated K-fold cross-validation**. Three distinct clusters can be noted (shown in red, green, and yellow rectangles). One sub-cluster can be seen within the red cluster (shown in a blue rectangle). The red and green numbers are p-values of a given cluster; they indicate how well the cluster is supported by data (see [31] for details). ⁺Additional variables selected by repeated K-fold cross-validation compared to the C¹C².

Cited by

Towards interoperable and reproducible QSAR analyses: Exchange of datasets.
Spjuth O, Willighagen EL, Guha R, Eklund M, Wikberg JE. J Cheminform. 2010 Jun 30;2(1):5. doi: 10.1186/1758-2946-2-5. PMID: 20591161.
QSAR with experimental and predictive distributions: an information theoretic approach for assessing model quality.
Wood DJ, Carlsson L, Eklund M, Norinder U, Stålring J. J Comput Aided Mol Des. 2013 Mar;27(3):203-19. doi: 10.1007/s10822-013-9639-5. Epub 2013 Mar 16. PMID: 23504478.
Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation.
Baumann D, Baumann K. J Cheminform. 2014 Nov 26;6(1):47. doi: 10.1186/s13321-014-0047-1. eCollection 2014. PMID: 25506400.
RRegrs: an R package for computer-aided model selection with multiple regression models.
Tsiliki G, Munteanu CR, Seoane JA, Fernandez-Lozano C, Sarimveis H, Willighagen EL. J Cheminform. 2015 Sep 15;7:46. doi: 10.1186/s13321-015-0094-2. eCollection 2015. PMID: 26379782.

References

1. Kontijevskis A, Prusis P, Petrovska R, Yahorava S, Mutulis F, Mutule I, Komorowski J, Wikberg JES. A look inside HIV resistance through retroviral protease interaction maps. PLoS Computational Biology. 2007;3.
2. Wikberg JES, Lapinsh M, Prusis P. Proteochemometrics: A tool for modelling the molecular interaction space. In: Kubinyi H, Müller G, editors. Chemogenomics in Drug Discovery - A Medicinal Chemistry Perspective. Weinheim: Wiley-VCH; 2004. pp. 289–309.
3. Hansch C. A Quantitative Approach to Biochemical Structure-Activity Relationships. Accounts of Chemical Research. 1969;2:232–239. doi: 10.1021/ar50020a002.
4. Hvidsten TR, Wilczynski B, Kryshtafovych A, Tiuryn J, Komorowski J, Fidelis K. Discovering regulatory binding-site modules using rule-based learning. Genome Res. 2005;15:856–866.
5. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a.

