BMC Bioinformatics. 2008 Sep 2;9:360.
doi: 10.1186/1471-2105-9-360.

The C1C2: a framework for simultaneous model selection and assessment

Martin Eklund et al. BMC Bioinformatics.

Abstract

Background: There has been recent concern about the inability of predictive modeling approaches to generalize to new data. Some of the problems can be attributed to improper methods for model selection and assessment. Here, we address this issue by introducing a novel and general framework, the C1C2, for simultaneous model selection and assessment. The framework partitions the data so that model choice and model assessment are carried out on separate data. Because the number of conceivable models is in general vast, we also investigated two automatic search methods for model choice: a genetic algorithm and a brute-force method. As a demonstration, the C1C2 was applied to simulated and real-world datasets. A penalized linear model was assumed to reasonably approximate the true relation between the dependent and independent variables, which reduces the model choice problem to variable selection and choice of the penalizing parameter. We also studied how assuming prior knowledge about the number of relevant variables affects model choice and generalization error estimates. The results obtained with the C1C2 were compared to those obtained by using repeated K-fold cross-validation to choose and assess a model.

Results: The C1C2 framework performed well at finding the true model: it chose the correct variable subset and produced reasonable choices for the penalizing parameter, even when the independent variables were highly correlated and when the number of observations was smaller than the number of variables. The C1C2 framework also gave accurate estimates of the generalization error. Prior information about the number of important independent variables improved the variable subset choice but reduced the accuracy of the generalization error estimates. Compared to the brute-force method, the genetic algorithm worsened the model choice but not the generalization error estimates. Repeated K-fold cross-validation produced model choices similar to those of the C1C2, but its generalization error estimates were less accurate.

Conclusion: The C1C2 framework was demonstrated to work well for finding the true model within a penalized linear model class and for accurately assessing its generalization error, even for datasets with many highly correlated independent variables, a low observation-to-variable ratio, and deviations from the model assumptions. Completely separating model choice from model assessment, in terms of the data used for each task, improves the estimates of the generalization error.
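
The separation principle stated in the conclusion can be illustrated with a minimal sketch. This is not the authors' C1C2 implementation, whose details are not given in the abstract. It assumes a ridge penalty as the penalized linear model, an inner K-fold cross-validation as the model choice criterion, a 70/30 random partition, and a brute-force search over small variable subsets. All function names and parameter values are hypothetical.

```python
import itertools
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (no intercept): (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, beta):
    """Mean squared prediction error."""
    return float(np.mean((y - X @ beta) ** 2))

def inner_cv_error(X, y, lam, folds):
    """K-fold CV error of one candidate model, computed on choice data only."""
    errs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(mse(X[test], y[test], beta))
    return float(np.mean(errs))

def choose_model(X, y, lams, max_vars, k, rng):
    """Brute-force search over variable subsets and penalty values."""
    folds = np.array_split(rng.permutation(len(y)), k)  # same folds for all candidates
    best_err, best_cols, best_lam = np.inf, None, None
    for size in range(1, max_vars + 1):
        for subset in itertools.combinations(range(X.shape[1]), size):
            cols = list(subset)
            for lam in lams:
                err = inner_cv_error(X[:, cols], y, lam, folds)
                if err < best_err:
                    best_err, best_cols, best_lam = err, cols, lam
    return best_cols, best_lam

def choose_then_assess(X, y, lams, max_vars=3, k=5, n_splits=5, frac=0.7, seed=0):
    """Repeat: partition the data, choose on one part, assess on the other."""
    rng = np.random.default_rng(seed)
    n = len(y)
    results = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        cut = int(frac * n)
        choice, assess = perm[:cut], perm[cut:]
        # Model choice never sees the assessment observations.
        cols, lam = choose_model(X[choice], y[choice], lams, max_vars, k, rng)
        beta = ridge_fit(X[choice][:, cols], y[choice], lam)
        # Generalization error is estimated only on the held-out part.
        results.append((cols, lam, mse(X[assess][:, cols], y[assess], beta)))
    return results

# Simulated data (assumed setup): 6 variables, the first 3 are relevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=200)
results = choose_then_assess(X, y, lams=[0.01, 1.0, 10.0])
for cols, lam, err in results:
    print(cols, lam, round(err, 3))
```

Each entry of `results` holds the chosen variable subset, the chosen penalty, and the error on data that played no part in the choice. Reusing the inner cross-validation error of the winning model as the generalization estimate would instead mix choice and assessment on the same data, which is the situation the conclusion warns against.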

Figures

Figure 1
The C1C2 framework. The data partitioning in step (a) in the pseudocode separates the model choice from its assessment, which is highlighted in purple in the figure. The left side of the figure relates to steps (b) to (d) in the pseudocode, and the right side to step (e); i.e. the left side relates to choosing the model and saving the parameter estimates, and the right side to assessing the model and saving the assessment results.
Figure 2
Generalization errors obtained with the C1C2 and repeated K-fold cross-validation. The figure shows $\overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$, where $\hat{\varepsilon}_{gen}$ were produced using the C1C2 (blue) and repeated K-fold cross-validation (red) for all other factor combinations in model (6). The plot is based on pooled $\overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$ over the four replicates for each method. The bars show the 95% confidence interval, calculated from the pooled results (the confidence intervals are only shown in one direction to avoid cluttering). The factor combinations in model (6) are coded as: ga – the GA search method was used, bf – the brute force search method was used, uncor – orthogonal independent variables in the dataset, cor – correlated independent variables in the dataset, 15 – n = 15 observations in the dataset, 200 – n = 200 observations in the dataset, all – no assumption regarding the number of nonzero $\delta_i$, 3 – three $\delta_i = 1$ were assumed.
Figure 3
Cluster dendrogram of the 14 selected variables from the Selwood dataset using repeated K-fold cross-validation. Three distinct clusters can be noted (shown in red, green, and yellow rectangles). One sub-cluster can be seen within the red cluster (shown in a blue rectangle). The red and green numbers are p-values of a given cluster; they indicate how well the cluster is supported by data (see [31] for details). +Additional variables selected by repeated K-fold cross-validation compared to the C1C2.

References

    1. Kontijevskis A, Prusis P, Petrovska R, Yahorava S, Mutulis F, Mutule I, Komorowski J, Wikberg JES. A look inside HIV resistance through retroviral protease interaction maps. PLoS Computational Biology. 2007;3
    2. Wikberg JES, Lapinsh M, Prusis P. Proteochemometrics: A tool for modelling the molecular interaction space. In: Kubinyi H, Müller G, editors. Chemogenomics in Drug Discovery - A Medicinal Chemistry Perspective. Weinheim: Wiley-VCH; 2004. pp. 289–309.
    3. Hansch C. A Quantitative Approach to Biochemical Structure-Activity Relationships. Accounts of Chemical Research. 1969;2:232–239. doi: 10.1021/ar50020a002.
    4. Hvidsten TR, Wilczynski B, Kryshtafovych A, Tiuryn J, Komorowski J, Fidelis K. Discovering regulatory binding-site modules using rule-based learning. Genome Res. 2005;15:856–866.
    5. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a.