Optimized application of penalized regression methods to diverse genomic data

Levi Waldron et al. Bioinformatics. 2011 Dec 15;27(24):3399–406. doi: 10.1093/bioinformatics/btr591.

Abstract

Motivation: Penalized regression methods have been adopted widely for high-dimensional feature selection and prediction in many bioinformatic and biostatistical contexts. While their theoretical properties are well-understood, specific methodology for their optimal application to genomic data has not been determined.

Results: Through simulation of contrasting scenarios of correlated high-dimensional survival data, we compared the LASSO, Ridge and Elastic Net penalties for prediction and variable selection. We found that a 2D tuning of the Elastic Net penalties was necessary to avoid mimicking the performance of LASSO or Ridge regression. Furthermore, we found that in a simulated scenario favoring the LASSO penalty, a univariate pre-filter made the Elastic Net behave more like Ridge regression, which was detrimental to prediction performance. We demonstrate the real-life application of these methods to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data. Based on these results, we provide an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data.
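The 2D tuning of the Elastic Net penalties described above can be sketched as follows. This is not the authors' pensim code; it is a minimal illustration using scikit-learn's ElasticNetCV, which cross-validates jointly over the overall penalty strength and the L1/L2 mixing parameter. The data, grid values, and fold counts are illustrative assumptions.

```python
# Sketch (not the authors' pensim implementation): 2D tuning of the
# Elastic Net via scikit-learn's ElasticNetCV, which cross-validates
# jointly over penalty strength (alpha) and the L1/L2 mix (l1_ratio).
# l1_ratio=1 recovers the LASSO; l1_ratio near 0 approaches Ridge.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# High-dimensional toy data: far more features than samples, with only
# a few truly informative features (a LASSO-favoring scenario).
X, y = make_regression(n_samples=100, n_features=1000,
                       n_informative=10, noise=5.0, random_state=0)

# Tuning over a grid of mixing values gives the "2D" tuning; fixing
# l1_ratio in advance makes the Elastic Net mimic LASSO or Ridge.
model = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.99],
                     n_alphas=20, cv=5, random_state=0).fit(X, y)

n_selected = int(np.sum(model.coef_ != 0))
print(f"tuned l1_ratio={model.l1_ratio_}, alpha={model.alpha_:.3f}, "
      f"features selected={n_selected}")
```

The selected `l1_ratio_` indicates whether the cross-validation favored a sparser (LASSO-like) or denser (Ridge-like) solution for this particular dataset.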

Availability and implementation: A parallelized implementation of the methods presented for regression and for simulation of synthetic data is provided as the pensim R package, available at http://cran.r-project.org/web/packages/pensim/index.html.

Contact: chuttenh@hsph.harvard.edu; juris@ai.utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
(A) Methodology for model selection and validation of high-dimensional data. Objectives include both feature selection and outcome prediction, e.g. for patient survival given tumor gene expression data. A nearly unbiased assessment of prediction accuracy for small sample sizes is obtained by repeating all steps of model selection in each iteration of the cross-validation. Variable selection and model conditioning are achieved within the training sets by an optional, permissive univariate pre-filter followed by repeated cross-validation for parameter tuning. These steps are detailed in Section 4. (B) Overfitting occurs in spite of tuning the models by cross-validation, as evidenced by reduced prediction accuracy in simulated test sets compared with resubstitution of training data.
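The nested design in panel A (pre-filter and tuning repeated inside every outer fold) can be sketched as follows, under the simplifying assumption of a binary outcome and an L2-penalized logistic model; the fold counts, filter threshold, and penalty grid are illustrative, not the paper's exact settings.

```python
# Sketch of nested cross-validation: the univariate pre-filter and all
# parameter tuning are refit inside each outer training fold, so the
# outer score is a nearly unbiased estimate of prediction accuracy.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=80, n_features=500,
                           n_informative=8, random_state=0)

pipe = Pipeline([
    # Permissive univariate pre-filter (P < 0.1), applied only to the
    # training portion of each fold to avoid selection bias.
    ("filter", SelectFpr(f_classif, alpha=0.1)),
    ("model", LogisticRegression(penalty="l2", max_iter=5000)),
])

# Inner CV tunes the penalty strength; outer CV assesses prediction.
inner = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
scores = cross_val_score(inner, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Filtering or tuning on the full dataset before the outer loop would leak information into the held-out folds and inflate the resulting accuracy estimate, which is the bias this design avoids.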
Fig. 2.
Optimized values of the Elastic Net tuning parameters in simulated scenarios favoring the LASSO and the Ridge penalties, with comparison to LASSO and Ridge regression. Selected values of the Elastic Net tuning parameters depend on the nature of the problem at hand (left half versus right half), on whether pre-filtering precedes the tuning (inner left versus right), and on the tuning strategy (one per row). In both scenarios, with and without pre-filtering, sequential tuning of the Elastic Net penalties (the λ1−λ2 and λ2−λ1 methods) was dominated by the first penalty tuned, as evidenced by the similarity of that penalty's values to the LASSO or Ridge penalty in the adjacent histogram, and by the smaller values of the second penalty tuned compared with the corresponding single-penalty regression. Assessments of model prediction and of the precision of variable selection are correspondingly similar for these methods (Fig. 3 and Supplementary Fig. S3). Note the different y-axis scale for the λ2−λ1 Elastic Net and Ridge regression in the LASSO-favoring scenario. Univariate pre-filtering (P<0.1) reduced the tuned values of all penalty parameters and, in particular, reduced the influence of λ1 relative to λ2 in the 2D-tuned (λ1λ2) Elastic Net, especially in the LASSO-favoring scenario (panels 9 versus 10 and 19 versus 20). These results show that sequential tuning of the Elastic Net penalties (the λ2−λ1 and λ1−λ2 methods) is not sufficient to gain any benefit over LASSO and Ridge regression, and that even in a problem where the λ1 penalty is preferred, a univariate pre-filter causes the λ2 penalty to dominate the λ1λ2 Elastic Net.
Fig. 3.
Ranking of methods for prediction accuracy in two scenarios, simulated to favor (A) the LASSO penalty and (B) the Ridge penalty. In both scenarios and at all levels of pre-filtering, sequential tuning of the Elastic Net is dominated by the first penalty tuned (λ1−λ2 is similar to LASSO, and λ2−λ1 is similar to Ridge). Only with 2D tuning (λ1λ2) does the Elastic Net perform comparably to the better single-penalty method in both scenarios. The pre-filter has little effect on prediction in most cases, except in the LASSO-favoring scenario, where it improves prediction by Ridge regression and worsens prediction by the λ1λ2 Elastic Net by decreasing the relative importance of the λ1 penalty.
Fig. 4.
Model selection guidelines allow reproducible outcome prediction from tumor gene expression data. Kaplan–Meier plots for cross-validated risk prediction for lung adenocarcinoma patients from Beer et al., using the Elastic Net. A naive model is overfit to training data, as evidenced by reduced prediction accuracy in cross-validation compared with resubstitution in the training data.
Fig. 5.
Application to high-dimensional metagenomic data. ROC curves for classification of obese (n=37) and non-obese (n=48) individuals using metagenomic data describing the gut microbiota (Qin et al., 2010), trained by Elastic Net. High-dimensional features are no longer gene expression but the relative abundance of specific microbial pathways in the stool microbiome. Overfitting to the training set is again observed in resubstitution predictions, but cross-validation shows marginal evidence of independent predictive ability (AUC = 0.59, P=0.08).

References

    1. Beer D.G., et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 2002;8:816–824.
    2. Boulesteix A.L. Reader's reaction to "Dimension reduction for classification with gene expression microarray data" by Dai et al. (2006). Stat. Appl. Genet. Mol. Biol. 2006;5:Article 16.
    3. Bøvelstad H.M., et al. Predicting survival from microarray data – a comparative study. Bioinformatics. 2007;23:2080–2087.
    4. Breiman L. Random forests. Mach. Learn. 2001;45:5–32.
    5. Bühlmann P. Boosting algorithms: regularization, prediction and model fitting. Stat. Sci. 2007;22:477–505.
