Optimized application of penalized regression methods to diverse genomic data

Levi Waldron et al. Bioinformatics. 2011 Dec 15;27(24):3399–406. doi: 10.1093/bioinformatics/btr591.

Abstract

Motivation: Penalized regression methods have been adopted widely for high-dimensional feature selection and prediction in many bioinformatic and biostatistical contexts. While their theoretical properties are well-understood, specific methodology for their optimal application to genomic data has not been determined.

Results: Through simulation of contrasting scenarios of correlated high-dimensional survival data, we compared the LASSO, Ridge and Elastic Net penalties for prediction and variable selection. We found that a 2D tuning of the Elastic Net penalties was necessary to avoid mimicking the performance of LASSO or Ridge regression. Furthermore, we found that in a simulated scenario favoring the LASSO penalty, a univariate pre-filter made the Elastic Net behave more like Ridge regression, which was detrimental to prediction performance. We demonstrate the real-life application of these methods to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data. Based on these results, we provide an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data.
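The 2D tuning of the Elastic Net penalties described above can be sketched as follows. This is not the authors' pensim code; it is a minimal illustration using scikit-learn's ElasticNetCV, which cross-validates jointly over the overall penalty strength and the L1/L2 mixing parameter. The data, grid values, and fold counts are illustrative assumptions.

```python
# Sketch (not the authors' pensim implementation): 2D tuning of the
# Elastic Net via scikit-learn's ElasticNetCV, which cross-validates
# jointly over penalty strength (alpha) and the L1/L2 mix (l1_ratio).
# l1_ratio=1 recovers the LASSO; l1_ratio near 0 approaches Ridge.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# High-dimensional toy data: far more features than samples, with only
# a few truly informative features (a LASSO-favoring scenario).
X, y = make_regression(n_samples=100, n_features=1000,
                       n_informative=10, noise=5.0, random_state=0)

# Tuning over a grid of mixing values gives the "2D" tuning; fixing
# l1_ratio in advance makes the Elastic Net mimic LASSO or Ridge.
model = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.99],
                     n_alphas=20, cv=5, random_state=0).fit(X, y)

n_selected = int(np.sum(model.coef_ != 0))
print(f"tuned l1_ratio={model.l1_ratio_}, alpha={model.alpha_:.3f}, "
      f"features selected={n_selected}")
```

The selected `l1_ratio_` indicates whether the cross-validation favored a sparser (LASSO-like) or denser (Ridge-like) solution for this particular dataset.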

Availability and implementation: A parallelized implementation of the methods presented for regression and for simulation of synthetic data is provided as the pensim R package, available at http://cran.r-project.org/web/packages/pensim/index.html.

Contact: chuttenh@hsph.harvard.edu; juris@ai.utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
(A) Methodology for model selection and validation of high-dimensional data. Objectives include both feature selection and outcome prediction, e.g. for patient survival given tumor gene expression data. A nearly unbiased assessment of prediction accuracy for small sample sizes is obtained by repeating all steps of model selection in each iteration of the cross-validation. Variable selection and model conditioning are achieved within the training sets by an optional, permissive univariate pre-filter followed by repeated cross-validation for parameter tuning. These steps are detailed in Section 4. (B) Overfitting occurs in spite of tuning the models by cross-validation, as evidenced by reduced prediction accuracy in simulated test sets compared with resubstitution of training data.
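The nested design in panel A (pre-filter and tuning repeated inside every outer fold) can be sketched as follows, under the simplifying assumption of a binary outcome and an L2-penalized logistic model; the fold counts, filter threshold, and penalty grid are illustrative, not the paper's exact settings.

```python
# Sketch of nested cross-validation: the univariate pre-filter and all
# parameter tuning are refit inside each outer training fold, so the
# outer score is a nearly unbiased estimate of prediction accuracy.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=80, n_features=500,
                           n_informative=8, random_state=0)

pipe = Pipeline([
    # Permissive univariate pre-filter (P < 0.1), applied only to the
    # training portion of each fold to avoid selection bias.
    ("filter", SelectFpr(f_classif, alpha=0.1)),
    ("model", LogisticRegression(penalty="l2", max_iter=5000)),
])

# Inner CV tunes the penalty strength; outer CV assesses prediction.
inner = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
scores = cross_val_score(inner, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Filtering or tuning on the full dataset before the outer loop would leak information into the held-out folds and inflate the resulting accuracy estimate, which is the bias this design avoids.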
Fig. 2.
Optimized values of the Elastic Net tuning parameters in simulated scenarios favoring the LASSO and the Ridge penalties, with comparison to LASSO and Ridge regression. Selected values of the Elastic Net tuning parameters depend on the nature of the problem at hand (left half versus right half), on whether pre-filtering precedes the tuning (inner left versus right), and on the tuning strategy (one per row). In both scenarios, with and without pre-filtering, sequential tuning of the Elastic Net penalties (the λ1−λ2 and λ2−λ1 methods) was dominated by the first penalty tuned, as evidenced by the similarity of that penalty's values to the LASSO or Ridge penalty in the adjacent histogram, and by the smaller values of the second penalty tuned compared with the corresponding single-penalty regression. Assessments of model prediction and of the precision of variable selection are correspondingly similar for these methods (Fig. 3 and Supplementary Fig. S3). Note the different y-axis scale for the λ2−λ1 Elastic Net and Ridge regression in the LASSO-favoring scenario. Univariate pre-filtering (P<0.1) reduced the tuned values of all penalty parameters and, in particular, reduced the influence of λ1 relative to λ2 in the 2D-tuned (λ1λ2) Elastic Net, especially in the LASSO-favoring scenario (panels 9 versus 10 and 19 versus 20). These results show that sequential tuning of the Elastic Net penalties (the λ2−λ1 and λ1−λ2 methods) is not sufficient to gain any benefit over LASSO and Ridge regression, and that even in a problem where the λ1 penalty is preferred, a univariate pre-filter causes the λ2 penalty to dominate the λ1λ2 Elastic Net.
Fig. 3.
Ranking of methods for prediction accuracy in two scenarios, simulated to favor (A) the LASSO penalty and (B) the Ridge penalty. In both scenarios and at all levels of pre-filtering, sequential tuning of the Elastic Net is dominated by the first penalty tuned (λ1−λ2 is similar to LASSO, and λ2−λ1 is similar to Ridge). Only with 2D tuning (λ1λ2) does the Elastic Net perform comparably to the better single-penalty method in both scenarios. The pre-filter has little effect on prediction in most cases, except in the LASSO-favoring scenario, where it improves prediction by Ridge regression and worsens prediction by the λ1λ2 Elastic Net by decreasing the relative importance of the λ1 penalty.
Fig. 4.
Model selection guidelines allow reproducible outcome prediction from tumor gene expression data. Kaplan–Meier plots for cross-validated risk prediction for lung adenocarcinoma patients from Beer et al., using the Elastic Net. A naive model is overfit to training data, as evidenced by reduced prediction accuracy in cross-validation compared with resubstitution in the training data.
Fig. 5.
Application to high-dimensional metagenomic data. ROC curves for classification of obese (n=37) and non-obese (n=48) individuals using metagenomic data describing the gut microbiota (Qin et al., 2010), trained by Elastic Net. High-dimensional features are no longer gene expression but the relative abundance of specific microbial pathways in the stool microbiome. Overfitting to the training set is again observed in resubstitution predictions, but cross-validation shows marginal evidence of independent predictive ability (AUC = 0.59, P=0.08).

References

    1. Beer D.G., et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 2002;8:816–824.
    2. Boulesteix A.L. Reader's reaction to "Dimension reduction for classification with gene expression microarray data" by Dai et al. (2006). Stat. Appl. Genet. Mol. Biol. 2006;5:Article 16.
    3. Bøvelstad H.M., et al. Predicting survival from microarray data – a comparative study. Bioinformatics. 2007;23:2080–2087.
    4. Breiman L. Random forests. Mach. Learn. 2001;45:5–32.
    5. Bühlmann P. Boosting algorithms: regularization, prediction and model fitting. Stat. Sci. 2007;22:477–505.
