Biometrics. 2008 Mar;64(1):115-23.
doi: 10.1111/j.1541-0420.2007.00843.x. Epub 2007 Jun 30.

Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Howard D Bondell et al. Biometrics. 2008 Mar.

Abstract

Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In this article, a new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters. In addition to improving prediction accuracy and interpretation, these resulting groups can then be investigated further to discover what contributes to the group having a similar behavior. The technique is based on penalized least squares with a geometrically intuitive penalty function that shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably to the existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information.
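The abstract describes the penalty only in words. As a minimal sketch, assuming the penalty takes the form of the L1 norm plus c times the sum of pairwise maxima of absolute coefficients (this formula is not stated in the abstract, and the function name `oscar_penalty` is hypothetical), the following shows how c = 0 reduces to the LASSO penalty and why the two-predictor constraint region has vertices both on the axes and on the diagonals:

```python
from itertools import combinations


def oscar_penalty(beta, c):
    """Assumed OSCAR-style penalty: sum_j |b_j| + c * sum_{j<k} max(|b_j|, |b_k|).

    The exact form is an assumption for illustration; the abstract only
    says the penalty is octagonal and yields zeros and exact ties.
    """
    abs_beta = [abs(b) for b in beta]
    l1_term = sum(abs_beta)
    pairwise_max_term = sum(max(a, b) for a, b in combinations(abs_beta, 2))
    return l1_term + c * pairwise_max_term


# With c = 0 the pairwise term vanishes and only the L1 (LASSO) norm remains.
beta = [0.5, -0.5, 0.0, 2.0]
l1_only = oscar_penalty(beta, c=0.0)
with_pairs = oscar_penalty(beta, c=1.0)

# Two predictors, bound t = 1, c = 1: a point on an axis and a point on the
# diagonal both lie exactly on the boundary, giving the octagon's 8 vertices.
t, c = 1.0, 1.0
axis_vertex = (t / (1 + c), 0.0)             # sparse solution: beta_2 = 0
diag_vertex = (t / (2 + c), t / (2 + c))     # tied solution: beta_1 = beta_2
on_axis = oscar_penalty(axis_vertex, c)
on_diag = oscar_penalty(diag_vertex, c)
```

Under this assumed form, a solution landing on an axis vertex sets a coefficient to exactly zero, and one landing on a diagonal vertex sets two coefficients exactly equal in magnitude, which matches the two behaviors the abstract attributes to the penalty.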

Figures

Figure 1. Graphical representation of the constraint region in the (β1, β2) plane for the LASSO, Elastic Net, and OSCAR. Note that all are non-differentiable at the axes. (a) Constraint region for the LASSO (solid line), along with three choices of tuning parameter for the Elastic Net. (b) Constraint region for the OSCAR for four values of c. The solid line represents c = 0, the LASSO.
Figure 2. Graphical representation in the (β1, β2) plane. The OSCAR solution is the first point at which the contours of the sum-of-squares function hit the octagonal constraint region. (a) Contours centered at the OLS estimate, low correlation (ρ = 0.15). The solution occurs at β̂1 = 0. (b) Contours centered at the OLS estimate, high correlation (ρ = 0.85). The solution occurs at β̂1 = β̂2.
Figure 3. Graphical representation of the correlation matrix of the 15 predictors for the soil data. The magnitude of each pairwise correlation is represented by a block in the grayscale image.
Figure 4. LASSO solution paths for the soil data. Absolute value of the 15 coefficients as a function of s, the proportion of the OLS norm, for the fixed value of c = 0, the LASSO. The vertical lines represent the best LASSO models in terms of the GCV and the 5-fold cross-validation criteria. (a) Solution paths for the 7 cation-related coefficients. (b) Solution paths for the remaining 8 coefficients.
Figure 5. OSCAR solution paths for the soil data. Absolute value of the 15 coefficients as a function of s, the proportion of the OLS norm, for the value of c = 4 as chosen by both GCV and 5-fold cross-validation. The vertical lines represent the selected models based on the two criteria. (a) Solution paths for the 7 cation-related coefficients. (b) Solution paths for the remaining 8 coefficients.