Comparison of subset selection methods in linear regression in the context of health-related quality of life and substance abuse in Russia
- PMID: 26319135
- PMCID: PMC4553217
- DOI: 10.1186/s12874-015-0066-2
Abstract
Background: Automatic stepwise subset selection methods in linear regression often perform poorly, both in variable selection and in estimation of coefficients and standard errors, especially when the number of independent variables is large and multicollinearity is present. Yet stepwise algorithms remain the dominant method in medical and epidemiological research.
Methods: We investigated the performance of stepwise methods (backward elimination and forward selection algorithms using AIC, BIC, and the likelihood ratio test (LRT) at p = 0.05) and of alternative subset selection methods in linear regression, including Bayesian model averaging (BMA) and penalized regression (lasso, adaptive lasso, and adaptive elastic net), in a dataset from a cross-sectional study of drug users in St. Petersburg, Russia, in 2012-2013. The dependent variable measured health-related quality of life, and the independent correlates included 44 variables measuring demographic, behavioral, and structural factors.
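As an illustration of the stepwise procedures named above, the following is a minimal sketch of backward elimination driven by AIC or BIC. It runs on synthetic data, not the study dataset, and it is not the authors' code; the helper names (`rss`, `criterion`, `backward_eliminate`) and all data settings are assumptions made for this example.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def criterion(X, y, cols, penalty):
    """Gaussian information criterion up to an additive constant:
    n * log(RSS / n) + penalty * k, with penalty = 2 for AIC and log(n) for BIC."""
    n = len(y)
    k = len(cols) + 1  # selected slopes plus the intercept
    return n * np.log(rss(X, y, cols) / n) + penalty * k

def backward_eliminate(X, y, penalty):
    """Start from the full model and repeatedly drop the variable whose removal
    lowers the criterion most; stop when no single removal improves it."""
    selected = list(range(X.shape[1]))
    best = criterion(X, y, selected, penalty)
    while selected:
        score, drop = min(
            (criterion(X, y, [c for c in selected if c != j], penalty), j)
            for j in selected
        )
        if score >= best:
            break
        selected.remove(drop)
        best = score
    return selected

# Synthetic data (an assumption for illustration): 3 true predictors, 7 noise columns.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(size=n)

aic_selected = backward_eliminate(X, y, penalty=2.0)
bic_selected = backward_eliminate(X, y, penalty=np.log(n))
print("AIC keeps columns:", sorted(aic_selected))
print("BIC keeps columns:", sorted(bic_selected))
```

Because the BIC penalty log(n) exceeds the AIC penalty of 2 once n is larger than about 7, BIC continues eliminating wherever AIC does, so its final model is never larger along this greedy path.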
Results: In our case study, all methods returned models of different size and composition, ranging from 41 to 11 variables. The percentage of significant variables among those selected in the final model varied from 100% to 27%. Model selection with stepwise methods was highly unstable, with most of the selected variables being significant, that is, with a 95% confidence interval for the coefficient that did not include zero (all of them in the case of backward elimination with BIC, forward selection with BIC, and backward elimination with LRT). Adaptive elastic net demonstrated improved stability and more conservative estimates of coefficients and standard errors compared to stepwise methods. By incorporating model uncertainty into subset selection and into estimation of coefficients and their standard deviations, BMA returned a parsimonious model with the most conservative results in terms of covariate significance.
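To make the idea of incorporating model uncertainty concrete, here is a minimal sketch of the widely used BIC approximation to Bayesian model averaging: every subset of a small predictor set is fitted, each model is weighted in proportion to exp(-BIC/2) under equal prior model probabilities, and each variable's posterior inclusion probability is the total weight of the models containing it. The data are synthetic, the names (`bic`, `inclusion_prob`) are hypothetical, and this is not necessarily the BMA implementation used in the study.

```python
import itertools
import numpy as np

def bic(X, y, cols):
    """BIC of a Gaussian OLS fit (with intercept), up to an additive constant."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return n * np.log(float(resid @ resid) / n) + np.log(n) * (len(cols) + 1)

# Synthetic data (an assumption for illustration): columns 0 and 1 matter, the rest are noise.
rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Enumerate all 2**p subsets; feasible only for small p.
models = [
    cols
    for size in range(p + 1)
    for cols in itertools.combinations(range(p), size)
]
bics = np.array([bic(X, y, cols) for cols in models])

# Approximate posterior model probabilities, assuming equal prior model probabilities.
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()

# Posterior inclusion probability: total weight of the models that contain variable j.
inclusion_prob = np.array(
    [sum(w for w, cols in zip(weights, models) if j in cols) for j in range(p)]
)
for j, prob in enumerate(inclusion_prob):
    print(f"x{j}: inclusion probability {prob:.3f}")
```

Exhaustive enumeration does not scale to 44 candidate variables; practical BMA software restricts or samples the model space, which this sketch does not attempt.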
Conclusions: BMA and adaptive elastic net performed best in our analysis. Based on our results and previous theoretical studies, stepwise methods in medical and epidemiological research may be outperformed by alternative methods in cases such as ours. In situations of high uncertainty, it is beneficial to apply several methodologically sound subset selection methods and to explore where their outputs do and do not agree. We recommend that researchers, at a minimum, explore model uncertainty and stability as part of their analyses and report these details in epidemiological papers.
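One common way to explore selection stability, as recommended above, is to repeat the selection procedure on bootstrap resamples and record how often each variable is retained. The sketch below does this for BIC-based backward elimination on synthetic data; it is an assumption-laden illustration with hypothetical names (`bic`, `backward_bic`, `inclusion_freq`), not the procedure reported in the paper.

```python
import numpy as np

def bic(X, y, cols):
    """BIC of a Gaussian OLS fit (with intercept), up to an additive constant."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return n * np.log(float(resid @ resid) / n) + np.log(n) * (len(cols) + 1)

def backward_bic(X, y):
    """Greedy backward elimination: drop variables while BIC keeps decreasing."""
    selected = list(range(X.shape[1]))
    best = bic(X, y, selected)
    while selected:
        score, drop = min(
            (bic(X, y, [c for c in selected if c != j]), j) for j in selected
        )
        if score >= best:
            break
        selected.remove(drop)
        best = score
    return selected

# Synthetic data (an assumption for illustration): columns 0 and 1 matter, the rest are noise.
rng = np.random.default_rng(1)
n, p, n_boot = 150, 8, 100
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

# Re-run the selection on rows resampled with replacement and count retentions.
counts = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)
    for j in backward_bic(X[idx], y[idx]):
        counts[j] += 1
inclusion_freq = counts / n_boot

for j, freq in enumerate(inclusion_freq):
    print(f"x{j}: selected in {freq:.0%} of bootstrap resamples")
```

Variables retained in nearly every resample are stable choices, while intermediate frequencies flag variables whose selection depends on the particular sample.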