Enhancing selection of alcohol consumption-associated genes by random forest
- PMID: 38606596
- PMCID: PMC11216877
- DOI: 10.1017/S0007114524000795
Abstract
Machine learning methods have been used to identify omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve the identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. Using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of all study participants), the AUC (area under the receiver operating characteristic curve) values of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. These AUC values were comparable to those of transcripts identified using conventional linear regression models; for example, the AUC values of the 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65 for the same three comparisons. With Bonferroni correction for the twenty-five Boruta-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, three with type 2 diabetes and one with hypertension. For example, alcohol consumption was inversely associated with the expression of DOCK4, IL4R and SORT1; DOCK4 and SORT1 were positively associated with obesity, and IL4R was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.
Keywords: Alcohol consumption; Boruta; CVD; Gene expression; Machine learning; random forest.
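The two quantities the abstract reports, the Bonferroni-corrected threshold and the pairwise AUC, can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the abstract: a family-wise alpha of 0.05 (inferred because 0.05 / (25 × 3) rounds to 6·7e-4), and a rank-based (Mann-Whitney) definition of AUC. The function names and the toy scores are hypothetical and are not the study's code or data.

```python
from itertools import product


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def pairwise_auc(scores_pos, scores_neg):
    """Rank-based AUC for one two-group comparison.

    Returns the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# 25 selected transcripts x 3 CVD risk factors = 75 tests.
# ASSUMPTION: alpha = 0.05 is inferred, not stated in the abstract.
threshold = bonferroni_threshold(0.05, 25 * 3)
print(f"{threshold:.1e}")  # 6.7e-04

# Hypothetical classifier scores for one comparison,
# e.g. heavy drinkers (positive) v. non-drinkers (negative).
toy_auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
print(round(toy_auc, 2))
```

An AUC computed this way would be obtained separately for each of the three group comparisons, which is why the abstract reports three values per transcript set.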