BMC Bioinformatics. 2022 Mar 21;23(1):97.

doi: 10.1186/s12859-022-04634-w.

Evaluation of tree-based statistical learning methods for constructing genetic risk scores

Michael Lau¹ ², Claudia Wigmann³, Sara Kress³, Tamara Schikowski³, Holger Schwender⁴

Affiliations

¹ Mathematical Institute, Heinrich Heine University, Düsseldorf, Germany. michael.lau@hhu.de.
² IUF - Leibniz Research Institute for Environmental Medicine, Düsseldorf, Germany. michael.lau@hhu.de.
³ IUF - Leibniz Research Institute for Environmental Medicine, Düsseldorf, Germany.
⁴ Mathematical Institute, Heinrich Heine University, Düsseldorf, Germany.

PMID: 35313824
PMCID: PMC8935722
DOI: 10.1186/s12859-022-04634-w

Michael Lau et al. BMC Bioinformatics. 2022.

Abstract

Background: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually not able to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based methods and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS.

Results: In an extensive simulation study and an application to a real data set from a German cohort study, we show that both tree-based approaches can outperform elastic net when constructing GRS for binary traits. In particular, logic bagging, a modification of logic regression, achieved comparatively high predictive power, as measured by the area under the curve and the statistical power. Even when only marginal genetic effects and no epistatic interaction effects were considered, the regularized regression method led to inferior results in most cases.

Conclusions: When constructing GRS, we recommend taking random forests and logic bagging into account, in particular, if it can be assumed that possibly unknown epistasis between SNPs is present. To develop the best possible prediction models, extensive joint hyperparameter optimizations should be conducted.

Keywords: Bagging; Elastic net; Epistasis; Logic regression; Polygenic risk scores; Random forests; Simulation study; Statistical learning; Variable selection.
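
As a minimal illustration of two quantities named in the abstract, the following Python sketch computes a simple weighted-sum GRS and evaluates it by the area under the curve (AUC). The weights, genotypes, and function names are hypothetical and chosen only for illustration. This is not the authors' procedure, which fits tree-based models or elastic net instead of using fixed weights.

```python
from itertools import product


def grs(genotype, weights):
    """Weighted sum of minor-allele counts (0, 1, or 2 per SNP)."""
    return sum(w * g for w, g in zip(weights, genotype))


def auc(case_scores, control_scores):
    """AUC as the share of case-control pairs in which the case scores
    higher than the control; ties count as one half."""
    total = 0.0
    n_pairs = 0
    for s_case, s_control in product(case_scores, control_scores):
        n_pairs += 1
        if s_case > s_control:
            total += 1.0
        elif s_case == s_control:
            total += 0.5
    return total / n_pairs


# Hypothetical per-SNP weights and toy genotypes (assumptions, not study data).
weights = [0.5, 0.2, 0.0]
cases = [(2, 1, 0), (1, 2, 1), (1, 0, 2)]
controls = [(0, 1, 2), (1, 0, 0), (0, 0, 1)]

case_scores = [grs(g, weights) for g in cases]
control_scores = [grs(g, weights) for g in controls]
toy_auc = auc(case_scores, control_scores)
print(round(toy_auc, 3))  # 0.944
```

A purely additive score like this one cannot represent gene-gene interactions, which is the limitation of linear methods that motivates the tree-based alternatives studied in the article.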

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Workflow of constructing and evaluating genetic risk scores

**Fig. 2**
Exemplary tree models for three binary input variables $X_{1}$, $X_{2}$, and $X_{3}$ predicting two different classes $c_{0}$ and $c_{1}$. In a, a classification tree is shown. b depicts a logic tree describing the Boolean expression $(X_{1}^{c} \land X_{2}) \lor (X_{1} \land X_{3}^{c})$. Here, a true Boolean expression is identified as class $c_{1}$, and a false one as class $c_{0}$. Negated input variables/leaves are marked by white letters on a black background. Both trees are equivalent, i.e., they make the same prediction for every predictor setting

**Fig. 3**
Mean AUC for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the first simulation scenario considering marginal effective SNPs evaluated on the test data

**Fig. 4**
Mean AUC for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the second simulation scenario incorporating interactions of SNPs evaluated on the test data. The Designs 2.1, 2.2, and 2.3 describe the scenarios where both interacting SNPs also exhibit marginal effects, only one of the two SNPs shows a marginal signal, or neither of them induces a main effect, i.e., (j, k) = (1, 2), (1, 4), or (4, 5) in Eq. (3), respectively

**Fig. 5**
Mean AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario incorporating continuous input variables evaluated on the test data. The Designs 3.1 and 3.2 describe the scenarios where the GxE interacting SNP also exhibits a moderate marginal effect or where it does not induce a main effect, i.e., j = 2 or 5 in Eq. (4), respectively

**Fig. 6**
AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the application to data from the SALIA study evaluated on the test data. Results for single unadjusted models also considering the alternative genome-wide construction approach

**Fig. 7**
AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the application to data from the SALIA study evaluated on the test data. Results for the final age-adjusted models with different air pollution indicators
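
The equivalence stated in the caption of Fig. 2 can be checked by enumerating all eight settings of the three binary inputs. The sketch below is a hedged illustration: the logic tree follows the Boolean expression given in the caption, while the structure of the classification tree (split on $X_{1}$ first, then on $X_{2}$ or $X_{3}$) is an assumption, since the caption does not spell it out. The function names are hypothetical.

```python
from itertools import product


def logic_tree(x1, x2, x3):
    """Boolean expression from the Fig. 2 caption:
    (not X1 and X2) or (X1 and not X3); true maps to class c1."""
    return "c1" if (not x1 and x2) or (x1 and not x3) else "c0"


def classification_tree(x1, x2, x3):
    """Assumed equivalent classification tree: split on X1 at the root,
    then on X2 in the X1 = 0 branch and on X3 in the X1 = 1 branch."""
    if not x1:
        return "c1" if x2 else "c0"
    return "c0" if x3 else "c1"


# Enumerate every predictor setting and compare the two predictions.
settings = list(product([False, True], repeat=3))
predictions = [(logic_tree(*s), classification_tree(*s)) for s in settings]
all_equal = all(a == b for a, b in predictions)
print(all_equal)  # True
```

The two representations encode the same decision rule. The logic tree states it as a single Boolean expression, whereas the classification tree expresses it as a sequence of splits.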

Cited by

Machine Learning to Advance Human Genome-Wide Association Studies.
Sigala RE, Lagou V, Shmeliov A, Atito S, Kouchaki S, Awais M, Prokopenko I, Mahdi A, Demirkan A. Genes (Basel). 2023 Dec 25;15(1):34. doi: 10.3390/genes15010034. PMID: 38254924. Review.
Transfer learning with false negative control improves polygenic risk prediction.
Jeng XJ, Hu Y, Venkat V, Lu TP, Tzeng JY. PLoS Genet. 2023 Nov 27;19(11):e1010597. doi: 10.1371/journal.pgen.1010597. eCollection 2023 Nov. PMID: 38011285.
Efficient gene-environment interaction testing through bootstrap aggregating.
Lau M, Kress S, Schikowski T, Schwender H. Sci Rep. 2023 Jan 17;13(1):937. doi: 10.1038/s41598-023-28172-4. PMID: 36650248.
From Data to Cure: A Comprehensive Exploration of Multi-omics Data Analysis for Targeted Therapies.
Mukherjee A, Abraham S, Singh A, Balaji S, Mukunthan KS. Mol Biotechnol. 2025 Apr;67(4):1269-1289. doi: 10.1007/s12033-024-01133-6. Epub 2024 Apr 2. PMID: 38565775. Review.
