Commun Biol. 2022 Aug 22;5(1):856.

doi: 10.1038/s42003-022-03812-z.

Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations

Michael Elgart^{#,1,2}, Genevieve Lyons^{#,3,4}, Santiago Romero-Brufau^{4,5}, Nuzulul Kurniansyah³, Jennifer A Brody⁶, Xiuqing Guo⁷, Henry J Lin⁷, Laura Raffield⁸, Yan Gao⁹, Han Chen^{10,11}, Paul de Vries¹⁰, Donald M Lloyd-Jones¹², Leslie A Lange¹³, Gina M Peloso¹⁴, Myriam Fornage^{10,15}, Jerome I Rotter⁷, Stephen S Rich¹⁶, Alanna C Morrison¹⁰, Bruce M Psaty¹⁷, Daniel Levy^{18,19}, Susan Redline^{3,20}; NHLBI’s Trans-Omics in Precision Medicine (TOPMed) Consortium; Tamar Sofer^{21,22,23}

Collaborators

NHLBI’s Trans-Omics in Precision Medicine (TOPMed) Consortium:
Paul de Vries

Affiliations

¹ Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA. melgart@bwh.harvard.edu.
² Department of Medicine, Harvard Medical School, Boston, MA, USA. melgart@bwh.harvard.edu.
³ Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
⁴ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
⁵ Department of Medicine, Mayo Clinic, Rochester, MN, USA.
⁶ Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA, USA.
⁷ The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA.
⁸ Department of Genetics, University of North Carolina, Chapel Hill, NC, USA.
⁹ The Jackson Heart Study, University of Mississippi Medical Center, Jackson, MS, USA.
¹⁰ Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA.
¹¹ Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
¹² Department of Preventive Medicine, Northwestern University, Chicago, IL, USA.
¹³ Department of Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA.
¹⁴ Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.
¹⁵ Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA.
¹⁶ Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA.
¹⁷ Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, WA, USA.
¹⁸ The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA.
¹⁹ The Framingham Heart Study, Framingham, MA, USA.
²⁰ Department of Medicine, Harvard Medical School, Boston, MA, USA.
²¹ Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA. tsofer@bwh.harvard.edu.
²² Department of Medicine, Harvard Medical School, Boston, MA, USA. tsofer@bwh.harvard.edu.
²³ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA. tsofer@bwh.harvard.edu.

^# Contributed equally.

PMID: 35995843
PMCID: PMC9395509
DOI: 10.1038/s42003-022-03812-z

Michael Elgart et al. Commun Biol. 2022.

Abstract

Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Including a PRS as a feature in an XGBoost model increases the percentage of variance explained, relative to the standard linear PRS model, by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
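
The quantities named in the abstract can be made concrete with a minimal sketch. It assumes the usual definition of a linear PRS as a weighted sum of allele dosages and reads "relative increase" as the gain in percentage variance explained divided by the baseline value. The function names, effect sizes, genotypes, and variance figures below are hypothetical and are not taken from the paper.

```python
def polygenic_risk_score(dosages, weights):
    """Linear PRS: weighted sum of allele dosages (0, 1, or 2 copies per SNP)."""
    return sum(w * d for w, d in zip(weights, dosages))


def relative_increase(pve_baseline, pve_new):
    """Relative gain in percentage variance explained, expressed in percent."""
    return 100.0 * (pve_new - pve_baseline) / pve_baseline


# Hypothetical per-SNP effect sizes and one person's genotype.
weights = [0.5, -0.25, 1.0]
person = [2, 1, 0]
prs = polygenic_risk_score(person, weights)  # 0.5*2 - 0.25*1 + 1.0*0

# Feature vector for a non-linear model: the SNP dosages plus the PRS
# appended as one additional feature.
features = person + [prs]

# Hypothetical values: going from 4% to 8% variance explained is a
# 100% relative increase, i.e. a doubling.
print(prs, features, relative_increase(4.0, 8.0))
```

Under this reading, the 100% figure reported for diastolic blood pressure corresponds to a doubling of the variance explained by the linear PRS, not to all of the variance being explained.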

PubMed Disclaimer

Conflict of interest statement

The authors declare the following competing interests: B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. G.L. is a full-time employee of Valo Health, a technology company, but this work was not conducted in that position and is not relevant to that position. All other co-authors declare no competing interests.

Figures

**Fig. 1. PRSice, LDpred2, and Lassosum2 Linear PRS results.**
Best-performing PRSice (gray) compared to best-performing LDpred2 (orange) and best Lassosum2 (brown) across the hyperparameters tuned using the training data.

**Fig. 2. Flow chart of ensemble model structure.**
The model relies on jointly training the LASSO and XGBoost model to identify the optimal value for the L1 regularization parameter and the number of boosting steps. CV indicates cross-validation, α refers to the regularization parameter, and Ɵ is the number of boosted trees for XGBoost. The optimal values for these hyperparameters were selected using threefold CV for the mean squared error of the XGBoost model.

**Fig. 3. Nonlinear model consistently outperforms linear ones for prediction of multiple complex phenotypes in a multi-ethnic dataset.**
Linear (PRS, pink), linear-regularized (LASSO, teal), and nonlinear (XGBoost, gray and purple) models were used to predict the harmonized phenotypes from TOPMed SNP data following adjustment for covariates. Two versions of the XGBoost model are shown: one using only the SNPs as features (gray; XGBoost alone) and one that also included the PRS as a feature (purple; XGBoost with PRS). The LASSO model (teal) was trained on the same set of SNPs as the XGBoost models. The inset (gray) depicts the estimated heritability for the same phenotypes in the same dataset, with error bars showing 95% confidence intervals, both estimated by restricted maximum likelihood (REML).

**Fig. 4. Model performance differs by group, with XGBoost consistently outperforming PRS.**
Performance of the PRS (pink) and XGBoost+PRS (purple) models trained on the combined dataset when applied to the prediction of the 5 phenotypes in separate race/ethnic groups. Panels a, b, and c refer to White, Black, and Hispanic/Latino groups, respectively.

**Fig. 5. Multi-ethnic XGBoost model performs on par with the race/ethnic-specific models.**
XGBoost with PRS models were trained either on the combined dataset containing all participants (pink) or on each race/ethnic group separately (teal, gray, and purple). The models were then evaluated on each of the groups (a Black, b Hispanic/Latino, and c White).
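
The joint tuning described in the Fig. 2 caption can be sketched as a grid search in which each (α, Ɵ) pair is scored by the threefold cross-validated mean squared error of the second-stage model. This is a sketch under stated assumptions, not the authors' implementation. To keep it dependency-free, a covariance threshold stands in for LASSO selection and boosted depth-1 trees stand in for XGBoost. All function names, grid values, and the simulated data are hypothetical.

```python
import itertools
import random
from statistics import fmean


def kfold(n, k, seed=0):
    """Shuffle range(n) and deal it into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_snps(X, y, rows, alpha):
    """Toy stand-in for LASSO selection (assumption): keep SNP j when
    |cov(x_j, y)| on the training rows exceeds the penalty alpha."""
    ybar = fmean(y[i] for i in rows)
    keep = []
    for j in range(len(X[0])):
        xbar = fmean(X[i][j] for i in rows)
        cov = fmean((X[i][j] - xbar) * (y[i] - ybar) for i in rows)
        if abs(cov) > alpha:
            keep.append(j)
    return keep


def boost_stumps(X, y, rows, cols, theta, lr=0.5):
    """Toy stand-in for XGBoost (assumption): theta rounds of depth-1
    regression trees, each fitted to the current residuals."""
    base = fmean(y[i] for i in rows)
    resid = {i: y[i] - base for i in rows}
    stumps = []
    for _ in range(theta):
        best = None
        for j in cols:
            for t in (0.5, 1.5):  # split points between dosages 0, 1, 2
                left = [resid[i] for i in rows if X[i][j] <= t]
                right = [resid[i] for i in rows if X[i][j] > t]
                if not left or not right:
                    continue
                lm, rm = fmean(left), fmean(right)
                sse = (sum((r - lm) ** 2 for r in left)
                       + sum((r - rm) ** 2 for r in right))
                if best is None or sse < best[0]:
                    best = (sse, j, t, lr * lm, lr * rm)
        if best is None:  # no usable split, e.g. no SNPs were selected
            break
        _, j, t, lv, rv = best
        stumps.append((j, t, lv, rv))
        for i in rows:
            resid[i] -= lv if X[i][j] <= t else rv

    def predict(x):
        return base + sum(lv if x[j] <= t else rv for j, t, lv, rv in stumps)

    return predict


def joint_search(X, y, alphas, thetas, k=3):
    """Score every (alpha, theta) pair by the mean k-fold MSE of the
    second-stage model; return the best pair and all scores."""
    folds = kfold(len(y), k)
    scores = {}
    for alpha, theta in itertools.product(alphas, thetas):
        fold_mse = []
        for held in folds:
            held_set = set(held)
            train = [i for i in range(len(y)) if i not in held_set]
            cols = select_snps(X, y, train, alpha)  # uses training rows only
            model = boost_stumps(X, y, train, cols, theta)
            fold_mse.append(fmean((y[i] - model(X[i])) ** 2 for i in held))
        scores[(alpha, theta)] = fmean(fold_mse)
    return min(scores, key=scores.get), scores


# Simulated data (not from the paper): 90 people, 6 SNPs, one additive
# effect and one SNP-by-SNP interaction.
rng = random.Random(1)
X = [[rng.choice((0, 1, 2)) for _ in range(6)] for _ in range(90)]
y = [0.8 * x[0] + (1.0 if x[1] and x[2] else 0.0) + rng.gauss(0, 0.3) for x in X]

best, scores = joint_search(X, y, alphas=(0.0, 0.05, 0.2), thetas=(1, 5, 20))
print("best (alpha, theta):", best, "CV MSE:", round(scores[best], 3))
```

Selection is rerun inside every fold so that the held-out rows never influence which SNPs reach the second stage. Scoring the pair by the error of the final model, rather than tuning each stage on its own, is what makes the search joint.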
