Sci Rep. 2024 May 30;14(1):12436.

doi: 10.1038/s41598-024-62945-9.

Machine learning models for predicting blood pressure phenotypes by combining multiple polygenic risk scores

Yana Hrytsenko^#^{1,2,3}, Benjamin Shea^#³, Michael Elgart^{1,2}, Nuzulul Kurniansyah¹, Genevieve Lyons⁴, Alanna C Morrison⁵, April P Carson⁶, Bernhard Haring^{7,8}, Braxton D Mitchell⁹, Bruce M Psaty^{10,11,12,13}, Byron C Jaeger¹⁴, C Charles Gu¹⁵, Charles Kooperberg¹⁶, Daniel Levy^{17,18}, Donald Lloyd-Jones¹⁹, Eunhee Choi²⁰, Jennifer A Brody^{10,12}, Jennifer A Smith^{21,22}, Jerome I Rotter²³, Matthew Moll^{1,2,24,25}, Myriam Fornage^{5,26}, Noah Simon²⁷, Peter Castaldi^{1,2}, Ramon Casanova¹⁴, Ren-Hua Chung²⁸, Robert Kaplan^{7,16}, Ruth J F Loos^{29,30}, Sharon L R Kardia²¹, Stephen S Rich³¹, Susan Redline^{1,2,32}, Tanika Kelly³³, Timothy O'Connor^{9,34,35}, Wei Zhao^{21,22}, Wonji Kim²⁵, Xiuqing Guo²³, Yii-Der Ida Chen²³; Trans-Omics in Precision Medicine Consortium; Tamar Sofer^{36,37,38,39,40}

Affiliations

¹ Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA.
² Department of Medicine, Harvard Medical School, Boston, MA, USA.
³ CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA, USA.
⁴ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
⁵ Department of Epidemiology, School of Public Health, Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, TX, USA.
⁶ Department of Medicine, University of Mississippi Medical Center, Jackson, MS, USA.
⁷ Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA.
⁸ Department of Medicine III, Saarland University, Homburg, Saarland, Germany.
⁹ Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA.
¹⁰ Department of Medicine, University of Washington, Seattle, WA, USA.
¹¹ Department of Epidemiology, University of Washington, Seattle, WA, USA.
¹² Cardiovascular Health Research Unit, University of Washington, Seattle, WA, USA.
¹³ Health Systems and Population Health, University of Washington, Seattle, WA, USA.
¹⁴ Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC, USA.
¹⁵ The Center for Biostatistics and Data Science, Washington University, St. Louis, USA.
¹⁶ Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA.
¹⁷ The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA.
¹⁸ The Framingham Heart Study, Framingham, MA, USA.
¹⁹ Department of Preventive Medicine, Northwestern University, Chicago, IL, USA.
²⁰ Columbia Hypertension Laboratory, Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA.
²¹ Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA.
²² Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA.
²³ Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA.
²⁴ VA Boston Healthcare System, West Roxbury, MA, USA.
²⁵ Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, USA.
²⁶ Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA.
²⁷ Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA, USA.
²⁸ Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taipei City, Taiwan.
²⁹ The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
³⁰ Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty for Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.
³¹ Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA.
³² Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
³³ Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA.
³⁴ Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
³⁵ Program in Health Equity and Population Health, University of Maryland School of Medicine, Baltimore, MD, USA.
³⁶ Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA. tsofer@bidmc.harvard.edu.
³⁷ Department of Medicine, Harvard Medical School, Boston, MA, USA. tsofer@bidmc.harvard.edu.
³⁸ CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA, USA. tsofer@bidmc.harvard.edu.
³⁹ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA. tsofer@bidmc.harvard.edu.
⁴⁰ Center for Life Sciences CLS-934, 3 Blackfan St., Boston, MA, 02115, USA. tsofer@bidmc.harvard.edu.

^# Contributed equally.

PMID: 38816422
PMCID: PMC11139858
DOI: 10.1038/s41598-024-62945-9

Yana Hrytsenko et al. Sci Rep. 2024.

Abstract

We constructed non-linear machine learning (ML) prediction models for systolic and diastolic blood pressure (SBP, DBP) using demographic and clinical variables and polygenic risk scores (PRSs). We developed a two-model ensemble consisting of a baseline model, in which prediction is based on demographic and clinical variables only, and a genetic model, in which we also include PRSs. We evaluated the use of a linear versus a non-linear model at both the baseline and the genetic model levels and assessed the improvement in performance when incorporating multiple PRSs. We report the ensemble model's performance as percentage variance explained (PVE) on a held-out test dataset. A non-linear baseline model improved the PVEs from 28.1% to 30.1% (SBP) and from 14.3% to 17.4% (DBP) compared with a linear baseline model. Including seven PRSs in the genetic model, computed from the largest available GWAS of SBP/DBP, improved the genetic model PVE from 4.8% to 5.1% (SBP) and from 4.7% to 5% (DBP) compared with using a single PRS. Adding 14 additional PRSs, computed from two independent GWASs, further increased the genetic model PVE to 6.3% (SBP) and 5.7% (DBP). PVE differed across self-reported race/ethnicity groups, with primarily all non-White groups benefitting from the inclusion of additional PRSs. In summary, non-linear ML models improve BP prediction in models incorporating diverse populations.

PubMed Disclaimer

Conflict of interest statement

B Psaty serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. G Lyons is currently a full-time employee of Alexion, AstraZeneca Rare Disease, and holds stock in the company; however, her contributions to the present manuscript were performed as part of her previous affiliation at the Harvard T.H. Chan School of Public Health, and this work is not related to her current occupation and affiliation. M Moll has received grant funding from Bayer and consulting fees from TriNetX, 2ndMD, TheaHealth, Sitka, Verona Pharma, and Axon Advisors. All other authors report no competing interests.

Figures

**Figure 1**
Study design. (a) The proposed ensemble model framework. The ensemble is composed of two models. The baseline model was trained on covariates ($X_{b}$) only, for prediction of SBP and DBP (${\hat{y}}_{b}$). To assess the accuracy of the baseline model we calculated the residuals (baseline residuals $r_{b}$) by subtracting the predicted value of SBP/DBP from the actual value of SBP/DBP. The genetic model was trained on a subset of the covariates and on genetic components (global PRSs) for prediction of the baseline model residuals $r_{b}$. We measured the accuracy of the genetic model by subtracting predicted genetic residuals ${\hat{r}}_{g}$ from baseline residuals $r_{b}$. The overall prediction of BP by the ensemble model is the sum of the predicted baseline BP ${\hat{y}}_{b}$ (by the baseline model) and the predicted baseline residuals ${\hat{r}}_{b}$ (by the genetic model). The accuracy of the ensemble model was assessed by calculating the percent variance explained (PVE) by the two models jointly. (b) The split of the primary TOPMed dataset into training and testing sets, followed by the fivefold cross-validation procedure, in which the training dataset is further split into 5 equal parts with one part designated for testing (repeated 5 times, with 1/5 of the training data designated at random for testing at each iteration). (c) Increasing levels of genetic model complexity, where each new model included additional PRSs. (d) The process of calculating local PRSs per LD block (secondary analysis). *BBJ* BioBank Japan, *BP* blood pressure, *GWAS* genome wide association study, *LD* linkage disequilibrium, *Level* model complexity level, *MVP* Million Veteran Program, *P* p-value threshold, *PRS* polygenic risk score, *SNPs* single-nucleotide polymorphisms, *TOPMed* Trans-Omics for Precision Medicine, *UKBB* + *ICBP* UK Biobank and International Consortium for Blood Pressure.
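
The two-stage structure described in the Figure 1 caption can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: the data are synthetic, ordinary least squares stands in for the non-linear learner, PVE is taken to be 1 minus mean squared error over the outcome variance, and all names (`fit_linear`, `X_b`, `prs`, and so on) are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept (a stand-in for any regressor)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def pve(y_true, y_pred):
    """Assumed definition: 1 - MSE / Var(y)."""
    return 1.0 - np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Synthetic data: 3 hypothetical covariates and 2 hypothetical PRSs.
rng = np.random.default_rng(0)
n = 4000
X_b = rng.normal(size=(n, 3))
prs = rng.normal(size=(n, 2))
y = (120 + X_b @ np.array([5.0, 2.0, -1.0])
     + prs @ np.array([3.0, 1.0])
     + rng.normal(scale=8.0, size=n))

train = np.arange(n) < 3000   # first 3000 rows for training
test = ~train                 # remaining rows held out

# Stage 1: baseline model on covariates only, then baseline residuals r_b.
coef_b = fit_linear(X_b[train], y[train])
y_hat_b = predict_linear(coef_b, X_b)
r_b = y - y_hat_b

# Stage 2: genetic model on a covariate subset plus PRSs, predicting r_b.
X_g = np.column_stack([X_b[:, :1], prs])
coef_g = fit_linear(X_g[train], r_b[train])
r_hat = predict_linear(coef_g, X_g)

# Ensemble prediction: baseline prediction plus predicted residuals.
y_hat_ens = y_hat_b + r_hat

pve_baseline = pve(y[test], y_hat_b[test])
pve_genetic = pve(r_b[test], r_hat[test])    # PVE at the residual level
pve_ensemble = pve(y[test], y_hat_ens[test])  # PVE at the phenotype level
print(pve_baseline, pve_genetic, pve_ensemble)
```

In this sketch the genetic PVE is measured against the baseline residuals, while the ensemble PVE is measured against the raw outcome, mirroring the two reporting levels used in the figure captions.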

**Figure 2**
Estimated phenotypic PVE of baseline models fitted using non-linear ML and linear models. Estimated PVEs in the TOPMed test dataset for baseline model performance for prediction of SBP and DBP in the overall test dataset and stratified by self-reported race/ethnicity (White N = 10,877, Hispanic/Latino N = 3831, Black N = 3657, Asian N = 403 for DBP; White N = 10,823, Hispanic/Latino N = 3877, Black N = 3674, Asian N = 374 for SBP). The visualized 95% confidence intervals were computed as the 2.5% and 97.5% percentiles of the bootstrap distribution of the PVEs estimated over the test dataset. *PVE* percent variance explained, *TOPMed* Trans-Omics for Precision Medicine, *SBP* systolic blood pressure, *DBP* diastolic blood pressure.
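
The percentile bootstrap interval mentioned in the Figure 2 caption can be sketched as follows. This is an illustration under assumptions, not the authors' code: the data are synthetic, the number of bootstrap replicates (`n_boot`) is arbitrary, and PVE is assumed to be 1 minus mean squared error over the outcome variance.

```python
import numpy as np

def pve(y_true, y_pred):
    """Assumed definition: 1 - MSE / Var(y)."""
    return 1.0 - np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def bootstrap_pve_ci(y_true, y_pred, n_boot=1000, seed=0):
    """Percentile bootstrap: resample test rows with replacement,
    recompute PVE, and take the 2.5th and 97.5th percentiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # indices sampled with replacement
        stats[b] = pve(y_true[idx], y_pred[idx])
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return lo, hi

# Synthetic test set: the prediction captures about half of the variance.
rng = np.random.default_rng(1)
signal = rng.normal(size=2000)
y_test = signal + rng.normal(size=2000)
y_pred = signal

point = pve(y_test, y_pred)
lo, hi = bootstrap_pve_ci(y_test, y_pred)
print(f"PVE = {point:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Only the held-out test rows are resampled here; the fitted model is kept fixed, so the interval reflects sampling variability in the test set rather than in training.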

**Figure 3**
Comparison of genetic and ensemble model performance in TOPMed test dataset. (a) Estimated PVEs in the TOPMed test dataset obtained by genetic models incorporating one or more PRSs according to the three complexity levels. Level 1: a single PRS based on the UKBB + ICBP GWAS. Level 2: seven PRSs based on the UKBB + ICBP GWAS, computed at seven p-value thresholds. Level 3: 21 PRSs, 7 PRSs based on each of the UKBB + ICBP, MVP, and BBJ GWAS. PVE is reported for predicting residuals from the baseline model, where the baseline model was a non-linear ML model and only used non-genetic covariates. (b) Estimated PVEs in the TOPMed test dataset for the ensemble model at the raw phenotypic level. PVEs are reported for models of SBP and DBP, in the overall test dataset and stratified by self-reported race/ethnicity (White N = 10,877, Hispanic/Latino N = 3831, Black N = 3657, Asian N = 403 for DBP; White N = 10,823, Hispanic/Latino N = 3877, Black N = 3674, Asian N = 374 for SBP). The visualized 95% confidence intervals were computed as the 2.5% and 97.5% percentiles of the bootstrap distribution of the PVEs estimated over the test dataset. *PVE* percent variance explained, *TOPMed* Trans-Omics for Precision Medicine, *SBP* systolic blood pressure, *DBP* diastolic blood pressure, *PRS* polygenic risk score.

**Figure 4**
Integration of LASSO feature selection tool into the ensemble model workflow. (a) The workflow of the ensemble model with the integration of the LASSO variable selection tool. To include local PRSs in the ensemble model while attempting to avoid overfitting, we added a LASSO selection step to the ensemble model development. As visualized, the residuals of the baseline model were used as the outcome in LASSO penalized regression with the local PRSs as features. LASSO substantially reduced the number of local PRSs (to 827 for SBP and 224 for DBP). The local PRSs selected by LASSO were then used as an input into the genetic model for prediction of the baseline residuals (${\hat{r}}_{b}$). (b) Genomic locations of local PRSs, calculated over predefined LD regions, selected by LASSO for SBP and DBP. (c) Comparison between the estimated PVE in the TOPMed test dataset for ensemble model Level 3 using global PRSs and the ensemble model using linear regression and local PRSs. PVEs are reported for models of SBP and DBP, in the overall test dataset and stratified by self-reported race/ethnicity (White N = 10,877, Hispanic/Latino N = 3831, Black N = 3657, Asian N = 403 for DBP; White N = 10,823, Hispanic/Latino N = 3877, Black N = 3674, Asian N = 374 for SBP). The visualized 95% confidence intervals were computed as the 2.5% and 97.5% percentiles of the bootstrap distribution of the PVEs estimated over the test dataset. *BP* blood pressure, *DBP* diastolic blood pressure, *LASSO* least absolute shrinkage and selection operator, *PRS* polygenic risk score, *SBP* systolic blood pressure.
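
The LASSO selection step in the Figure 4 caption (baseline residuals as the outcome, local PRSs as features) can be sketched with a small coordinate-descent solver. Everything here is an assumption made for illustration: the data are synthetic, the penalty `lam` is picked by hand rather than tuned, and the solver is a textbook implementation rather than the software used in the study.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, r, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||r - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = r.copy()                      # equals r - X @ beta while beta is 0
    col_sq = (X ** 2).sum(axis=0) / n     # per-feature scale
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual
            # (feature j's own contribution added back in).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep resid in sync
            beta[j] = new
    return beta

# Synthetic stand-ins: 50 hypothetical local PRSs, 3 with a real effect.
rng = np.random.default_rng(2)
n, p = 1000, 50
local_prs = rng.normal(size=(n, p))
r_b = (local_prs[:, :3] @ np.array([1.0, -0.8, 0.6])
       + rng.normal(size=n))              # hypothetical baseline residuals

# Standardize features and center the outcome before penalized fitting.
X = (local_prs - local_prs.mean(axis=0)) / local_prs.std(axis=0)
r = r_b - r_b.mean()

beta = lasso_cd(X, r, lam=0.2)
selected = np.flatnonzero(beta != 0)      # indices of retained local PRSs
print("selected local PRS indices:", selected)
```

The retained columns (`selected`) would then be passed to the genetic model as its PRS inputs. In practice the penalty would usually be chosen by cross-validation on the training data only.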

**Figure 5**
Application of the model trained on the TOPMed dataset to the MGBB data. (a) The workflow of the ensemble model, with the baseline model trained on the MGBB dataset and application of the genetic models, trained on the TOPMed data, incorporating one or more PRSs according to the three model complexity levels. Level 1: a single PRS based on the UKBB + ICBP GWAS. Level 2: seven PRSs based on the UKBB + ICBP GWAS, with different p-value thresholds. Level 3: 21 PRSs, 7 PRSs based on each of the UKBB + ICBP, MVP, and BBJ GWAS. (b) Estimated PVE in the MGBB test dataset for XGBoost genetic models of three levels of complexity fitted on the TOPMed dataset, with the baseline model fitted using TOPMed baseline model weights (top) and using MGBB baseline model weights (bottom). PVEs are shown for the performance in prediction of the second order of residuals for SBP and DBP phenotypes in the overall test dataset and stratified by race/ethnicity (White N = 7985, Black N = 412, Asian N = 200). *BP* blood pressure, *MGBB* Mass General Brigham Biobank, *PRS* polygenic risk score, *TOPMed* Trans-Omics for Precision Medicine.

Update of

Machine learning models for blood pressure phenotypes combining multiple polygenic risk scores.
Hrytsenko Y, Shea B, Elgart M, Kurniansyah N, Lyons G, Morrison AC, Carson AP, Haring B, Mitchell BD, Psaty BM, Jaeger BC, Gu CC, Kooperberg C, Levy D, Lloyd-Jones D, Choi E, Brody JA, Smith JA, Rotter JI, Moll M, Fornage M, Simon N, Castaldi P, Casanova R, Chung RH, Kaplan R, Loos RJF, Kardia SLR, Rich SS, Redline S, Kelly T, O'Connor T, Zhao W, Kim W, Guo X, Der Ida Chen Y; Trans-Omics in Precision Medicine Consortium; Sofer T. medRxiv [Preprint]. 2023 Dec 14:2023.12.13.23299909. doi: 10.1101/2023.12.13.23299909. Update in: Sci Rep. 2024 May 30;14(1):12436. doi: 10.1038/s41598-024-62945-9. PMID: 38168328.
