This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Dec 14:2023.12.13.23299909.

doi: 10.1101/2023.12.13.23299909.

Machine learning models for blood pressure phenotypes combining multiple polygenic risk scores

Yana Hrytsenko^{1,2,3}, Benjamin Shea³, Michael Elgart^{1,2}, Nuzulul Kurniansyah¹, Genevieve Lyons⁴, Alanna C Morrison⁵, April P Carson⁶, Bernhard Haring^{7,8}, Braxton D Mitchell⁹, Bruce M Psaty^{10,11,12,13}, Byron C Jaeger¹⁴, C Charles Gu¹⁵, Charles Kooperberg¹⁶, Daniel Levy^{17,18}, Donald Lloyd-Jones¹⁹, Eunhee Choi²⁰, Jennifer A Brody^{10,12}, Jennifer A Smith^{21,22}, Jerome I Rotter²³, Matthew Moll^{1,2,24}, Myriam Fornage^{5,25}, Noah Simon²⁶, Peter Castaldi^{1,2}, Ramon Casanova¹³, Ren-Hua Chung²⁷, Robert Kaplan^{28,7}, Ruth J F Loos^{29,30}, Sharon L R Kardia²¹, Stephen S Rich³¹, Susan Redline^{2,32}, Tanika Kelly³³, Timothy O'Connor⁸, Wei Zhao^{21,22}, Wonji Kim³⁴, Xiuqing Guo²³, Yii Der Ida Chen²³; Trans-Omics in Precision Medicine Consortium; Tamar Sofer^{1,2,3,4}

Affiliations

¹ Department of Medicine, Brigham and Women's Hospital, Boston, MA.
² Department of Medicine, Harvard Medical School, Boston, MA.
³ CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA.
⁴ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.
⁵ Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA.
⁶ Department of Medicine, University of Mississippi Medical Center, Jackson, MS, USA.
⁷ Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA.
⁸ Department of Medicine III, Saarland University, Homburg, Saarland, Germany.
⁹ Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA.
¹⁰ Department of Medicine, University of Washington, Seattle, WA, USA.
¹¹ Department of Epidemiology, University of Washington, Seattle, WA, USA.
¹² Cardiovascular Health Research Unit, University of Washington, Seattle, WA, USA.
¹³ Health Systems and Population Health, University of Washington, Seattle, WA, USA.
¹⁴ Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC, USA.
¹⁵ The Center for Biostatistics and Data Science, Washington University, St. Louis, USA.
¹⁶ Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA.
¹⁷ The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA.
¹⁸ The Framingham Heart Study, Framingham, MA, USA.
¹⁹ Department of Preventive Medicine, Northwestern University, Chicago, IL, USA.
²⁰ Columbia Hypertension Laboratory, Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA.
²¹ Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA.
²² Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA.
²³ The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA.
²⁴ VA Boston Healthcare System, West Roxbury, MA, USA.
²⁵ Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA.
²⁶ Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA.
²⁷ Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taipei City, Taiwan.
²⁸ Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.
²⁹ The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
³⁰ Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty for Health and Medical Sciences, University of Copenhagen, Denmark, DK.
³¹ Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA.
³² Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
³³ Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA.
³⁴ Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital.

PMID: 38168328
PMCID: PMC10760279
DOI: 10.1101/2023.12.13.23299909

Yana Hrytsenko et al. medRxiv. 2023.

Update in

Machine learning models for predicting blood pressure phenotypes by combining multiple polygenic risk scores.
Hrytsenko Y, Shea B, Elgart M, Kurniansyah N, Lyons G, Morrison AC, Carson AP, Haring B, Mitchell BD, Psaty BM, Jaeger BC, Gu CC, Kooperberg C, Levy D, Lloyd-Jones D, Choi E, Brody JA, Smith JA, Rotter JI, Moll M, Fornage M, Simon N, Castaldi P, Casanova R, Chung RH, Kaplan R, Loos RJF, Kardia SLR, Rich SS, Redline S, Kelly T, O'Connor T, Zhao W, Kim W, Guo X, Ida Chen YD; Trans-Omics in Precision Medicine Consortium; Sofer T. Sci Rep. 2024 May 30;14(1):12436. doi: 10.1038/s41598-024-62945-9. PMID: 38816422.

Abstract

We construct non-linear machine learning (ML) prediction models for systolic and diastolic blood pressure (SBP, DBP) using demographic and clinical variables and polygenic risk scores (PRSs). We develop a two-model ensemble consisting of a baseline model, where prediction is based on demographic and clinical variables only, and a genetic model, where we also include PRSs. We evaluate the use of a linear versus a non-linear model at both the baseline and the genetic model levels and assess the improvement in performance when incorporating multiple PRSs. We report the ensemble model's performance as percentage variance explained (PVE) on a held-out test dataset. A non-linear baseline model improved the PVEs from 28.1% to 30.1% (SBP) and from 14.3% to 17.4% (DBP) compared with a linear baseline model. Including in the genetic model seven PRSs computed from the largest available GWAS of SBP/DBP improved the genetic model PVE from 4.8% to 5.1% (SBP) and from 4.7% to 5% (DBP) compared with using a single PRS. Adding 14 additional PRSs computed from two independent GWASs further increased the genetic model PVE to 6.3% (SBP) and 5.7% (DBP). PVE differed across self-reported race/ethnicity groups, with the non-White groups primarily benefitting from the inclusion of additional PRSs.
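The abstract reports every result as PVE on held-out data but does not give the formula. The sketch below assumes the common definition, 100 × (1 − SSE/SST), which may differ in detail from what the authors computed. The function name `pve` is hypothetical.

```python
def pve(y_true, y_pred):
    """Percent variance explained, assumed here to be 100 * (1 - SSE / SST)."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared prediction errors.
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares around the observed mean.
    sst = sum((y - mean_y) ** 2 for y in y_true)
    if sst == 0:
        raise ValueError("outcome has zero variance")
    return 100.0 * (1.0 - sse / sst)


# Toy values, not data from the study.
perfect = pve([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = pve([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
partial = pve([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

Under this definition, predicting the mean for everyone scores 0 and a model worse than the mean scores below 0, so PVE on a test set is not bounded below.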

Conflict of interest statement

B Psaty serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. G Lyons is currently an employee of Alexion Pharmaceuticals; however, her contributions to the present manuscript were performed as part of her previous affiliation at the Harvard T.H. Chan School of Public Health, and this work is not related to her current occupation and affiliation. M Moll has received grant funding from Bayer and consulting fees from TriNetX, 2ndMD, TheaHealth, Sitka, Verona Pharma, and Axon Advisors.

Figures

**Figure 1. Study design.**
Panel a: the proposed ensemble model framework. The ensemble is composed of two models. The baseline model was trained on covariates $(X_{b})$ only, for prediction of SBP and DBP $({\hat{y}}_{b})$. To assess the accuracy of the baseline model, we calculated the baseline residuals $r_{b}$ by subtracting the predicted value of SBP/DBP from the actual value. The genetic model was trained on a subset of the covariates and on genetic components (global PRSs) for prediction of the baseline model residuals $r_{b}$. We measured the accuracy of the genetic model by subtracting the predicted genetic residuals ${\hat{r}}_{g}$ from the baseline residuals $r_{b}$. The overall prediction of BP by the ensemble model is the sum of the predicted baseline BP ${\hat{y}}_{b}$ (by the baseline model) and the predicted baseline residuals ${\hat{r}}_{b}$ (by the genetic model). The accuracy of the ensemble model was assessed by calculating the percent variance explained (PVE) by the two models jointly. Panel b: the split of the primary TOPMed dataset into training and testing sets, followed by the 5-fold cross-validation procedure, in which the training dataset is further split into 5 equal parts with one part designated for testing (repeated 5 times, with 1/5 of the training data designated at random for testing at each iteration). Panel c: increasing levels of genetic model complexity, where each new model included additional PRSs. Panel d: the process of calculating local PRSs per LD block (secondary analysis). BBJ: BioBank Japan. BP: blood pressure. GWAS: genome-wide association study. LD: linkage disequilibrium. Level: model complexity level. MVP: Million Veteran Program. P: p-value threshold. PRS: polygenic risk score. SNPs: single-nucleotide polymorphisms. TOPMed: Trans-Omics for Precision Medicine project. UKBB+ICBP: UK Biobank and International Consortium for Blood Pressure.
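The two-stage structure in Panel a can be sketched as follows. This is a minimal illustration under loud assumptions: the data are simulated, and one-variable least squares stands in for both stages. The study itself evaluates non-linear ML models, and its genetic model also uses a subset of the covariates. All names (`fit_line`, `predict`, `pve`, `age`, `prs`) are hypothetical.

```python
import random


def fit_line(x, y):
    """One-variable ordinary least squares; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(model, x):
    intercept, slope = model
    return [intercept + slope * xi for xi in x]


def pve(y_true, y_pred):
    """Percent variance explained, assumed to be 100 * (1 - SSE / SST)."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_true)
    return 100.0 * (1.0 - sse / sst)


# Simulated data (an assumption, not study data): one covariate and one PRS.
rng = random.Random(0)
n = 2000
age = [rng.uniform(30, 80) for _ in range(n)]
prs = [rng.gauss(0, 1) for _ in range(n)]
sbp = [100 + 0.5 * a + 3.0 * g + rng.gauss(0, 10) for a, g in zip(age, prs)]
train, test = slice(0, 1500), slice(1500, n)

# Stage 1: baseline model predicts BP from the covariate only.
baseline = fit_line(age[train], sbp[train])
r_b_train = [y - y_hat for y, y_hat in zip(sbp[train], predict(baseline, age[train]))]

# Stage 2: genetic model predicts the baseline residuals from the PRS.
genetic = fit_line(prs[train], r_b_train)

# Ensemble prediction: baseline prediction plus predicted residual.
y_hat_b = predict(baseline, age[test])
r_hat = predict(genetic, prs[test])
y_hat_ens = [b + r for b, r in zip(y_hat_b, r_hat)]

pve_baseline = pve(sbp[test], y_hat_b)
pve_ensemble = pve(sbp[test], y_hat_ens)
```

Fitting the genetic stage on residuals rather than on raw BP means any gain in `pve_ensemble` over `pve_baseline` comes from the genetic features alone.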

**Figure 2. Estimated phenotypic PVE of baseline models fitted using non-linear ML and linear models.**
Estimated PVEs in the TOPMed test dataset for baseline model performance for prediction of SBP and DBP in the overall test dataset and stratified by self-reported race/ethnicity (White N = 10,877, Hispanic/Latino N = 3,831, Black N = 3,657, Asian N = 403 for DBP; White N = 10,823, Hispanic/Latino N = 3,877, Black N = 3,674, Asian N = 374 for SBP). The visualized 95% confidence intervals were computed as the 2.5% and 97.5% percentiles of the bootstrap distribution of the PVEs estimated over the test dataset. PVE: Percent variance explained. TOPMed: Trans-Omics in Precision Medicine project. SBP: systolic blood pressure. DBP: diastolic blood pressure.
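The percentile bootstrap intervals described in this caption can be sketched as follows, assuming individuals in the test set are resampled with replacement and PVE is recomputed on each resample. The number of resamples, the seeds, and the simulated data are illustrative assumptions, and `bootstrap_pve_ci` is a hypothetical name.

```python
import random
import statistics


def pve(y_true, y_pred):
    """Percent variance explained, assumed to be 100 * (1 - SSE / SST)."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_true)
    return 100.0 * (1.0 - sse / sst)


def bootstrap_pve_ci(y_true, y_pred, n_boot=500, seed=0):
    """Percentile bootstrap interval: 2.5% and 97.5% percentiles of resampled PVEs."""
    rng = random.Random(seed)
    n = len(y_true)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals with replacement, keeping outcome and prediction paired.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pve([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    # Cutting into 40 quantile groups puts the first cut point at 2.5%
    # and the last at 97.5%.
    cuts = statistics.quantiles(estimates, n=40)
    return cuts[0], cuts[-1]


# Simulated test set (an assumption): the prediction captures about half the variance.
rng = random.Random(1)
signal = [rng.gauss(0, 1) for _ in range(400)]
y_test = [s + rng.gauss(0, 1) for s in signal]
y_pred = signal

point = pve(y_test, y_pred)
ci_low, ci_high = bootstrap_pve_ci(y_test, y_pred, n_boot=500, seed=2)
```

Only the fitted model's predictions are resampled here. The model is not refitted, so the interval reflects test-set sampling variability only.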

**Figure 3. Comparison of genetic and ensemble model performance in TOPMed test dataset.**
Panel a: Estimated PVEs in the TOPMed test dataset obtained by genetic models incorporating one or more PRSs according to the three complexity levels. Level 1: a single PRS based on the UKB+ICBP GWAS. Level 2: seven PRSs based on the UKB+ICBP GWAS, one for each of seven p-value thresholds. Level 3: 21 PRSs, 7 PRSs based on each of the UKB+ICBP, MVP, and BBJ GWAS. PVE is reported for predicting residuals from the baseline model, where the baseline model was a non-linear ML model and only used non-genetic covariates. Panel b: Estimated PVEs in the TOPMed test dataset for the ensemble model at the raw phenotypic level. PVEs are reported for models of SBP and DBP, in the overall test dataset and stratified by self-reported race/ethnicity (White N = 10,877, Hispanic/Latino N = 3,831, Black N = 3,657, Asian N = 403 for DBP; White N = 10,823, Hispanic/Latino N = 3,877, Black N = 3,674, Asian N = 374 for SBP). The visualized 95% confidence intervals were computed as the 2.5% and 97.5% percentiles of the bootstrap distribution of the PVEs estimated over the test dataset. PVE: Percent variance explained. TOPMed: Trans-Omics in Precision Medicine project. SBP: systolic blood pressure. DBP: diastolic blood pressure. PRS: polygenic risk score.
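Levels 2 and 3 use several PRSs per GWAS, one per p-value threshold. The caption does not say how each PRS is built, so the sketch below assumes the standard form: an effect-weighted sum of allele dosages over SNPs whose GWAS p-value passes the threshold. The thresholds, effect sizes, and dosages are invented for illustration, LD clumping is omitted, and `prs_at_threshold` is a hypothetical name.

```python
def prs_at_threshold(dosages, effects, pvalues, threshold):
    """Effect-weighted dosage sum over SNPs with GWAS p-value <= threshold."""
    return sum(d * b for d, b, p in zip(dosages, effects, pvalues) if p <= threshold)


# Illustrative GWAS summary statistics for four SNPs (not from the study).
effects = [0.30, -0.20, 0.10, 0.05]
pvalues = [1e-9, 1e-6, 1e-3, 0.2]

# One individual's effect-allele dosages (0, 1, or 2 copies).
dosages = [2, 1, 0, 1]

# Hypothetical thresholds; the study uses seven, which are not listed here.
thresholds = [5e-8, 1e-4, 0.5]

# One PRS per threshold; together these would be the genetic model's features.
scores = {t: prs_at_threshold(dosages, effects, pvalues, t) for t in thresholds}
```

Looser thresholds admit more SNPs, so the resulting scores are correlated but not identical. That is why a model given several of them can extract more signal than from any single one.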

**Figure 4. Integration of LASSO feature selection tool into the ensemble model workflow.**
Panel a: The workflow of the ensemble model with the integration of the LASSO variable selection tool. To include local PRSs in the ensemble model while attempting to avoid overfitting, we added a LASSO selection step to the ensemble model development. As visualized, the residuals of the baseline model were used as the outcome in LASSO penalized regression with the local PRSs as features. LASSO substantially reduced the number of local PRSs (to 827 for SBP and 224 for DBP). The local PRSs selected by LASSO were then used as an input into the genetic model for prediction of the baseline residuals $({\hat{r}}_{b})$. Panel b: Genomic locations of local PRSs, calculated over predefined LD regions, selected by LASSO for SBP and DBP. Panel c: Comparison between the estimated PVE in the TOPMed test dataset for ensemble model Level 3 using global PRSs and the ensemble model using linear regression and local PRSs. PVEs are reported for models of SBP and DBP, in the overall test dataset and stratified by self-reported race/ethnicity (White N = 10,877, Hispanic/Latino N = 3,831, Black N = 3,657, Asian N = 403 for DBP; White N = 10,823, Hispanic/Latino N = 3,877, Black N = 3,674, Asian N = 374 for SBP). The visualized 95% confidence intervals were computed as the 2.5% and 97.5% percentiles of the bootstrap distribution of the PVEs estimated over the test dataset. BP: blood pressure. DBP: diastolic blood pressure. LASSO: least absolute shrinkage and selection operator. PRS: polygenic risk score. SBP: systolic blood pressure.
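The selection step in Panel a can be sketched as follows: regress the baseline residuals on the local PRSs with an L1 penalty and keep the features with non-zero coefficients. The coordinate-descent solver, the penalty value `lam`, and the simulated data are assumptions for illustration. The caption does not say which solver or penalty the authors used.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1.

    Assumes roughly centered features and outcome (no intercept is fitted).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes its own term.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Simulated data (an assumption): 20 local PRSs, only two of which
# (columns 0 and 3) affect the baseline residuals.
rng = random.Random(0)
n, p = 300, 20
local_prs = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
r_b = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 1) for row in local_prs]

beta = lasso_cd(local_prs, r_b, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

Only the columns in `selected` would be passed on to the genetic model. A larger `lam` keeps fewer local PRSs.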

**Figure 5. Application of the model trained on the TOPMed dataset to the MGB Biobank data.**
Panel a: The workflow of the ensemble model, with the baseline model trained on the MGB Biobank dataset and the genetic models, trained on the TOPMed data, incorporating one or more PRSs according to the three model complexity levels. Level 1: a single PRS based on the UKB+ICBP GWAS. Level 2: seven PRSs based on the UKB+ICBP GWAS, using different p-value thresholds. Level 3: 21 PRSs, 7 PRSs based on each of the UKB+ICBP, MVP, and BBJ GWAS. Panel b: Estimated PVE in the MGBB test dataset for XGBoost genetic models of three levels of complexity fitted on the TOPMed dataset, with the baseline model fitted using TOPMed baseline model weights (top) and using MGBB baseline model weights (bottom). PVEs are shown for the performance in prediction of the second order of residuals for SBP and DBP phenotypes in the overall test dataset and stratified by race/ethnicity (White N = 7,985, Black N = 412, Asian N = 200). BP: blood pressure. MGBB: Mass General Brigham Biobank. PRS: polygenic risk score. TOPMed: Trans-Omics in Precision Medicine project.
