Commun Biol. 2022 Aug 22;5(1):856.
doi: 10.1038/s42003-022-03812-z.

Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations

Michael Elgart et al. Commun Biol. 2022.

Abstract

Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Combining a PRS as a feature in an XGBoost model results in a relative increase in the percentage variance explained compared to the standard linear PRS model by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
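The percentages quoted above are relative increases in variance explained, not absolute differences, so a 100% increase means the XGBoost model explains twice as much variance as the linear PRS. A minimal sketch of that arithmetic follows; the function name and the input values are illustrative assumptions and are not taken from the paper.

```python
def relative_increase(pve_baseline: float, pve_new: float) -> float:
    """Relative increase, in percent, of pve_new over pve_baseline.

    Both arguments are percentages of phenotypic variance explained.
    """
    if pve_baseline <= 0:
        raise ValueError("baseline variance explained must be positive")
    return 100.0 * (pve_new - pve_baseline) / pve_baseline


# Hypothetical values for illustration only (not results from the paper):
# a linear PRS explaining 2% of variance versus a model explaining 4%.
doubling = relative_increase(pve_baseline=2.0, pve_new=4.0)
print(doubling)  # 100.0, i.e. the variance explained has doubled
```

Under this definition, a large relative gain can still be a small absolute gain when the baseline variance explained is low.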

Conflict of interest statement

The authors declare the following competing interests: B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. G.L. is a full-time employee of Valo Health, a technology company, but this work was not conducted in that position and is not relevant to that position. All other co-authors declare no competing interests.

Figures

Fig. 1. PRSice, LDpred2, and lassosum2 linear PRS results.
Best-performing PRSice (gray) compared to best-performing LDpred2 (orange) and best-performing lassosum2 (brown) across the hyperparameters tuned using the training data.
Fig. 2. Flow chart of ensemble model structure.
The model relies on jointly training the LASSO and XGBoost models to identify the optimal value of the L1 regularization parameter and the number of boosting steps. CV indicates cross-validation, α refers to the regularization parameter, and Ɵ is the number of boosted trees for XGBoost. The optimal values of these hyperparameters were selected using threefold CV on the mean squared error of the XGBoost model.
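The joint selection of α and Ɵ described in this caption can be sketched as a grid search scored by threefold CV. This is only an outline under stated assumptions: `fold_mse` is a hypothetical callback standing in for "fit LASSO with α on the training fold to select SNPs, fit XGBoost with Ɵ trees on those SNPs, return the held-out mean squared error", and the exhaustive grid and contiguous folds are illustrative choices, since the excerpt does not say how the authors searched or split the data.

```python
from itertools import product
from statistics import mean


def kfold_indices(n, k=3):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def select_hyperparameters(n_samples, alphas, n_trees_grid, fold_mse, k=3):
    """Return (cv_mse, alpha, n_trees) with the lowest mean CV error.

    fold_mse(alpha, n_trees, train_idx, test_idx) is a hypothetical
    callback that fits the two-stage model and returns held-out MSE.
    """
    best = None
    for alpha, n_trees in product(alphas, n_trees_grid):
        cv_mse = mean(
            fold_mse(alpha, n_trees, train, test)
            for train, test in kfold_indices(n_samples, k)
        )
        if best is None or cv_mse < best[0]:
            best = (cv_mse, alpha, n_trees)
    return best


# Toy stand-in for the real fit-and-score step, for illustration only:
# its error is minimized at alpha = 0.1 and n_trees = 200.
def toy_fold_mse(alpha, n_trees, train_idx, test_idx):
    return (alpha - 0.1) ** 2 + ((n_trees - 200) ** 2) / 1e4


best_mse, best_alpha, best_trees = select_hyperparameters(
    n_samples=30,
    alphas=[0.01, 0.1, 1.0],
    n_trees_grid=[100, 200, 400],
    fold_mse=toy_fold_mse,
)
print(best_alpha, best_trees)
```

The key point is that both hyperparameters are scored by the error of the final XGBoost stage, so α is chosen for downstream prediction rather than for the LASSO fit itself.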
Fig. 3. Nonlinear model consistently outperforms linear ones for prediction of multiple complex phenotypes in a multi-ethnic dataset.
Linear (PRS, pink), linear-regularized (LASSO, teal), and nonlinear (XGBoost; gray, purple) models were used to predict the harmonized phenotypes from TOPMed SNP data following adjustment for covariates. Two versions of the XGBoost model are shown: the first uses only the SNPs as features (gray; XGBoost alone), and the second also includes the PRS as a feature (purple; XGBoost with PRS). The LASSO model (teal) was trained on the same set of SNPs as the XGBoost model. The inset (gray) depicts the estimated heritability of the same phenotypes in the same dataset, computed with the restricted maximum-likelihood (REML) approach; error bars show 95% confidence intervals.
Fig. 4. Model performance differs by group, with XGBoost consistently outperforming PRS.
Performance of the PRS (pink) and XGBoost+PRS (purple) models trained on the combined dataset when applied to the prediction of five phenotypes in separate race/ethnic groups. Panels a, b, and c refer to White, Black, and Hispanic/Latino groups, respectively.
Fig. 5. Multi-ethnic XGBoost model performs on par with the race/ethnic-specific models.
XGBoost with PRS models were trained either on the combined dataset containing all participants (pink) or on each race/ethnic group separately (teal, gray, and purple). The models were then evaluated on each of the groups (a Black, b Hispanic/Latino, and c White).
