Nat Commun. 2024 Oct 15;15(1):8891.
doi: 10.1038/s41467-024-53333-y.

Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score

Robert Chen et al. Nat Commun. 2024.

Abstract

Identifying genetic drivers of chronic diseases is necessary for drug discovery. Here, we develop a machine learning-assisted genetic priority score, which we call ML-GPS, that incorporates genetic associations with predicted disease phenotypes to enhance target discovery. First, we construct gradient boosting models to predict 112 chronic disease phecodes in the UK Biobank and analyze associations of predicted and observed phenotypes with common, rare, and ultra-rare variants to model the allelic series. We integrate these associations with existing evidence using gradient boosting with continuous feature encoding to construct ML-GPS, training it to predict drug indications in Open Targets and externally testing it in SIDER. We then generate ML-GPS predictions for 2,362,636 gene-phecode pairs. We find that the use of predicted phenotypes, which identify substantially more genetic associations than observed phenotypes across the allele frequency spectrum, significantly improves the performance of ML-GPS. ML-GPS increases coverage of drug targets, with the top 1% of all scores providing support for 15,077 gene-phecode pairs that previously had no support. ML-GPS can also identify well-known target-disease relationships, promising targets without indicated drugs, and targets for several drugs in clinical trials, including LRRK2 inhibitors for Parkinson's disease and olpasiran for cardiovascular disease.

Conflict of interest statement

RD reported being a scientific co-founder, consultant and equity holder for Pensieve Health (pending) and being a consultant for Variant Bio, all not related to this work. DNC and MM acknowledge receipt of funding from Qiagen Ltd through a License agreement with Cardiff University, which is relevant to the use of HGMD Professional in this work. All other authors have no competing interests.

Figures

Fig. 1. Workflow for constructing ML-GPS.
Workflow for constructing ML-GPS, including machine learning models to predict phecode diagnoses across 112 phecodes, genetic association analyses using both observed and predicted phenotypes, and integration of genetic associations with existing genetic evidence.
Fig. 2. Generation of and genetic association analyses with predicted phenotypes.
a Mean AUROCs (blue) of final models for 112 of 386 phecodes meeting performance thresholds (AUROC ≥ 0.70, AUPRC ≥ phecode prevalence). Numbers at the top of the graph indicate the number of phecodes in each phecode category; each phecode is represented as a grey dot in the background. AUROCs were calculated among 183,021 UK Biobank participants with GP records (see “Study sample” in the Methods section). b Number of genes identified by P (blue), B (orange), and C (green) in common, rare, and ultra-rare variant analyses across 112 phecodes. For common and rare variant analyses, “gene” refers to any gene with a significant variant, whereas for ultra-rare variant analyses, “gene” refers to any gene with a significant test. c Odds ratios for drug indications in Open Targets with 13 variables included in ML-GPS. Note that these odds ratios are for binary encoded variables, whereas ML-GPS uses continuous encoded variables as features (see “Genetic priority scores” in the Methods section). d Odds ratios for drug indications in Open Targets with B-P and C-P; these represent genes identified by B and C not identified by P, respectively. Note that B-P and C-P are not ML-GPS features and are included solely for comparison. Plots c and d represent logistic regression analyses of 112,274 gene-phecode pairs in Open Targets, of which 4116 had a drug indication. Plots a, c and d show means with 95% confidence intervals. Source data are provided as a Source Data file. Abbreviations: AUROC (area under the receiver operating characteristic curve); AUPRC (area under the precision-recall curve); P (observed case/control); B (binarized model probabilities/predicted case/control); C (continuous model probabilities).
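The performance thresholds in panel a (AUROC ≥ 0.70, AUPRC ≥ phecode prevalence) can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, AUROC is computed from ranks (Mann-Whitney), and AUPRC is approximated by average precision, which is an assumption about the estimator.

```python
from typing import List


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC from the Mann-Whitney rank sum, using average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores starting at i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Average precision as a step-wise estimate of AUPRC (ties not averaged)."""
    order = sorted(range(len(scores)), key=lambda idx: -scores[idx])
    true_pos = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at each recalled case
    return total / sum(labels)


def meets_thresholds(labels: List[int], scores: List[float],
                     min_auroc: float = 0.70) -> bool:
    """Keep a phecode model only if AUROC >= 0.70 and AUPRC >= prevalence."""
    prevalence = sum(labels) / len(labels)
    return (auroc(labels, scores) >= min_auroc
            and average_precision(labels, scores) >= prevalence)
```

Comparing AUPRC against prevalence is a natural baseline because a classifier that ranks participants at random has an expected AUPRC equal to the case prevalence.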
Fig. 3. Performance of genetic priority scores with different architectures.
a AUPRC for drug indication in Open Targets (holdout testing) and SIDER (external testing). Grey dotted lines show the proportion of gene-phecode pairs with indications in each dataset. b Odds ratios per standard deviation increase in score for any drug indication and separately for drug indications in specific clinical trial phases in Open Targets. Brackets denote the number of gene-phecode pairs with drug indications in each phase. c, d Odds ratios of drug indications for gene-phecode pairs in the top X score percentiles compared to pairs in the 0–50 percentiles in Open Targets (c) and SIDER (d). Plots a–c represent analyses of 112,274 gene-phecode pairs in Open Targets, of which 4116 had a drug indication. Plots a and d represent analyses of 58,674 gene-phecode pairs in SIDER, of which 1883 had a drug indication. All plots show means with 95% confidence intervals. Source data are provided as a Source Data file. Abbreviations: AUPRC (area under the precision-recall curve); LR (logistic regression); GB (gradient boosting); CE (continuous encoding); L2G (locus-to-gene); P (observed case/control); B (binarized model probabilities/predicted case/control); C (continuous model probabilities).
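The comparison in panels c and d (top X score percentiles versus the 0–50 percentiles) can be sketched as a 2×2 odds ratio. This is only an illustration under stated assumptions: it uses a simple contingency table with a Woolf (log-scale) 95% confidence interval rather than whatever regression the authors fitted, and the function names and the nearest-rank percentile rule are hypothetical choices.

```python
import math
from typing import List, Tuple


def percentile_value(scores: List[float], pct: int) -> float:
    """Score at the given percentile, using the nearest-rank method."""
    ordered = sorted(scores)
    k = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[k - 1]


def top_vs_bottom_half_or(scores: List[float], indicated: List[int],
                          top_pct: int) -> Tuple[float, float, float]:
    """Odds ratio (with 95% CI) of a drug indication for pairs scoring in the
    top `top_pct` percent versus pairs in the 0-50 percentiles."""
    hi_cut = percentile_value(scores, 100 - top_pct)
    lo_cut = percentile_value(scores, 50)
    a = b = c = d = 0.0
    for s, y in zip(scores, indicated):
        if s > hi_cut:        # top group
            if y:
                a += 1
            else:
                b += 1
        elif s <= lo_cut:     # reference group (0-50 percentiles)
            if y:
                c += 1
            else:
                d += 1
        # pairs between the two cut-offs are ignored
    if 0 in (a, b, c, d):
        # Haldane-Anscombe correction to avoid division by zero
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lower, upper
```

For example, with 100 pairs where 5 of the 10 top-decile pairs and 5 of the 50 bottom-half pairs have an indication, `top_vs_bottom_half_or(scores, indicated, 10)` gives an odds ratio of (5 × 45) / (5 × 5) = 9.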
Fig. 4. Performance of genetic priority scores with different features.
a AUPRC for drug indication in Open Targets (holdout testing) and SIDER (external testing). Grey dotted lines show the proportion of gene-phecode pairs with indications in each dataset. b Odds ratios per standard deviation increase in score for any drug indication and separately for drug indications in specific clinical trial phases in Open Targets. Brackets denote the number of gene-phecode pairs with drug indications in each phase. c, d Odds ratios of drug indications for gene-phecode pairs in the top X score percentiles compared to pairs in the 0–50 percentiles in Open Targets (c) and SIDER (d). Plots a–c represent analyses of 112,274 gene-phecode pairs in Open Targets, of which 4116 had a drug indication. Plots a and d represent analyses of 58,674 gene-phecode pairs in SIDER, of which 1883 had a drug indication. All plots show means with 95% confidence intervals. Source data are provided as a Source Data file. Abbreviations: AUPRC (area under the precision-recall curve); LR (logistic regression); GB (gradient boosting); CE (continuous encoding); L2G (locus-to-gene); P (observed case/control); B (binarized model probabilities/predicted case/control); C (continuous model probabilities).
Fig. 5. Performance of direction-of-effect (DOE) genetic priority scores with different features for activator drug indications.
a AUPRC for activator drug indications in Open Targets (holdout testing) and SIDER (external testing). Grey dotted lines show the proportion of gene-phecode pairs with indications in each dataset. Inhibitor drug indications were set to 0 (no drug indication). b Odds ratios per standard deviation increase in score for any activator drug indication and separately for drug indications in specific clinical trial phases in Open Targets. Brackets denote the number of gene-phecode pairs with drug indications in each phase. c, d Odds ratios for activator drug indications for gene-phecode pairs in the top X score percentiles compared to pairs in the 0–50 percentiles in Open Targets (c) and SIDER (d). Plots a–c represent analyses of 112,274 gene-phecode pairs in Open Targets, of which 890 had an activator drug indication. Plots a and d represent analyses of 58,674 gene-phecode pairs in SIDER, of which 364 had an activator drug indication. All plots show means with 95% confidence intervals. Source data are provided as a Source Data file. Abbreviations: AUPRC (area under the precision-recall curve); L2G (locus-to-gene); P (observed case/control); B (binarized model probabilities/predicted case/control); C (continuous model probabilities).
Fig. 6. Performance of direction-of-effect (DOE) genetic priority scores with different features for inhibitor drug indications.
a AUPRC for inhibitor drug indications in Open Targets (holdout testing) and SIDER (external testing). Grey dotted lines show the proportion of gene-phecode pairs with indications in each dataset. Activator drug indications were set to 0 (no drug indication). b Odds ratios per standard deviation increase in score for any inhibitor drug indication and separately for drug indications in specific clinical trial phases in Open Targets. Brackets denote the number of gene-phecode pairs with drug indications in each phase. c, d Odds ratios for inhibitor drug indications for gene-phecode pairs in the top X score percentiles compared to pairs in the 0–50 percentiles in Open Targets (c) and SIDER (d). Plots a–c represent analyses of 112,274 gene-phecode pairs in Open Targets, of which 3019 had an inhibitor drug indication. Plots a and d represent analyses of 58,674 gene-phecode pairs in SIDER, of which 1288 had an inhibitor drug indication. All plots show means with 95% confidence intervals. Source data are provided as a Source Data file. Abbreviations: AUPRC (area under the precision-recall curve); L2G (locus-to-gene); P (observed case/control); B (binarized model probabilities/predicted case/control); C (continuous model probabilities).
Fig. 7. Analysis of targets prioritized by ML-GPS.
a, b Direct comparison between scores for ML-GPS versus L2G + Clinical + P models for gene-phecode pairs with a drug indication (a) or without a drug indication (b) in either Open Targets or SIDER. c Number of gene-phecode pairs and the proportion of these pairs with drug indications among ML-GPS and L2G + Clinical + P scores < 99th percentile versus ≥ 99th percentile. d Number of gene-phecode pairs and the proportion of these pairs with drug indications among ML-GPS scores < 99th percentile versus ≥ 99th percentile and approximated original GPS scores = 0 versus > 0. e, f Proportion of gene-phecode pairs in each score bin with the specified score increase (from L2G + Clinical + P to ML-GPS) with direct (e) or indirect (f) target-disease associations in Open Targets. g Gene set-phecode combinations with the highest normalized enrichment score for ML-GPS. Source data are provided as a Source Data file. Abbreviations: L2G (locus-to-gene); P (observed case/control).
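The cross-tabulation in panel d (ML-GPS ≥ 99th percentile versus an earlier score equal to 0) corresponds to the abstract's count of pairs that gain support only under ML-GPS. A minimal sketch of that selection follows; the tuple layout, the function name, and the rule that ties at the cut-off are included are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical record layout: (pair_id, new_score, old_score)
Pair = Tuple[str, float, float]


def newly_supported(pairs: List[Pair], top_percent: int = 1) -> List[str]:
    """Return ids of gene-phecode pairs whose new score falls in the top
    `top_percent` percent of all new scores and whose old score is 0."""
    n_top = max(1, len(pairs) * top_percent // 100)
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    cutoff = ranked[n_top - 1][1]
    # Ties at the cut-off are kept, so slightly more than n_top pairs
    # may pass the first condition.
    return [pid for pid, new, old in pairs if new >= cutoff and old == 0]
```

With 200 pairs, the top 1% is the two highest-scoring pairs, and only those of the two with an old score of 0 are returned.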
