Nat Commun. 2025 Jan 2;16(1):180.
doi: 10.1038/s41467-024-55636-6.

Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages

Chen Wang et al. Nat Commun.

Abstract

Autoimmune diseases often exhibit a preclinical stage before diagnosis. Electronic health record (EHR)-based biobanks contain genetic data and diagnostic information, which can identify preclinical individuals at risk for progression. Biobanks typically have small numbers of cases, which are not sufficient to construct accurate polygenic risk scores (PRS). Importantly, progression and case-control phenotypes may have a shared genetic basis, which we can exploit to improve prediction accuracy. We propose a novel method, Genetic Progression Score (GPS), that integrates biobank and case-control studies to predict disease progression risk. Via penalized regression, GPS incorporates PRS weights from case-control studies as a prior and forces model parameters to be similar to the prior if the prior improves prediction accuracy. In simulations, GPS consistently yields better prediction accuracy than alternative strategies relying on biobank or case-control samples only and those combining biobank and case-control samples. The improvement is particularly evident when the biobank sample is smaller or the genetic correlation is lower. We derive PRS for the progression from preclinical rheumatoid arthritis and systemic lupus erythematosus in the BioVU biobank and validate them in All of Us. For both diseases, GPS achieves the highest prediction R² and the resulting PRS yields the strongest correlation with progression prevalence.
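The idea of shrinking model parameters toward case-control PRS weights only when that helps prediction can be illustrated with a toy sketch. The code below is not the GPS objective from the paper: it assumes a simple ridge-type penalty on the distance to the prior as a stand-in, and the names `fit_shrunk_to_prior`, `choose_lam`, and the data are hypothetical. The penalty strength `lam` is picked on held-out data, so a helpful prior gets a large `lam` and an unhelpful one gets `lam = 0`.

```python
def predict(X, b):
    """Linear predictor X @ b for list-of-lists X."""
    return [sum(xij * bj for xij, bj in zip(row, b)) for row in X]

def fit_shrunk_to_prior(X, y, prior, lam, steps=500):
    """Minimise (1/2n)*||y - X b||^2 + (lam/2)*||b - prior||^2 by gradient descent.

    Assumption: an L2 penalty toward the prior is used purely for illustration.
    """
    n, p = len(X), len(prior)
    # Step size 1/(trace(X'X)/n + lam) upper-bounds the curvature, so descent is stable.
    curvature = sum(v * v for row in X for v in row) / n + lam
    lr = 1.0 / curvature
    b = list(prior)  # start at the case-control weights
    for _ in range(steps):
        resid = [yh - yi for yh, yi in zip(predict(X, b), y)]
        grad = [
            sum(X[i][j] * resid[i] for i in range(n)) / n + lam * (b[j] - prior[j])
            for j in range(p)
        ]
        b = [bj - lr * gj for bj, gj in zip(b, grad)]
    return b

def choose_lam(X, y, X_val, y_val, prior, grid):
    """Pick the penalty strength with the lowest validation mean squared error."""
    def val_mse(lam):
        b = fit_shrunk_to_prior(X, y, prior, lam)
        return sum((yh - yi) ** 2 for yh, yi in zip(predict(X_val, b), y_val)) / len(y_val)
    return min(grid, key=val_mse)

# Hypothetical toy data: two variants, four training individuals.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [2, -1, 2, -1]
prior = [1.0, 0.0]  # stand-in for case-control PRS weights
b = fit_shrunk_to_prior(X, y, prior, lam=0.5)  # approximately [1.5, -0.5]
```

With `lam = 0` the fit reduces to ordinary least squares on the biobank-like data, and as `lam` grows the estimate approaches the prior, which mirrors the trade-off the abstract describes at a very coarse level.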

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Detailed workflow of GPS.
GPS combines CC GWAS data and EHR-based biobanks to construct PRS models for predicting the risk of preclinical → disease progression.
Fig. 2. Prediction accuracy of different PRS models in simulations (200 causal variants).
All causal variants are shared between progression and case-control phenotypes in this simulation. The prediction accuracy is evaluated by the mean prediction R² across 20 simulated replicates. The error bar indicates the standard deviation of prediction R² across the 20 simulation replicates. Each row represents different PRS models using the same baseline PRS method. MVL uses Lassosum as its baseline framework, so it cannot accommodate alternative baseline PRS methods. To facilitate comparisons, we estimate the prediction R² of MVL by repeating across the scenarios in different rows and taking the average. The sample size of the progression cohort is 500 in (A), 1000 in (B), 2000 in (C), and 3000 in (D). The number of causal variants is set to 200. gcor: genetic correlation; Nprog: sample size of the biobank study of the progression phenotype. Super-stacking models are not included here but are shown in Supplementary Fig. 3. Scenarios with different causal variants between case-control and progression phenotypes are given in Supplementary Fig. 1.
Fig. 3. The association between PRS and the prevalence of RF positive → RA progressions in the All of Us data.
The All of Us data is not used to train genetic risk scores. The Pearson correlation coefficient (and corresponding p-values from two-sided t-test) between PRS and the progression prevalence at each decile in the All of Us data are labeled on the plot. The error bands represent 95% confidence intervals of fitted linear regression lines. MVL uses Lassosum as its baseline framework. The prediction accuracy of MVL is obtained by repeating across the scenarios of different rows and taking the average. GPS consistently yields stronger and more significant correlations between the predicted and observed progression prevalence in the independent test dataset, which demonstrates improved accuracy. Super-stacking models are shown in Supplementary Fig. 5.
Fig. 4. The association between PRS and the prevalence of ANA positive → SLE progressions in the All of Us data.
The All of Us data is not used to train genetic risk scores. The Pearson correlation coefficient (and corresponding p-values from two-sided t-test) between PRS and the progression prevalence at each decile in the All of Us data are labeled on the plot. The error bands represent 95% confidence intervals of fitted linear regression lines. MVL uses Lassosum as its baseline framework. The prediction accuracy of MVL is obtained by repeating across the scenarios of different rows and taking the average. GPS consistently yields stronger and more significant correlations between the predicted and observed progression prevalence in the independent test dataset, which demonstrates improved accuracy. Super-stacking models are shown in Supplementary Fig. 6.
Fig. 5. Cumulative distributions of marginal association statistics testing the association with preclinical to disease progressions in the All of Us dataset.
We trained the progression risk scores in the BioVU biobank. We also performed GWAS, comparing preclinical to disease cases, in the All of Us data, which is not used in model training. For variants selected by GPS or the risk scores using CC samples only, we compare the distribution of the marginal χ² statistics testing genetic associations with preclinical → disease progression. The cumulative distribution functions of the marginal χ² statistics are plotted for (A) RF positive to RA progressions and (B) ANA positive to SLE progressions, for the variants selected by the risk scores. Two-sided Kolmogorov-Smirnov (KS) tests were performed to compare the distributions and the p-values are labeled on each subpanel. At each quantile, the variants selected by GPS are often more significantly associated with the progression phenotype compared to variants selected by risk scores based on CC studies. This comparison explains why GPS is more accurate for predicting preclinical to disease progressions. Cumulative distributions of marginal association statistics contrasting healthy control with preclinical disease are given in Supplementary Fig. 7.
Fig. 6. PheWAS results for RA case-control and progression risk scores in UK Biobank.
(A) PheWAS results from the CC-PRS of RA. (B) PheWAS results from the GPS-PRS of RA. The y-axis represents the −log10(p-value) for each PheWAS code, derived using a two-sided Chi-square test after fitting a multivariate logistic regression model. The x-axis displays different PheWAS code categories. Each point corresponds to a specific PheWAS code, with downward- and upward-pointing triangles indicating negative and positive associations between the disease status defined by the PheWAS code and the PRS, respectively.
Fig. 7. PheWAS results for SLE case-control and progression risk scores in UK Biobank.
(A) PheWAS results from the CC-PRS of SLE. (B) PheWAS results from the GPS-PRS of SLE. The y-axis represents the −log10(p-value) for each PheWAS code, derived using a two-sided Chi-square test after fitting a multivariate logistic regression model. The x-axis displays different PheWAS code categories. Each point corresponds to a specific PheWAS code, with downward- and upward-pointing triangles indicating negative and positive associations between the disease status defined by the PheWAS code and the PRS, respectively.
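
The decile-based evaluation described in the captions of Figs. 3 and 4 (Pearson correlation between PRS and progression prevalence at each decile) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names and toy data are hypothetical, and each decile is summarized by its mean score, which the captions do not specify.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def decile_prevalence(scores, progressed, n_bins=10):
    """Sort individuals by risk score, split into n_bins groups, and return
    (mean score, fraction progressed) per group. Assumes len(scores) >= n_bins."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    mean_score, prevalence = [], []
    for k in range(n_bins):
        idx = order[k * n // n_bins:(k + 1) * n // n_bins]
        mean_score.append(sum(scores[i] for i in idx) / len(idx))
        prevalence.append(sum(progressed[i] for i in idx) / len(idx))
    return mean_score, prevalence

# Hypothetical toy cohort: 100 preclinical individuals, with progression
# deliberately constructed to become more common in higher score deciles.
scores = [i / 100 for i in range(100)]
progressed = [1 if (i % 10) < (i // 10) else 0 for i in range(100)]

mean_score, prevalence = decile_prevalence(scores, progressed)
r = pearson(mean_score, prevalence)
```

A score that ranks individuals well produces prevalence that rises steadily across deciles, so `r` approaches 1; a score unrelated to progression gives `r` near 0.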
