Nat Commun. 2025 Jan 2;16(1):180.

doi: 10.1038/s41467-024-55636-6.

Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages

Chen Wang^#¹,², Havell Markus^#¹, Avantika R Diwadkar¹,², Chachrit Khunsriraksakul¹, Laura Carrel³, Bingshan Li⁴, Xue Zhong⁵, Xingyan Wang², Xiaowei Zhan⁶,⁷,⁸, Galen T Foulke²,⁹, Nancy J Olsen¹⁰, Dajiang J Liu¹¹,¹², Bibo Jiang¹³

Affiliations

¹ Bioinformatics and Genomics Graduate Program, College of Medicine, Penn State University, Hershey, PA, USA.
² Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA.
³ Department of Biochemistry and Molecular Biology, College of Medicine, Penn State University, Hershey, PA, USA.
⁴ Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN, USA.
⁵ Department of Medicine, Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA.
⁶ Department of Statistical Science, Southern Methodist University, Dallas, TX, USA.
⁷ Department of Population and Data Sciences, Quantitative Biomedical Research Center, Southwestern Medical Center University of Texas, Dallas, TX, USA.
⁸ Center for Genetics of Host Defense, Southwestern Medical Center University of Texas, Dallas, TX, USA.
⁹ Department of Dermatology, College of Medicine, Penn State University, Hershey, PA, USA.
¹⁰ Department of Medicine, College of Medicine, Penn State University, Hershey, PA, USA.
¹¹ Bioinformatics and Genomics Graduate Program, College of Medicine, Penn State University, Hershey, PA, USA. dajiang.liu@psu.edu.
¹² Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA. dajiang.liu@psu.edu.
¹³ Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA. bjiang@phs.psu.edu.

^# Contributed equally.

PMID: 39747168
PMCID: PMC11695684
DOI: 10.1038/s41467-024-55636-6

Chen Wang et al. Nat Commun. 2025.

Abstract

Autoimmune diseases often exhibit a preclinical stage before diagnosis. Electronic health record (EHR)-based biobanks contain genetic data and diagnostic information, which can be used to identify preclinical individuals at risk for progression. Biobanks typically have small numbers of cases, which are not sufficient to construct accurate polygenic risk scores (PRS). Importantly, progression and case-control phenotypes may share a genetic basis, which we can exploit to improve prediction accuracy. We propose a novel method, Genetic Progression Score (GPS), that integrates biobank and case-control study data to predict disease progression risk. Via penalized regression, GPS incorporates PRS weights from case-control studies as a prior and forces model parameters to be similar to the prior if the prior improves prediction accuracy. In simulations, GPS consistently yields better prediction accuracy than alternative strategies that rely on biobank or case-control samples only and those that combine biobank and case-control samples. The improvement is particularly evident when the biobank sample is smaller or the genetic correlation is lower. We derive PRS for progression from preclinical rheumatoid arthritis and systemic lupus erythematosus in the BioVU biobank and validate them in All of Us. For both diseases, GPS achieves the highest prediction $R^{2}$, and the resulting PRS yields the strongest correlation with progression prevalence.
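The idea of shrinking model parameters toward a prior can be illustrated with a deliberately simplified sketch. It is not the GPS implementation: it assumes a quantitative outcome, a ridge-type (squared-error) penalty on the distance from the prior, and a held-out validation set for choosing the penalty weight. The names `fit_shrunk_to_prior`, `select_lambda`, and the toy data are hypothetical. Setting the penalty weight to zero ignores the prior, while a large weight reproduces it, so tuning on validation data keeps the prior only to the extent that it helps prediction.

```python
import numpy as np

def fit_shrunk_to_prior(X, y, beta_prior, lam):
    """Minimize ||y - X b||^2 + lam * ||b - beta_prior||^2 in closed form."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    rhs = X.T @ y + lam * beta_prior
    return np.linalg.solve(A, rhs)

def r2(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)

def select_lambda(X_tr, y_tr, X_val, y_val, beta_prior, lams):
    """Return the penalty weight with the highest validation R^2, and that R^2."""
    scores = [
        r2(y_val, X_val @ fit_shrunk_to_prior(X_tr, y_tr, beta_prior, lam))
        for lam in lams
    ]
    best = int(np.argmax(scores))
    return lams[best], scores[best]

# Toy data (assumption): 20 variants, a small "biobank" sample, and an
# informative but imperfect prior standing in for case-control PRS weights.
rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta_true = 0.3 * rng.standard_normal(p)
beta_prior = beta_true + 0.1 * rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

# Split into training and validation halves for tuning the penalty weight.
X_tr, y_tr = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
best_lam, best_r2 = select_lambda(X_tr, y_tr, X_val, y_val, beta_prior, lams)
print(f"selected lambda = {best_lam}, validation R^2 = {best_r2:.3f}")
```

A binary progression outcome would normally call for a logistic likelihood, and sparse penalties are common in PRS methods; the squared-error version is used here only because it has a closed-form solution.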

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Detailed workflow of GPS.**
GPS combines CC GWAS data and EHR-based biobanks to construct PRS models for predicting the risk of preclinical → disease progression.

**Fig. 2. Prediction accuracy of different PRS models in simulations (200 causal variants).**
All causal variants are shared between progression and case-control phenotypes in this simulation. Prediction accuracy is evaluated by the mean prediction $R^{2}$ across 20 simulated replicates, and error bars indicate the standard deviation of prediction $R^{2}$ across those replicates. Each row shows different PRS models built on the same baseline PRS method. MVL uses Lassosum as its baseline framework, so it cannot accommodate alternative baseline PRS methods. To facilitate comparisons, we estimate the prediction $R^{2}$ of MVL by repeating across the scenarios in different rows and taking the average. The sample size of the progression cohort is 500 in (A), 1000 in (B), 2000 in (C), and 3000 in (D). The number of causal variants is set to 200. gcor: genetic correlation; Nprog: sample size of the biobank study of the progression phenotype. Super-stacking models are not included here but are shown in Supplementary Fig. 3. Scenarios with different causal variants between case-control and progression phenotypes are given in Supplementary Fig. 1.

**Fig. 3. The association between PRS and the prevalence of RF positive → RA progressions in the *All of Us* data.**
The *All of Us* data are not used to train genetic risk scores. The Pearson correlation coefficients (and corresponding p-values from two-sided t-tests) between PRS and the progression prevalence at each decile in the *All of Us* data are labeled on the plot. The error bands represent 95% confidence intervals of the fitted linear regression lines. MVL uses Lassosum as its baseline framework. The prediction accuracy of MVL is obtained by repeating across the scenarios of different rows and taking the average. GPS consistently yields stronger and more significant correlations between the predicted and observed progression prevalence in the independent test dataset, which indicates improved accuracy. Super-stacking models are shown in Supplementary Fig. 5.

**Fig. 4. The association between PRS and the prevalence of ANA positive → SLE progressions in the *All of Us* data.**
The *All of Us* data are not used to train genetic risk scores. The Pearson correlation coefficients (and corresponding p-values from two-sided t-tests) between PRS and the progression prevalence at each decile in the *All of Us* data are labeled on the plot. The error bands represent 95% confidence intervals of the fitted linear regression lines. MVL uses Lassosum as its baseline framework. The prediction accuracy of MVL is obtained by repeating across the scenarios of different rows and taking the average. GPS consistently yields stronger and more significant correlations between the predicted and observed progression prevalence in the independent test dataset, which indicates improved accuracy. Super-stacking models are shown in Supplementary Fig. 6.

**Fig. 5. Cumulative distributions of marginal association statistics testing the association with preclinical to disease progressions in the *All of Us* dataset.**
We trained the progression risk scores in the BioVU biobank. We also performed a GWAS comparing preclinical to disease cases in the *All of Us* data, which are not used in model training. For variants selected by GPS or by the risk scores using CC samples only, we compare the distributions of the marginal $\chi^{2}$ statistics testing genetic associations with preclinical → disease progression. The cumulative distribution functions of the marginal $\chi^{2}$ statistics are plotted for (A) RF positive to RA progressions and (B) ANA positive to SLE progressions, for the variants selected by the risk scores. Two-sided Kolmogorov-Smirnov (KS) tests were performed to compare the distributions, and the p-values are labeled on each subpanel. At each quantile, the variants selected by GPS are often more significantly associated with the progression phenotype than the variants selected by risk scores based on CC studies. This comparison explains why GPS is more accurate for predicting preclinical to disease progressions. Cumulative distributions of marginal association statistics contrasting healthy controls with preclinical disease are given in Supplementary Fig. 7.

**Fig. 6. PheWAS results for RA case-control and progression risk scores in UK Biobank.**
(A) PheWAS results from CC-PRS of RA. (B) PheWAS results from GPS-PRS of RA. The y-axis represents the −log10(p-value) for each PheWAS code, derived using a two-sided Chi-square test after fitting a multivariate logistic regression model. The x-axis displays different PheWAS code categories. Each point corresponds to a specific PheWAS code, with downward- and upward-pointing triangles indicating negative and positive associations between the disease status defined by the PheWAS code and the PRS, respectively.

**Fig. 7. PheWAS results for SLE case-control and progression risk scores in UK Biobank.**
(A) PheWAS results from CC-PRS of SLE. (B) PheWAS results from GPS-PRS of SLE. The y-axis represents the −log10(p-value) for each PheWAS code, derived using a two-sided Chi-square test after fitting a multivariate logistic regression model. The x-axis displays different PheWAS code categories. Each point corresponds to a specific PheWAS code, with downward- and upward-pointing triangles indicating negative and positive associations between the disease status defined by the PheWAS code and the PRS, respectively.
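The decile-based evaluation described in the captions of Figs. 3 and 4 can be sketched as follows. This is one reading of the captions, not the authors' code: individuals are grouped into PRS deciles, progression prevalence is computed within each decile, and a Pearson correlation is taken between the mean PRS and the prevalence across the ten deciles. The function names and the simulated cohort are hypothetical, and the p-value from the t-test is omitted.

```python
import numpy as np

def decile_summary(prs, progressed):
    """Return (mean PRS, progression prevalence) for each PRS decile."""
    prs = np.asarray(prs, dtype=float)
    progressed = np.asarray(progressed, dtype=float)
    # Nine interior cut points; np.digitize then labels each score 0..9.
    cuts = np.quantile(prs, np.linspace(0.1, 0.9, 9))
    bins = np.digitize(prs, cuts)
    mean_prs = np.array([prs[bins == k].mean() for k in range(10)])
    prevalence = np.array([progressed[bins == k].mean() for k in range(10)])
    return mean_prs, prevalence

def decile_correlation(prs, progressed):
    """Pearson correlation between decile mean PRS and decile prevalence."""
    mean_prs, prevalence = decile_summary(prs, progressed)
    return float(np.corrcoef(mean_prs, prevalence)[0, 1]), prevalence

# Simulated test cohort (assumption): progression risk rises with the score.
rng = np.random.default_rng(1)
n = 5000
prs = rng.standard_normal(n)
prob = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * prs)))
progressed = rng.random(n) < prob

r, prevalence = decile_correlation(prs, progressed)
print("prevalence by decile:", np.round(prevalence, 3))
print("Pearson r across deciles:", round(r, 3))
```

Because the correlation is computed over only ten points, it summarizes how monotonically prevalence tracks the score rather than individual-level accuracy, which is why the captions report it alongside prediction $R^{2}$.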

Cited by

Medical laboratory data-based models: opportunities, obstacles, and solutions.
Meng J, Wu M, Shi F, Xie Y, Wang H, Guo Y. J Transl Med. 2025 Jul 24;23(1):823. doi: 10.1186/s12967-025-06802-x. PMID: 40707923. Review.
