Utility of polygenic scores across diverse diseases in a hospital cohort for predictive modeling

Nat Commun. 2024 Apr 12;15(1):3168. doi: 10.1038/s41467-024-47472-5.

Ting-Hsuan Sun¹, Chia-Chun Wang¹, Ting-Yuan Liu², Shih-Chang Lo¹, Yi-Xuan Huang¹, Shang-Yu Chien¹, Yu-De Chu¹, Fuu-Jen Tsai#³ ⁴ ⁵ ⁶, Kai-Cheng Hsu#⁷ ⁸ ⁹

Affiliations

¹ Artificial Intelligence Center, China Medical University Hospital, Taichung, 40447, Taiwan.
² Million-person Precision Medicine Initiative, Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan.
³ Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan. 000704@tool.caaumed.org.tw.
⁴ School of Chinese Medicine, China Medical University, Taichung, 40402, Taiwan. 000704@tool.caaumed.org.tw.
⁵ Division of Pediatric Genetics, Children's Hospital of China Medical University, Taichung, 40447, Taiwan. 000704@tool.caaumed.org.tw.
⁶ Department of Biotechnology and Bioinformatics, Asia University, Taichung, 41354, Taiwan. 000704@tool.caaumed.org.tw.
⁷ Artificial Intelligence Center, China Medical University Hospital, Taichung, 40447, Taiwan. kaichenghsu66@gmail.com.
⁸ Department of Neurology, China Medical University Hospital, Taichung, 40447, Taiwan. kaichenghsu66@gmail.com.
⁹ Department of Medicine, China Medical University, Taichung, 40402, Taiwan. kaichenghsu66@gmail.com.

# Contributed equally.

PMID: 38609356
PMCID: PMC11014845
DOI: 10.1038/s41467-024-47472-5

Ting-Hsuan Sun et al. Nat Commun. 2024.


Abstract

Polygenic scores estimate genetic susceptibility to diseases. We systematically calculated polygenic scores across 457 phenotypes using genotyping array data from China Medical University Hospital. Logistic regression models were used to assess the ability of polygenic scores to predict disease traits. The polygenic score model with the highest accuracy, defined by the maximal area under the receiver operating characteristic curve (AUC), is provided on the hospital's GeneAnaBase website. Our findings indicate 49 phenotypes with an AUC greater than 0.6, predominantly linked to endocrine and metabolic diseases. Notably, hyperplasia of the prostate exhibited the highest disease prediction ability (P value = $1.01 \times 10^{-19}$, AUC = 0.874), highlighting the potential of these polygenic scores in preventive medicine and diagnosis. This study offers a comprehensive evaluation of polygenic score performance across diverse human traits and identifies promising applications for precision medicine and personalized healthcare, which may inspire further research and development in this field.
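As a rough illustration of the two quantities the abstract relies on, the sketch below computes a polygenic score as a weighted sum of effect-allele dosages and then the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. It is not the authors' pipeline: the function names, weights, and genotypes are hypothetical, and the AUC is computed on the raw score rather than on fitted logistic regression probabilities. For a model with the score as its only predictor and a positive coefficient, the two rankings coincide, so the AUC is the same.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of effect-allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-variant effect weights and genotype dosages (not from the paper).
weights = [0.5, -0.25, 1.0]
cases = [[2, 0, 1], [1, 0, 2], [2, 1, 1]]
controls = [[0, 2, 0], [1, 1, 0], [2, 0, 1]]

case_scores = [polygenic_score(g, weights) for g in cases]
control_scores = [polygenic_score(g, weights) for g in controls]
toy_auc = auc(case_scores, control_scores)
print(f"AUC on the toy data: {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to a score that does not separate cases from controls at all, which is why the abstract treats values above 0.6 as a threshold of interest.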

Conflict of interest statement

The authors declare no competing interests. All authors declare that they have no known competing financial interests or non-financial interests that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1. Distribution of performance measurements and the number of individuals in the Polygenic Score Catalog.**
A Distribution of performance measurement records. B Distribution of AUC values and covariate usage across different ancestry cohorts. In A and B, the orange plot represents records in the PGS Catalog that do not account for covariates; the blue plot represents records that account for covariates. C Comparison of the distribution of AUC values between the CMUH model and different ancestry cohorts. The orange box represents the AUC values in the CMUH model; the blue band represents the AUC values recorded from the PGS Catalog. D Distribution of the number of individuals at different process stages. The blue plot represents the PGS records used before the initial screening step. The green plot represents the PGS records used after the initial screening step. The orange plot represents the PGS records used for the optimized model. In C and D, the box represents the interquartile range (IQR), which spans from the 25th percentile (Q1) to the 75th percentile (Q3) of the data. The lower and upper whiskers extend to the smallest and largest observations excluding outliers. The line inside the box represents the median (50th percentile) of the data. The violin plot displays a smoothed kernel density estimate of the data distribution within each group; its bottom and top edges mark the minimum and maximum values of the data. The two-sided Wilcoxon rank-sum test was used to calculate the P value. Bold text indicates that the P value < $1 \times 10^{-5}$. E Cumulative distribution of the number of individuals at each process stage. The blue line represents the PGS records used before the initial screening step. The green line represents the PGS records used after the initial screening step. The orange line represents the PGS records used for the optimized model.

**Fig. 2. Comparison of model performance with different covariate inclusion strategies.**
A Changes in AUC across 457 phenotypes, sorted by the AUC achieved by PGS models (the light gray dotted line represents AUC = 0.6). B Number of phenotypes exhibiting each AUC trend for the four covariate inclusion strategies. The red series represents phenotypes for which the model trained with PGS, sex, age, and the first four principal components performed best. The green series represents phenotypes for which the model trained with PGS, sex, and age performed best. The blue series represents phenotypes for which the model trained with PGS alone performed best. The gray series represents phenotypes for which the model trained with sex and age performed best.

**Fig. 3. Sample prevalence of the disease in CMUH correlation comparison.**
A Association between sample prevalence rate and the number of SNPs used for PGS calculations. B Association between sample prevalence rate and P values obtained from the Wilcoxon rank-sum test of PGS distributions between case and control populations (the red dotted line represents P value = $2.5 \times 10^{-6}$). C Association between sample prevalence rate and AUC values for 457 phenotypes (the red dotted line represents AUC = 0.6). In A–C, a linear regression line was plotted with a 95% confidence interval around it. The Pearson correlation coefficient (r) measures the strength and direction of the linear relationship between two variables and ranges from −1 to 1. The P value is the probability of obtaining the observed correlation coefficient, with the confidence interval set to 95%. D Classification of diseases (n = number of phenotypes with P values less than $2.5 \times 10^{-6}$ / total phenotypes; the light gray dotted line represents P value = $2.5 \times 10^{-6}$).

**Fig. 4. Differential relationship between PGS performance and disease prevalence.**
A Association between P values obtained from the Wilcoxon rank-sum test and AUC values of 457 phenotype−PGS pairs of traits. A linear regression line was plotted with a 95% confidence interval around it. The Pearson correlation coefficient (r) measures the strength and direction of the linear relationship between two variables and ranges from −1 to 1. The P value is the probability of obtaining the observed correlation coefficient, with the confidence interval set to 95%. B AUC distribution of disease categories. C PGS distribution of patients with oral aphthae (n = patient number) and the relationship between PGS percentiles and patient prevalence. D PGS distribution of patients with prostate hyperplasia (n = patient number) and the relationship between PGS percentiles and patient prevalence. In C and D, the box represents the interquartile range (IQR), which spans from the 25th percentile (Q1) to the 75th percentile (Q3) of the data. The lower and upper whiskers extend to the smallest and largest observations excluding outliers. The line inside the box represents the median (50th percentile) of the data. Observations outside this range are considered outliers and are plotted individually.

**Fig. 5. Data collection and processing workflow for data from the China Medical University Hospital and the PGS Catalog.**
In the figure, n is the number of subjects included in the analysis and m is the number of phenotype−PGS pairs.

**Fig. 6. Calculation, measurement, and display of the polygenic risk score model evaluation results.**
The entire process can be divided into four parts, namely PGS construction, model development, model evaluation, and drawing evaluation diagrams on CMUH GeneAnaBase.
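Several captions above report P values from a two-sided Wilcoxon rank-sum test comparing PGS distributions between cases and controls. The following is a minimal sketch of that test using the large-sample normal approximation, assuming no continuity correction and no tie correction to the variance; the function name and the toy scores are hypothetical, and the authors' software and exact settings are not stated in this record.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). Tied values receive midranks; the variance is not
    tie-corrected and no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    # Mann-Whitney U for sample x; U / (n1 * n2) equals the AUC.
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return z, p


# Hypothetical PGS values for four cases and four controls (not from the paper).
z, p = rank_sum_test([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
print(f"z = {z:.3f}, p = {p:.4f}")
```

Because the U statistic divided by the product of the group sizes equals the AUC, a stronger separation between case and control scores tends to move both the AUC and the rank-sum P value together, which is the relationship plotted in Fig. 4A.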

Cited by

Prediction of risk for isolated incomplete lateral meniscal injury using a dynamic nomogram based on MRI-derived anatomic radiomics and physical activity: a proof-of-concept study in 3PM-guided management.
Xie C, Chen J, Chen H, Zuo Z, Li Y, Lin L. EPMA J. 2025 Jan 27;16(1):199-215. doi: 10.1007/s13167-025-00399-3. eCollection 2025 Mar. PMID: 39991097
Diversity and longitudinal records: Genetic architecture of disease associations and polygenic risk in the Taiwanese Han population.
Liu TY, Lu HF, Chen YC, Liao CC, Lin YJ, Yang JS, Liao WL, Lin WD, Chen SY, Huang YC, Lin WY, Liu YH, Hsu KC, Chang SS, Chen HD, Chou YP, Chang JG, Wang CH, Chang CT, Huang CM, Yeo KJ, Wang TY, Yeh CC, Chen JH, Huang CP, Lai HC, Chen RH, Lin HJ, Wu PY, Wang JY, Kuo CC, Cho DY, Tsai CH, Tsai FJ. Sci Adv. 2025 Jun 6;11(23):eadt0539. doi: 10.1126/sciadv.adt0539. Epub 2025 Jun 4. PMID: 40465716 Free PMC article.
Predictive capabilities of polygenic scores in an East-Asian population-based cohort: the Singapore Chinese health study.
Chang X, Shih CC, Chen J, Lee AS, Tan P, Wang L, Liu J, Li J, Yuan JM, Khor CC, Koh WP, Dorajoo R. Commun Biol. 2025 Aug 15;8(1):1228. doi: 10.1038/s42003-025-08675-8. PMID: 40817126 Free PMC article.
Discovery and prioritization of genetic determinants of kidney function in 297,355 individuals from Taiwan and Japan.
Chen HL, Chiang HY, Chang DR, Cheng CF, Wang CCN, Lu TP, Lee CY, Chattopadhyay A, Lin YT, Lin CC, Yu PT, Huang CF, Lin CH, Yeh HC, Ting IW, Tsai HK, Chuang EY, Tin A, Tsai FJ, Kuo CC. Nat Commun. 2024 Oct 29;15(1):9317. doi: 10.1038/s41467-024-53516-7. PMID: 39472450 Free PMC article.

Grants and funding

MOHW112-TDU-B-212-144004/Ministry of Health and Welfare, Taiwan | Health Promotion Administration, Ministry of Health and Welfare (Health Promotion Administration of the Taiwan Ministry of Health and Welfare)
