Valid inference for machine learning-assisted genome-wide association studies
- PMID: 39349818
- PMCID: PMC11972620
- DOI: 10.1038/s41588-024-01934-0
Abstract
Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association studies (GWAS), which use sophisticated ML techniques to impute phenotypes and then perform GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.
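The abstract does not state the estimator itself, so the following is only an illustrative sketch of how three sets of GWAS summary statistics for one variant could in principle be combined. It assumes a generic debiasing scheme: start from the effect estimated on the observed phenotype in the labeled sample, then add a weighted correction given by the difference between the imputed-phenotype effects in the unlabeled and labeled samples. The function name, argument names, the correlation input `r`, and the variance-minimizing weight are all assumptions made for illustration and are not taken from the paper.

```python
import math


def combine_summary_stats(beta_y_lab, se_y_lab,
                          beta_yhat_lab, se_yhat_lab,
                          beta_yhat_unlab, se_yhat_unlab,
                          r):
    """Hypothetical combination of per-variant GWAS summary statistics.

    beta_y_lab / se_y_lab:         observed phenotype, labeled sample
    beta_yhat_lab / se_yhat_lab:   imputed phenotype, labeled sample
    beta_yhat_unlab / se_yhat_unlab: imputed phenotype, unlabeled sample
    r: assumed correlation between the two labeled-sample estimates

    Assumption: the unlabeled sample is independent of the labeled one
    and both are drawn from the same population, so the correction term
    has expectation zero whatever the imputation quality.
    """
    # Variance of the correction term (difference of independent estimates).
    var_diff = se_yhat_lab ** 2 + se_yhat_unlab ** 2

    # Weight minimizing the variance of the combined estimate.
    w = r * se_y_lab * se_yhat_lab / var_diff

    # Combined effect estimate.
    beta = beta_y_lab + w * (beta_yhat_unlab - beta_yhat_lab)

    # Var(A + w(B - C)) with B independent of A and C, Cov(A, C) = r*sA*sC.
    var = se_y_lab ** 2 + w ** 2 * var_diff - 2 * w * r * se_y_lab * se_yhat_lab
    se = math.sqrt(var)

    # Two-sided p-value from a standard normal reference distribution.
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p


# Made-up numbers, purely to show the call shape.
beta, se, p = combine_summary_stats(0.10, 0.05, 0.08, 0.04, 0.09, 0.01, r=0.8)
print(f"beta={beta:.4f} se={se:.4f} p={p:.3g}")
```

Under these assumptions, a poor imputation model (small `r`) drives the weight toward zero and the result falls back to the labeled-sample GWAS, while an accurate one shrinks the standard error. This is one way to see why validity need not depend on imputation quality.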
© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.
Conflict of interest statement
Competing interests
The authors declare no competing interests.