Statistical Learning Methods Applicable to Genome-Wide Association Studies on Unbalanced Case-Control Disease Data
- PMID: 34068248
- PMCID: PMC8153154
- DOI: 10.3390/genes12050736
Abstract
Imbalance between case and control groups is prevalent in genome-wide association studies (GWAS), yet it is often overlooked. The problem is becoming more pressing as the rapid growth of biobanks and electronic health records has enabled the collection of thousands of phenotypes from large cohorts, in particular for diseases with low prevalence. Unbalanced binary traits pose serious challenges to traditional statistical methods in both genomic selection and disease prediction. For example, well-established linear mixed models (LMM) yield inflated type I error rates when case-control ratios are unbalanced. In this article, we review multiple statistical approaches that have been developed to overcome the inaccuracy caused by unbalanced case-control ratios, and we comment on the advantages and limitations of each. We also explore the potential of several powerful and popular state-of-the-art machine-learning approaches that have not yet been applied in the GWAS field. This review paves the way for better analysis and understanding of unbalanced case-control disease data in GWAS.
Keywords: GWAS; disease; genomic prediction; genomic selection; unbalanced case-control.
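The type I error inflation mentioned in the abstract can be illustrated with a small exact calculation. The sketch below is not an LMM and is not taken from the article: it assumes a single binary carrier variable, a fixed total number of carriers, and a plain normal-approximation score test. The function name `exact_type1_error`, the cohort sizes, the carrier count, and the threshold `ALPHA` are all hypothetical choices made for illustration. Under the null, the number of carriers among cases is hypergeometric, so the true rejection rate of the approximate test can be summed exactly and compared with the nominal level.

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_type1_error(n_cases, n_controls, n_carriers, alpha):
    """Exact null rejection rate of a two-sided normal-approximation score test.

    Assumption (illustrative only): each carrier has one copy of the allele and
    the total number of carriers is fixed, so the carrier count among cases is
    hypergeometric under the null hypothesis of no association.
    """
    n_total = n_cases + n_controls
    frac_cases = n_cases / n_total
    # Hypergeometric mean and variance of the carrier count among cases.
    mean = n_carriers * frac_cases
    var = (n_carriers * frac_cases * (1 - frac_cases)
           * (n_total - n_carriers) / (n_total - 1))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    denom = comb(n_total, n_carriers)

    rate = 0.0
    lowest = max(0, n_carriers - n_controls)
    highest = min(n_carriers, n_cases)
    for x in range(lowest, highest + 1):
        z = (x - mean) / sqrt(var)
        if abs(z) > z_crit:
            # Probability of observing exactly x carriers among the cases.
            rate += comb(n_cases, x) * comb(n_controls, n_carriers - x) / denom
    return rate


ALPHA = 1e-4  # hypothetical nominal significance level

# Same total sample size and carrier count, different case-control ratios.
unbalanced = exact_type1_error(n_cases=50, n_controls=4950, n_carriers=20, alpha=ALPHA)
balanced = exact_type1_error(n_cases=2500, n_controls=2500, n_carriers=20, alpha=ALPHA)

print(f"nominal alpha:         {ALPHA:.1e}")
print(f"balanced (1:1) rate:   {balanced:.2e}")
print(f"unbalanced (1:99) rate: {unbalanced:.2e}")
```

In this toy setting the balanced design stays below the nominal level, while the 1:99 design rejects far more often than the nominal level, because the null distribution of the test statistic is strongly skewed and the normal approximation fails in the tail.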
Conflict of interest statement
The authors declare no conflict of interest.
Similar articles
- Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018 Sep;50(9):1335-1341. doi: 10.1038/s41588-018-0184-y. Epub 2018 Aug 13. PMID: 30104761 Free PMC article.
- Joint analysis of multiple phenotypes for extremely unbalanced case-control association studies. Genet Epidemiol. 2023 Mar;47(2):185-197. doi: 10.1002/gepi.22513. Epub 2023 Jan 24. PMID: 36691904
- GAPIT Version 2: An Enhanced Integrated Tool for Genomic Association and Prediction. Plant Genome. 2016 Jul;9(2). doi: 10.3835/plantgenome2015.11.0120. PMID: 27898829
- Association mapping in plants in the post-GWAS genomics era. Adv Genet. 2019;104:75-154. doi: 10.1016/bs.adgen.2018.12.001. Epub 2019 Jan 22. PMID: 31200809 Review.
- Status and prospects of genome-wide association studies in plants. Plant Genome. 2021 Mar;14(1):e20077. doi: 10.1002/tpg2.20077. Epub 2021 Jan 13. PMID: 33442955 Review.
Cited by
- Genome-Wide Association Study of Growth and Sex Traits Provides Insight into Heritable Mechanisms Underlying Growth Development of Macrobrachium nipponense (Oriental River Prawn). Biology (Basel). 2023 Mar 10;12(3):429. doi: 10.3390/biology12030429. PMID: 36979121 Free PMC article.
- Mathematical bounds on r2 and the effect size in case-control genome-wide association studies. Theor Popul Biol. 2025 Aug;164:1-11. doi: 10.1016/j.tpb.2025.04.003. Epub 2025 May 15. PMID: 40381956
- Mathematical bounds on r2 and the effect size in case-control genome-wide association studies. bioRxiv [Preprint]. 2024 Dec 17:2024.12.17.628943. doi: 10.1101/2024.12.17.628943. Update in: Theor Popul Biol. 2025 Aug;164:1-11. doi: 10.1016/j.tpb.2025.04.003. PMID: 39764044 Free PMC article. Updated. Preprint.
- Confirmation of HLA-II associations with TB susceptibility in admixed African samples. Elife. 2025 Jun 3;13:RP99200. doi: 10.7554/eLife.99200. PMID: 40458991 Free PMC article.
- A review of model evaluation metrics for machine learning in genetics and genomics. Front Bioinform. 2024 Sep 10;4:1457619. doi: 10.3389/fbinf.2024.1457619. eCollection 2024. PMID: 39318760 Free PMC article. Review.