Characterizing substructure via mixture modeling in large-scale genetic summary statistics
- PMID: 39824191
- PMCID: PMC11866976
- DOI: 10.1016/j.ajhg.2024.12.007
Characterizing substructure via mixture modeling in large-scale genetic summary statistics
Abstract
Genetic summary data are broadly accessible and highly useful, including for risk prediction, causal inference, fine mapping, and incorporation of external controls. However, collapsing individual-level data into summary data, such as allele frequencies, masks intra- and inter-sample heterogeneity, leading to confounding, reduced power, and bias. Ultimately, unaccounted-for substructure limits summary data usability, especially for understudied or admixed populations. There is a need for methods to enable the harmonization of summary data where the underlying substructure is matched between datasets. Here, we present Summix2, a comprehensive set of methods and software based on a computationally efficient mixture model to enable the harmonization of genetic summary data by estimating and adjusting for substructure. In extensive simulations and application to public data, we show that Summix2 characterizes finer-scale population structure, identifies ascertainment bias, and scans for potential regions of selection due to local substructure deviation. Summix2 increases the robust use of diverse, publicly available summary data, resulting in improved and more equitable research.
Keywords: admixed; confounding; equitable research; federated learning; genetic similarity; genetic summary data; harmonization; local ancestry; population stratification; selection; substructure; summary data.
Copyright © 2024 The Authors. Published by Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of interests C.R.G. owns stock in 23andMe, Inc.
Figures
Update of
-
Characterizing substructure via mixture modeling in large-scale genetic summary statistics.bioRxiv [Preprint]. 2024 May 13:2024.01.29.577805. doi: 10.1101/2024.01.29.577805. bioRxiv. 2024. Update in: Am J Hum Genet. 2025 Feb 06;112(2):235-253. doi: 10.1016/j.ajhg.2024.12.007. PMID: 38766180 Free PMC article. Updated. Preprint.
References
-
- Hendricks A.E., Billups S.C., Pike H.N.C., Farooqi I.S., Zeggini E., Santorico S.A., Barroso I., Dupuis J. ProxECAT: Proxy External Controls Association Test. A new case-control gene region association test using allele frequencies from public controls. PLoS Genet. 2018;14 doi: 10.1371/journal.pgen.1007591. - DOI - PMC - PubMed
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources
