[Preprint]. 2024 Oct 29:2024.10.24.619530.
doi: 10.1101/2024.10.24.619530.

CCAFE: Estimating Case and Control Allele Frequencies from GWAS Summary Statistics

Hayley R Stoneman et al. bioRxiv. 2024.

Abstract

Methods that use summary statistics in genetics can be powerful, but their utility is limited by what is published. For instance, many post-hoc analyses of disease studies require case and control allele frequencies (AFs), which are not always reported. We present two frameworks to derive case and control AFs from GWAS summary statistics using the odds ratio, the case and control sample sizes, and either the total (case and control aggregated) AF or the standard error (SE). In simulations and real data, derivation of case and control AFs using the total AF is highly accurate across all settings (e.g., minor AF, condition prevalence). Conversely, derivation using the SE underestimates common variant AFs (e.g., minor allele frequency >0.3) in the presence of covariates. We develop an adjustment that uses gnomAD AFs as a proxy for true AFs and reduces the bias when using the SE. While estimating case and control AFs from the total AF is preferred because of its high accuracy, estimating from the SE can be used more broadly, since the SE can be derived from p-values and beta estimates, which are commonly provided. The methods provided here expand the utility of publicly available genetic summary statistics and promote the reusability of genomic data. The R package CCAFE, with implementations of both methods, is freely available on Bioconductor and GitHub.
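To make the two inputs described above concrete, the following is a minimal Python sketch, not the CCAFE package's R interface. It assumes the odds ratio is the plain allelic odds ratio, odds(case AF) / odds(control AF), and that the total AF is the sample-size-weighted mean of the case and control AFs; under those two assumptions the control AF is the root of a quadratic. The second function assumes a two-sided Wald test (z = beta / SE) to recover the SE from beta and the p-value. The function names `case_control_af_from_total` and `se_from_beta_p` are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def case_control_af_from_total(odds_ratio, n_case, n_control, af_total):
    """Return (af_case, af_control) implied by OR, sample sizes and total AF.

    Assumptions (not taken from the CCAFE source code):
      OR       = [p_case / (1 - p_case)] / [p_ctrl / (1 - p_ctrl)]
      af_total = (n_case * p_case + n_control * p_ctrl) / (n_case + n_control)
    """
    if odds_ratio <= 0 or not 0.0 <= af_total <= 1.0:
        raise ValueError("need odds_ratio > 0 and 0 <= af_total <= 1")
    n_total = n_case + n_control
    if math.isclose(odds_ratio, 1.0):
        # No effect: cases and controls share the pooled frequency.
        return af_total, af_total

    # Substituting p_case = OR*x / (1 + (OR - 1)*x) into the pooling
    # constraint gives a*x^2 + b*x + c = 0 for x = p_ctrl.
    a = n_control * (odds_ratio - 1.0)
    b = n_case * odds_ratio + n_control - n_total * af_total * (odds_ratio - 1.0)
    c = -n_total * af_total
    root = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    candidates = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]

    # Exactly one root is a valid frequency; keep it and clip rounding noise.
    af_control = next(x for x in candidates if -1e-12 <= x <= 1.0 + 1e-12)
    af_control = min(max(af_control, 0.0), 1.0)
    af_case = (n_total * af_total - n_control * af_control) / n_case
    return af_case, af_control


def se_from_beta_p(beta, p_value):
    """Recover SE from beta and a two-sided Wald p-value: SE = |beta| / |z|."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    # Use the lower tail so very small p-values do not round to 1.0.
    z = -NormalDist().inv_cdf(p_value / 2.0)
    return abs(beta) / z
```

For example, `case_control_af_from_total(1.5, 1000, 3000, af_total)` takes the odds ratio, the case and control sample sizes, and the pooled AF, in that order. The sketch ignores covariate adjustment, so an odds ratio from an adjusted regression is treated as if it were the unadjusted allelic odds ratio.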

Keywords: GWAS; R package; allele frequencies; statistical genetics; summary statistics.

Figures

Figure 1. Estimated case and control AFs from summary statistics.
Simulated genotypes and phenotypes were generated using the PhenotypeSimulator R package. Genotypes for 10,000 variants, of which 100 were causal (shown in blue), were generated for 5,000 cases and 5,000 controls. Logistic regression was used along with 0 (A) or 3 (B) covariates to generate per variant summary statistics. CaseControl_AF and CaseControl_SE methods were used to estimate the case and control AFs. Using CaseControl_SE, bias was observed at higher MAFs when covariates were included with a systematic underestimation of MAFs (B). CaseControl_AF was accurate across the MAF spectrum, regardless of whether covariates were included. Lin’s CCC is shown between the true simulated MAF and the estimated MAF.
Figure 2. Comparison of case and control AF estimation in multiple real datasets.
Results of estimating the case MAFs for six datasets with various sample sizes using (A) CaseControl_SE, the method proposed in the ReACt software using SE, and (B) CaseControl_AF, the framework developed here using total AF. The Prostate Cancer dataset (Ncase=79148; Ncontrol=61106) has 148 variants from a 2018 PRS, and the true case and control AFs were published as part of the discovery GWAS. Diabetes EUR (Ncase=16550; Ncontrol=403923) and Diabetes AFR (Ncase=668; Ncontrol=5956) contain >9 million variants from the PanUKBB GWAS. CaseControl_SE underestimates the true MAF, with bias increasing and precision (width of the boxplot) decreasing as the true MAF increases. Conversely, we see highly accurate estimation of known AFs using CaseControl_AF, with some variability in datasets with small sample sizes.
Figure 3. Correction mitigates bias in CaseControl_SE MAF estimates.
We use our bias correction to adjust the case and control MAF estimates from CaseControl_SE for >9M genome-wide variants from the African and European Pan-UKBB Diabetes datasets. To estimate the bias correction, we used >1.2M variants on chromosome 1 that were harmonized between Pan-UKBB and gnomAD v3.1.2. gnomAD non-Finnish European (NFE) AFs were used as the proxy for true MAFs for the EUR sample (Right), and gnomAD African/African American (AFR/AFRAM) AFs were used as the proxy for true MAFs for the AFR sample (Left). We see an improvement (i.e., less bias and greater Lin’s CCC) when using the bias correction framework (gray; AFR CCC = 0.9877, EUR CCC = 0.9943), compared to the uncorrected CaseControl_SE MAF estimates (black; AFR CCC = 0.9369, EUR CCC = 0.9382).
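The captions above report agreement between true and estimated MAFs as Lin’s concordance correlation coefficient (CCC). As a reminder of what that number measures, here is a minimal Python sketch of the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population (1/n) moments. The function name `lins_ccc` is hypothetical, and the source does not state which implementation the authors used.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population (1/n) variances and covariance, as in Lin's definition.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2.0 * cov_xy / denom
```

Unlike Pearson correlation, the CCC penalizes a systematic shift: estimates that are perfectly correlated with the truth but uniformly too low score below 1, which is why it suits the underestimation pattern described for CaseControl_SE.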
