Behav Genet. 2023 Nov;53(5-6):404-415.
doi: 10.1007/s10519-023-10152-z. Epub 2023 Sep 15.

Guidelines for Evaluating the Comparability of Down-Sampled GWAS Summary Statistics

Camille M Williams et al. Behav Genet. 2023 Nov.

Abstract

Proprietary genetic datasets are valuable for boosting the statistical power of genome-wide association studies (GWASs), but their use can restrict investigators from publicly sharing the resulting summary statistics. Although researchers can resort to sharing down-sampled versions that exclude restricted data, down-sampling reduces power and might change the genetic etiology of the phenotype being studied. These problems are further complicated when using multivariate GWAS methods, such as genomic structural equation modeling (Genomic SEM), that model genetic correlations across multiple traits. Here, we propose a systematic approach to assess the comparability of GWAS summary statistics that include versus exclude restricted data. Illustrating this approach with a multivariate GWAS of an externalizing factor, we assessed the impact of down-sampling on (1) the strength of the genetic signal in univariate GWASs, (2) the factor loadings and model fit in multivariate Genomic SEM, (3) the strength of the genetic signal at the factor level, (4) insights from gene-property analyses, (5) the pattern of genetic correlations with other traits, and (6) polygenic score analyses in independent samples. For the externalizing GWAS, although down-sampling resulted in a loss of genetic signal and fewer genome-wide significant loci, the factor loadings and model fit, gene-property analyses, genetic correlations, and polygenic score analyses were robust. Given the importance of data sharing for the advancement of open science, we recommend that investigators who generate and share down-sampled summary statistics report these analyses as accompanying documentation to support other researchers' use of the summary statistics.

Keywords: Data removal; Down-sample; Genome-wide association study; Genomic SEM; Genomics; Leave-one-out; Meta-analysis; Summary statistics.

Conflict of interest statement

Camille M. Williams, Holly Poore, Peter T. Tanksley, Hyeokmoon Kweon, Natasia S. Courchesne-Krak, Diego Londono-Correa, Travis T. Mallard, Peter Barr, Philipp D. Koellinger, Irwin D. Waldman, Sandra Sanchez-Roige, K. Paige Harden, Abraham A. Palmer, Danielle M. Dick and Richard Karlsson Linnér declare that they have no conflict of interest.

Figures

Fig. 1
LD Score genetic correlations and heritability estimates for the seven indicator phenotypes of the single-factor models of EXT and EXT-minus-23andMe (see Step 1). The left panel displays the analysis of the original study with 23andMe data, the middle panel displays the down-sampled analysis excluding 23andMe data, and the right panel displays the difference in estimates computed by subtracting the values in the middle panel from those in the left panel. The lower and upper triangles display pairwise genetic correlation (rg) estimates and standard errors, respectively. The diagonals display the observed-scale heritability (h²; see Table 1 for standard errors). These results are also reported in Table S1. ADHD attention-deficit/hyperactivity disorder; ALCP problematic alcohol use; CANN lifetime cannabis use; FSEX age at first sexual intercourse (reverse coded); NSEX number of sexual partners; RISK risk tolerance; SMOK lifetime tobacco initiation
Fig. 2
Path diagram of a single-factor model with seven indicator phenotypes, of which SMOK and CANN are down-sampled, as estimated with Genomic SEM. These results are also reported in Table S2. Neither the factor loadings nor residual variances were statistically different from the original estimates (a path diagram of the original estimates was therefore omitted). The same figure displaying the results of the original study is available here: https://www.nature.com/articles/s41593-021-00908-3/figures/1. EXT-minus-23andMe genetic externalizing factor; ADHD attention-deficit/hyperactivity disorder; ALCP problematic alcohol use; CANN lifetime cannabis use; FSEX age at first sexual intercourse (reverse coded); NSEX number of sexual partners; RISK risk tolerance; SMOK lifetime tobacco initiation; AIC Akaike Information Criterion; CFI comparative fit index; SRMR standardized root mean square residual
Fig. 3
Scatterplot of genetic correlations (rg) and marginal density plots between EXT (y-axis) or EXT-minus-23andMe (x-axis) with 77 other phenotypes. Each point corresponds to the genetic correlation coefficient with its 95% confidence intervals (rg±1.96×SE) estimated with bivariate LD Score regression. Table S5 reports the estimates, their standard errors, and confidence intervals. The Spearman rank correlation reported in the figure is rounded from r = 0.9995. No particular shape, such as a normal distribution, is expected for the marginal density because the figure displays an arbitrary selection of traits
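The two quantities named in the Fig. 3 caption, the 95% confidence interval (rg ± 1.96 × SE) and the Spearman rank correlation between the two sets of estimates, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the example values are invented and not taken from Table S5.

```python
from statistics import mean


def confidence_interval(rg, se, z=1.96):
    """Return the (lower, upper) bounds rg ± z × SE."""
    return rg - z * se, rg + z * se


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical rg estimates for four traits (illustration only).
rg_down = [0.10, -0.25, 0.40, 0.62]  # EXT-minus-23andMe (x-axis)
rg_orig = [0.12, -0.27, 0.41, 0.60]  # EXT (y-axis)

lower, upper = confidence_interval(rg_down[0], se=0.03)
rho = spearman(rg_down, rg_orig)
```

A rank correlation near 1 indicates that the traits are ordered almost identically by their genetic correlation with the original and the down-sampled factor, even if individual estimates shift slightly.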
Fig. 4
Comparison of the down-sampled polygenic score (PGS) analyses in Add Health (29 phenotypes) and the Collaborative Study on the Genetics of Alcoholism (COGA; 26 phenotypes). Panel A displays the standardized difference between the coefficient estimates (i.e., a Z-statistic) of the down-sampled PGS for EXT-minus-23andMe versus the PGS for EXT from the original study. Absolute values were evaluated so that a negative standardized difference refers to an attenuation towards zero in the down-sampled analysis. Panel B displays the same measure but as a histogram. Four coefficient estimates were significantly (at the 5% level) attenuated in the down-sampled analysis: lifetime smoking initiation (Add Health and COGA; P = 3.18 × 10⁻⁵ and 4.17 × 10⁻⁵, respectively), the phenotypic externalizing factor (Add Health; P = 0.046), and lifetime cannabis use (Add Health; P = 0.03). None of the coefficients were significantly larger in the down-sampled analysis. Panel C displays a scatter plot of the absolute value of the coefficient estimates divided by their respective standard errors (i.e., a Z-statistic). These results are also reported in Table S6
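A standardized difference of the kind described for Panel A might be computed along the following lines. The caption does not give the exact formula, so this sketch makes two labeled assumptions: the standard error of the difference treats the two estimates as independent (which is unlikely to hold exactly when both PGSs are tested in the same sample), and the P value is two-sided. The function name and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standardized_difference(beta_down, se_down, beta_orig, se_orig):
    """Z-statistic for |beta_down| - |beta_orig|, with a two-sided P value.

    A negative Z means the down-sampled coefficient is attenuated
    towards zero relative to the original coefficient.
    """
    diff = abs(beta_down) - abs(beta_orig)
    # ASSUMPTION: the estimates are treated as independent, so their
    # variances add. The original analysis may have used another formula.
    se_diff = sqrt(se_down ** 2 + se_orig ** 2)
    z = diff / se_diff
    # ASSUMPTION: two-sided test against the standard normal distribution.
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Hypothetical coefficients (illustration only, not from Table S6).
z, p = standardized_difference(beta_down=0.10, se_down=0.03,
                               beta_orig=0.14, se_orig=0.04)
```

Taking absolute values before differencing is what gives the sign its interpretation: a coefficient of -0.10 compared with -0.14 yields a negative Z, just as 0.10 compared with 0.14 does.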
