Genet Epidemiol. 2011 Dec;35(8):887-98.
doi: 10.1002/gepi.20639.

Pitfalls of merging GWAS data: lessons learned in the eMERGE network and quality control procedures to maintain high data quality

Rebecca L Zuvich et al. Genet Epidemiol. 2011 Dec.

Abstract

Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient reuse of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute-supported consortium composed of five sites that perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site developed EMR-based algorithms, which together cover a core set of 14 phenotypes, to extract study samples from its DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample and marker quality and various batch effects. Upon completion of the genotyping and QC analyses for each site's primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset then reentered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here, we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion of the eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and they also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information in the database of Genotypes and Phenotypes. Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.

Figures

Figure 1
Principal components analysis (comparing eigenvector 1 vs. eigenvector 2 using Eigenstrat) color-coded by eMERGE site (prior to strand fix and prior to merging the Northwestern “clean” dataset). The tail observed was originally thought to be the non-European American samples. The green “x” and red “+” correspond to Marshfield and Group Health samples, respectively, which were genotyped at CIDR; the blue “+” and black “+” correspond to Mayo and Vanderbilt samples, respectively, which were genotyped at the Broad.
Figure 2
Manhattan plot showing test of association results for a merged dataset (prior to strand fix). The x-axis corresponds to each genotyped SNP along the genome; the y-axis corresponds to the -log10(p-value). The red line indicates genome-wide significance. The q-q plot (top) illustrates that there are many more significant results (black line) than expected by chance (red dotted line).
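The q-q comparison described in this caption can be sketched in a few lines. This is only an illustration, not the eMERGE pipeline: the function name `qq_points` is hypothetical, the axes are assumed to be on the conventional -log10 scale, and the expected p-value for the i-th smallest of n results is assumed to be i / (n + 1), which is one common plotting convention among several.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a q-q plot.

    Assumes every p-value is strictly greater than 0. Under the null
    hypothesis p-values are uniform on (0, 1), so the i-th smallest of n
    is plotted against i / (n + 1) (an assumed convention).
    """
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = rank / (n + 1)
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points
```

Points lying well above the diagonal (observed much larger than expected) across most of the range suggest a systematic artifact rather than true association.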
Figure 3
Flowchart illustrating concordance checks for HapMap and duplicate samples within each dataset (i.e. Dataset #1 (blue) and #2 (red)) and complications that can arise in a merged dataset (orange) if there are strand issues.
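As a rough illustration of the kind of strand issue this caption refers to, the sketch below compares the two alleles reported for the same SNP in two datasets. It is not the procedure used by eMERGE, which this abstract does not spell out; the function name `classify_strand` and the four labels are hypothetical. A/T and C/G SNPs are reported as ambiguous because their alleles look identical on either strand, so allele labels alone cannot reveal a flip.

```python
# Watson-Crick complements used to flip alleles to the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_strand(alleles_a, alleles_b):
    """Compare the allele pair of one SNP as reported by two datasets.

    Each argument is an iterable of two allele codes, e.g. "AG".
    Returns "same", "flipped", "ambiguous", or "mismatch".
    """
    a = frozenset(alleles_a)
    b = frozenset(alleles_b)
    b_flipped = frozenset(COMPLEMENT[x] for x in b)
    if a == b and a == b_flipped:
        return "ambiguous"   # A/T or C/G: both strands look alike
    if a == b:
        return "same"
    if a == b_flipped:
        return "flipped"     # dataset B reports the opposite strand
    return "mismatch"        # alleles disagree even after flipping
```

For example, `classify_strand("AG", "TC")` returns `"flipped"`, while `classify_strand("AT", "AT")` returns `"ambiguous"`.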
Figure 4
Manhattan plot showing test of association results for a merged dataset (after the strand fix). The x-axis corresponds to each genotyped SNP along the genome; the y-axis corresponds to the -log10(p-value). The red line indicates genome-wide significance. The q-q plot (top) illustrates that there are a few more significant results (black line) than expected by chance (red dotted line).
Figure 5
Identical by descent (IBD) plot from the merged dataset illustrating allele sharing between each pair of samples (each pair is represented by a dot on the plot). Z0 corresponds to the proportion of SNPs at which a pair shares 0 alleles. Z1 corresponds to the proportion of SNPs at which a pair shares 1 allele. The pairs at (0,0) correspond to duplicate pairs, which share 2 alleles in common.
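The reading of the (Z0, Z1) plane in this caption can be made concrete with a small sketch. It assumes Z0, Z1, and Z2 are proportions that sum to 1, so that a pair near (0,0) has Z2 near 1. The helper names and the tolerance of 0.1 are illustrative assumptions, not values taken from the paper.

```python
def z2_from(z0, z1):
    """Proportion of SNPs sharing 2 alleles, assuming Z0 + Z1 + Z2 = 1."""
    return 1.0 - z0 - z1

def is_duplicate_pair(z0, z1, tol=0.1):
    """Flag a pair whose (Z0, Z1) point lies near (0, 0).

    `tol` is a hypothetical tolerance; a real pipeline would choose its
    own cutoff from the observed distribution of pairs.
    """
    return z0 <= tol and z1 <= tol
```

Under these assumptions, an unrelated pair near (1, 0) and a parent-offspring pair near (0, 1) are both left unflagged, while duplicates and monozygotic twins near (0, 0) are flagged.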
Figure 6
Principal components analysis (comparing eigenvector 1 vs. eigenvector 2 using Eigenstrat) color-coded by eMERGE site (after the strand fix). The non-European American samples were removed from the analysis; thus, no tail is observed, unlike in Figure 1.
Figure 7
Sample and marker call rates for the raw merged dataset (before filtering SNPs from each dataset below 95% genotyping efficiency and then re-merging the datasets together).
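A minimal sketch of the call-rate calculation behind this figure follows, assuming genotypes are held as a list of samples, each a list of per-SNP calls with `None` marking a missing call. The function names and data layout are hypothetical; only the 95% marker threshold comes from the caption. The idea is that each dataset is filtered on its own markers before the datasets are merged again.

```python
def sample_call_rates(genotypes):
    """Fraction of non-missing calls for each sample (row)."""
    return [sum(call is not None for call in row) / len(row)
            for row in genotypes]

def marker_call_rates(genotypes):
    """Fraction of non-missing calls for each SNP (column)."""
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    return [sum(row[j] is not None for row in genotypes) / n_samples
            for j in range(n_markers)]

def markers_passing(genotypes, threshold=0.95):
    """Indices of SNPs whose call rate is at least `threshold`."""
    return [j for j, rate in enumerate(marker_call_rates(genotypes))
            if rate >= threshold]
```

Running `markers_passing` on each dataset separately and merging only the retained markers avoids carrying SNPs that were poorly called in one dataset into the merged call-rate distribution.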
Figure 8
Sample and marker call rates for one of the individual datasets.
Figure 9
Flowchart illustrating additional QC steps when merging several datasets.
