Genet Epidemiol. 2010 Sep;34(6):591-602.
doi: 10.1002/gepi.20516.

Quality control and quality assurance in genotypic data for genome-wide association studies

Cathy C Laurie et al. Genet Epidemiol. 2010 Sep.

Abstract

Genome-wide scans of nucleotide variation in human subjects are providing an increasing number of replicated associations with complex disease traits. Most of the variants detected have small effects and, collectively, they account for a small fraction of the total genetic variance. Very large sample sizes are required to identify and validate findings. In this situation, even small sources of systematic or random error can cause spurious results or obscure real effects. The need for careful attention to data quality has been appreciated for some time in this field, and a number of strategies for quality control and quality assurance (QC/QA) have been developed. Here we extend these methods and describe a system of QC/QA for genotypic data in genome-wide association studies (GWAS). This system includes some new approaches that (1) combine analysis of allelic probe intensities and called genotypes to distinguish gender misidentification from sex chromosome aberrations, (2) detect autosomal chromosome aberrations that may affect genotype calling accuracy, (3) infer DNA sample quality from relatedness and allelic intensities, (4) use duplicate concordance to infer SNP quality, (5) detect genotyping artifacts from dependence of Hardy-Weinberg equilibrium test P-values on allelic frequency, and (6) demonstrate sensitivity of principal components analysis to SNP selection. The methods are illustrated with examples from the "Gene Environment Association Studies" (GENEVA) program. The results suggest several recommendations for QC/QA in the design and execution of GWAS.

Figures

Figure 1. Gender and sex chromosome anomalies in the Addiction and T2D HPFS projects
The X and Y probe intensities are calculated for each sample as the mean, over all probes on the chromosome, of the sum of the normalized intensities of the two alleles. Probe-pair sample sizes are given in the axis labels. In the Addiction project, the standard error of the mean intensity for each sample ranges from 0.002 to 0.004 for the X chromosome and from 0.007 to 0.018 for the Y chromosome. In the T2D HPFS study, the standard error of the mean intensity for each sample ranges from 5 to 8 for the X chromosome and from 20 to 98 for the Y chromosome. X heterozygosity is the fraction of heterozygous calls out of all non-missing genotype calls on the X chromosome for each sample. Red and blue symbols mark subjects annotated as female and male, respectively. Symbol shape designates the tissue source of the DNA sample: triangles for whole blood and circles for lymphoblastic cell lines.
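The two per-sample quantities defined in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the list-based inputs, and the `'AA'`/`'AB'`/`'BB'`/`None` genotype coding are assumptions made for the example.

```python
from statistics import mean


def mean_chrom_intensity(allele_a, allele_b):
    """Mean over probes of (A + B) normalized intensity for one sample.

    allele_a, allele_b: equal-length sequences of normalized intensities,
    one entry per probe on the chromosome (hypothetical input layout).
    """
    if len(allele_a) != len(allele_b) or len(allele_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(a + b for a, b in zip(allele_a, allele_b))


def x_heterozygosity(genotypes):
    """Fraction of heterozygous calls among non-missing X genotype calls.

    genotypes: iterable of 'AA', 'AB', 'BB', or None for a missing call
    (this coding is an assumption for the sketch).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")  # no calls at all: heterozygosity is undefined
    return sum(g == "AB" for g in called) / len(called)
```

Plotting the X intensity against the Y intensity, or against X heterozygosity, for every sample gives the kind of scatter the figure describes, in which samples whose annotated sex disagrees with their cluster stand out.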
Figure 2. Allelic imbalance reveals mosaic aneuploidy
Scans of BAF for two blood samples in the Lung Cancer project indicate X chromosome aneuploidy in one and chromosome 8 aneuploidy in the other. In both cases, the evidence suggests cell populations that are mosaic for normal and aneuploid cells (see text).
Figure 3. Relatedness inference from IBD estimates for the Lung Cancer project
Estimates of the IBD coefficients, Z0 and Z1, are used to infer relatedness. Each point is for a pair of samples and the diagonal line is Z0 + Z1 = 1. The orange bars show the expected values +/− 2 standard deviations (SD) for full sibs (Z0 = 0.25 +/− 0.08, Z1 = 0.50 +/− 0.10), half sibs (Z1 = 0.5 +/− 0.10, Z0 = 1−Z1) and first cousins (Z1 = 0.25 +/− 0.08, Z0 = 1−Z1). Parent-offspring pairs are expected to occur at Z1=1 and duplicates (or identical twins) at Z0=Z1=0. Only pairs of samples with kinship coefficient estimates > 1/32 are plotted. (This truncation is responsible for the sharp downturn at the lower right end of the diagonal.)
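As a rough sketch of how the regions in this plot could be turned into labels, the Python below derives Z2 and the kinship coefficient from Z0 and Z1 and compares each pair against the expected values quoted in the caption. It assumes the quoted ± values are the half-widths of the 2 SD intervals. The tolerance `tol` used for "Z0 = 1 − Z1", for parent-offspring pairs, and for duplicates is a hypothetical choice, as are the function names and label strings.

```python
def kinship(z0, z1):
    """Kinship coefficient from IBD sharing probabilities: Z1/4 + Z2/2."""
    z2 = 1.0 - z0 - z1
    return z1 / 4.0 + z2 / 2.0


def classify_pair(z0, z1, tol=0.05):
    """Label a sample pair from its (Z0, Z1) estimates.

    Interval half-widths (0.08, 0.10) come from the figure caption;
    `tol` is an illustrative assumption, not a value from the paper.
    """
    z2 = 1.0 - z0 - z1
    if z0 <= tol and z1 <= tol:
        return "duplicate or identical twin"
    if z0 <= tol and abs(z1 - 1.0) <= tol:
        return "parent-offspring"
    if abs(z0 - 0.25) <= 0.08 and abs(z1 - 0.50) <= 0.10:
        return "full sibs"
    if abs(z2) <= tol and abs(z1 - 0.50) <= 0.10:
        return "half sibs"
    if abs(z2) <= tol and abs(z1 - 0.25) <= 0.08:
        return "first cousins"
    if kinship(z0, z1) > 1.0 / 32.0:  # same threshold used for plotting
        return "related, unclassified"
    return "unrelated or distant"
```

For example, `classify_pair(0.5, 0.5)` falls in the half-sib region and `classify_pair(0.75, 0.25)` in the first-cousin region. Pairs that pass the 1/32 kinship threshold but fall outside every interval are left unclassified rather than forced into the nearest category.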
Figure 4. Exact HWE test statistic and minor allele frequency
The data presented are for autosomal SNPs in European-ancestry subjects, either for the Addiction study (Illumina Human1M array) or for the NHS study of the Diabetes project (Affymetrix 6.0 array). The sample sizes are 1365 for Addiction and 1752 for the NHS study. The SNPs tested in the Addiction project (930,358) were filtered by excluding SNPs with a missing call rate greater than 15% (and by some other criteria; see Table I). The NHS SNP test results shown here (867,003) were filtered by excluding SNPs with a missing call rate greater than 15%. The plot for a completely unfiltered set is very similar. SNPs colored in red are those for which one of the two homozygotes occurs at less than 10% of the expected value, while those in green are those for which heterozygotes occur at less than 10% of the expected value. See Figure S13 for a version of this figure in which the y-axis is focused on −log10(p-value) from 0 to 10. See Figure S12 for a theoretical explanation of the curves highlighted in green and red.
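The red and green coloring rule can be illustrated with a short Python sketch that computes expected genotype counts under Hardy-Weinberg equilibrium and compares them with the observed counts. It covers only the coloring rule, not the exact HWE test itself. The function names, the returned dictionary keys, and the handling of monomorphic SNPs (never flagged) are assumptions for the example.

```python
def hwe_expected_counts(n_aa, n_ab, n_bb):
    """Expected (AA, AB, BB) counts under HWE given observed counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls for this SNP")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    return n * p * p, 2 * n * p * q, n * q * q


def deficiency_flags(n_aa, n_ab, n_bb, ratio=0.10):
    """Flag SNPs whose observed counts fall below `ratio` x expected.

    ratio=0.10 mirrors the 10% rule in the caption; a class with zero
    expected count is never flagged (an assumption of this sketch).
    """
    e_aa, e_ab, e_bb = hwe_expected_counts(n_aa, n_ab, n_bb)
    hom_deficit = (e_aa > 0 and n_aa < ratio * e_aa) or (
        e_bb > 0 and n_bb < ratio * e_bb
    )
    het_deficit = e_ab > 0 and n_ab < ratio * e_ab
    return {"homozygote_deficit": hom_deficit, "heterozygote_deficit": het_deficit}
```

Applied literally, the rule also flags low-frequency SNPs that simply have no minor-allele homozygotes, since zero observed is below 10% of any positive expectation, so the flags are best read alongside the HWE p-value rather than on their own.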
