Quality control and quality assurance in genotypic data for genome-wide association studies

PMID: 20718045
PMCID: PMC3061487
DOI: 10.1002/gepi.20516

Cathy C Laurie et al. Genet Epidemiol. 2010 Sep.

Genet Epidemiol. 2010 Sep;34(6):591-602.

Affiliation

¹ Department of Biostatistics, University of Washington, Seattle, Washington, USA.

Abstract

Genome-wide scans of nucleotide variation in human subjects are providing an increasing number of replicated associations with complex disease traits. Most of the variants detected have small effects and, collectively, they account for a small fraction of the total genetic variance. Very large sample sizes are required to identify and validate findings. In this situation, even small sources of systematic or random error can cause spurious results or obscure real effects. The need for careful attention to data quality has been appreciated for some time in this field, and a number of strategies for quality control and quality assurance (QC/QA) have been developed. Here we extend these methods and describe a system of QC/QA for genotypic data in genome-wide association studies (GWAS). This system includes some new approaches that (1) combine analysis of allelic probe intensities and called genotypes to distinguish gender misidentification from sex chromosome aberrations, (2) detect autosomal chromosome aberrations that may affect genotype calling accuracy, (3) infer DNA sample quality from relatedness and allelic intensities, (4) use duplicate concordance to infer SNP quality, (5) detect genotyping artifacts from dependence of Hardy-Weinberg equilibrium test P-values on allelic frequency, and (6) demonstrate sensitivity of principal components analysis to SNP selection. The methods are illustrated with examples from the "Gene Environment Association Studies" (GENEVA) program. The results suggest several recommendations for QC/QA in the design and execution of GWAS.

Figures

**Figure 1. Gender and sex chromosome anomalies in the Addiction and T2D HPFS projects**
The X and Y probe intensities are calculated for each sample as the mean of the sum of the normalized intensities of the two alleles for each probe on those chromosomes. Probe-pair sample sizes are given in the axis labels. In the Addiction project, the standard error of the mean intensity for each sample ranges from 0.002 to 0.004 for the X chromosome and 0.007 to 0.018 for the Y chromosome. In the T2D HPFS study, the standard error of the mean intensity for each sample ranges from 5 to 8 for the X chromosome and from 20 to 98 for the Y chromosome. X heterozygosity is the fraction of heterozygous calls out of all non-missing genotype calls on the X for each sample. Red/blue symbols are for subjects annotated as female/male. Symbols designate the tissue source of DNA samples, where triangle is for whole blood and circle is for lymphoblastic cell lines.
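The per-sample quantities described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sex_check_metrics`, the argument layout, and the coding of genotypes as B-allele counts with `NaN` for missing calls are all assumptions made for the example.

```python
import numpy as np

def sex_check_metrics(x_a, x_b, y_a, y_b, x_genotypes):
    """Return (mean X intensity, mean Y intensity, X heterozygosity) for one sample.

    x_a, x_b    : normalized intensities of the two alleles at each X probe
    y_a, y_b    : normalized intensities of the two alleles at each Y probe
    x_genotypes : X genotype calls coded as 0/1/2 B-allele counts,
                  with NaN for missing calls (hypothetical coding)
    """
    x_a, x_b = np.asarray(x_a, float), np.asarray(x_b, float)
    y_a, y_b = np.asarray(y_a, float), np.asarray(y_b, float)

    # Mean over probes of the summed intensity of the two alleles.
    x_intensity = float(np.mean(x_a + x_b))
    y_intensity = float(np.mean(y_a + y_b))

    # Fraction of heterozygous calls among non-missing X genotype calls.
    g = np.asarray(x_genotypes, float)
    called = ~np.isnan(g)
    x_het = float(np.mean(g[called] == 1)) if called.any() else float("nan")

    return x_intensity, y_intensity, x_het
```

Plotting these three values per sample against the annotated sex is what allows a mislabeled sample to be separated from a sex chromosome anomaly: intensity reflects chromosome dosage, while X heterozygosity reflects the called genotypes.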

**Figure 2. Allelic imbalance reveals mosaic aneuploidy**
Scans of BAF for two blood samples in the Lung Cancer project indicate X chromosome aneuploidy in one and chromosome 8 aneuploidy in the other. In both cases, the evidence suggests cell populations that are mosaic for normal and aneuploid cells (see text).

**Figure 3. Relatedness inference from IBD estimates for the Lung Cancer project**
Estimates of the IBD coefficients, Z₀ and Z₁, are used to infer relatedness. Each point is for a pair of samples and the diagonal line is Z₀ + Z₁ = 1. The orange bars show the expected values +/− 2 standard deviations (SD) for full sibs (Z₀ = 0.25 +/− 0.08, Z₁ = 0.50 +/− 0.10), half sibs (Z₁ = 0.5 +/− 0.10, Z₀ = 1−Z₁) and first cousins (Z₁ = 0.25 +/− 0.08, Z₀ = 1−Z₁). Parent-offspring pairs are expected to occur at Z₁=1 and duplicates (or identical twins) at Z₀=Z₁=0. Only pairs of samples with kinship coefficient estimates > 1/32 are plotted. (This truncation is responsible for the sharp downturn at the lower right end of the diagonal.)
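As a rough illustration of how the expected values in this caption could be turned into a relationship label, the sketch below converts (Z₀, Z₁) to a kinship coefficient with the standard identity φ = Z₁/4 + Z₂/2, where Z₂ = 1 − Z₀ − Z₁, and then checks each pair against the plotted windows. The function names, the tolerance `tol` used for the point expectations and for closeness to the diagonal, and the reading of the ± values as window half-widths are assumptions for this example, not part of the paper.

```python
def kinship_from_ibd(z0, z1):
    """Kinship coefficient from IBD sharing probabilities (Z2 = 1 - Z0 - Z1)."""
    z2 = 1.0 - z0 - z1
    return z1 / 4.0 + z2 / 2.0

def infer_relationship(z0, z1, min_kinship=1.0 / 32.0, tol=0.05):
    """Label a sample pair from its estimated (Z0, Z1).

    min_kinship mirrors the plotting threshold in the caption.
    tol is a hypothetical tolerance, not a value from the paper.
    """
    if kinship_from_ibd(z0, z1) <= min_kinship:
        return "below kinship threshold"

    on_diagonal = abs(z0 + z1 - 1.0) <= tol  # Z2 is approximately 0

    if z0 <= tol and z1 <= tol:
        return "duplicate or identical twin"   # expected at Z0 = Z1 = 0
    if z0 <= tol and z1 >= 1.0 - tol:
        return "parent-offspring"              # expected at Z1 = 1
    if abs(z0 - 0.25) <= 0.08 and abs(z1 - 0.50) <= 0.10:
        return "full sibs"
    if on_diagonal and abs(z1 - 0.50) <= 0.10:
        return "half sibs"
    if on_diagonal and abs(z1 - 0.25) <= 0.08:
        return "first cousins"
    return "unclassified"
```

For example, `infer_relationship(0.5, 0.5)` falls in the half-sib window on the diagonal, while a pair at (1.0, 0.0) has kinship 0 and is excluded by the threshold. Pairs that land outside every window are left unclassified rather than forced into the nearest category.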

**Figure 4. Exact HWE test statistic and minor allele frequency**
The data presented are for autosomal SNPs in European-ancestry subjects, either for the Addiction study (Illumina Human1M array) or for the NHS study of the Diabetes project (Affymetrix 6.0 array). The sample sizes are 1365 for Addiction and 1752 for the NHS study. The SNPs tested in the Addiction project (930,358) were filtered by excluding SNPs with a missing call rate greater than 15% (and some other criteria, Table I). The NHS SNP test results shown here (867,003) are filtered by excluding SNPs with a missing call rate greater than 15%. The plot for a completely unfiltered set is very similar. SNPs colored in red are those for which one of the two homozygotes occurs at less than 10% of the expected value, while those in green are those for which heterozygotes occur at less than 10% of the expected value. See Figure S13 for a version of this figure in which the Y-axis is focused on −log₁₀(p-value) from 0 to 10. See Figure S12 for a theoretical explanation of curves highlighted in green and red.
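The red and green coloring rule in this caption can be expressed as a comparison of observed genotype counts with their Hardy-Weinberg expectations. The sketch below is a literal reading of that rule under stated assumptions: the function name `hwe_deficit_flag`, the return format, and the choice to report a heterozygote deficit first when both conditions hold are illustrative, and the exact HWE test itself is not implemented here.

```python
def hwe_deficit_flag(n_aa, n_ab, n_bb, ratio=0.10):
    """Return (minor allele frequency, flag) for one SNP.

    n_aa, n_ab, n_bb : observed genotype counts
    ratio            : deficit threshold; 0.10 matches the caption's
                       "less than 10% of the expected value"
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no non-missing genotype calls")

    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    maf = min(p, q)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    exp_aa, exp_ab, exp_bb = n * p * p, 2 * n * p * q, n * q * q

    if exp_ab > 0 and n_ab < ratio * exp_ab:
        return maf, "heterozygote deficit"   # green in the figure
    if (exp_aa > 0 and n_aa < ratio * exp_aa) or (exp_bb > 0 and n_bb < ratio * exp_bb):
        return maf, "homozygote deficit"     # red in the figure
    return maf, "none"
```

Under this literal rule, a SNP with a rare allele and no observed minor homozygotes is also flagged, because zero is below any positive fraction of the expected count; an analysis would normally combine the flag with the test P-value before treating the SNP as an artifact.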
