Nat Protoc. 2010 Sep;5(9):1564-73.
doi: 10.1038/nprot.2010.116. Epub 2010 Aug 26.

Data quality control in genetic case-control association studies

Carl A Anderson et al. Nat Protoc. 2010 Sep.

Abstract

This protocol details the steps for data quality assessment and control that are typically carried out during case-control association studies. The steps described involve the identification and removal of DNA samples and markers that introduce bias. These critical steps are paramount to the success of a case-control study and are necessary before statistically testing for association. We describe how to use PLINK, a tool for handling SNP data, to perform assessments of failure rate per individual and per SNP and to assess the degree of relatedness between individuals. We also detail other quality-control procedures, including the use of SMARTPCA software for the identification of ancestral outliers. These platforms were selected because they are user-friendly, widely used and computationally efficient. Steps needed to detect and establish a disease association using case-control data are not discussed here. Issues concerning study design and marker selection in case-control studies have been discussed in our earlier protocols. This protocol, which is routinely used in our labs, should take approximately 8 h to complete.
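As a rough illustration of the per-individual checks described above, the sketch below computes a genotype failure rate and a heterozygosity rate for each sample and flags samples that fall outside chosen limits. It is not PLINK and not the authors' code: the 0/1/2/None genotype encoding, the function names and the threshold values are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumed encoding: 0, 1 or 2 copies of the minor allele; None = failed call.
Genotype = Optional[int]


def sample_qc_stats(genotypes: Sequence[Genotype]) -> Tuple[float, float]:
    """Return (failure_rate, heterozygosity_rate) for one individual."""
    total = len(genotypes)
    if total == 0:
        raise ValueError("no genotypes supplied")
    called = [g for g in genotypes if g is not None]
    failure_rate = (total - len(called)) / total
    # Heterozygosity is measured over successfully called genotypes only.
    if called:
        het_rate = sum(1 for g in called if g == 1) / len(called)
    else:
        het_rate = float("nan")
    return failure_rate, het_rate


def flag_samples(
    samples: Dict[str, Sequence[Genotype]],
    max_failure: float,
    het_low: float,
    het_high: float,
) -> List[str]:
    """Return sorted IDs of samples failing either per-individual check.

    The three limits are illustrative parameters, not values from the protocol.
    """
    flagged = []
    for sample_id, genotypes in samples.items():
        failure_rate, het_rate = sample_qc_stats(genotypes)
        too_many_failures = failure_rate > max_failure
        # A NaN heterozygosity (no calls at all) also fails this range check.
        het_out_of_range = not (het_low <= het_rate <= het_high)
        if too_many_failures or het_out_of_range:
            flagged.append(sample_id)
    return sorted(flagged)
```

In practice the limits would be chosen by inspecting the joint distribution of the two statistics across all samples, rather than fixed in advance.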

Figures

Figure 1
Genotype failure rate vs. heterozygosity across all individuals in the study. Shading indicates sample density and dashed lines denote QC thresholds.
Figure 2
Ancestry clustering based on genome-wide association data. HapMap3 reference samples: CEU (red), CHB+JPT (purple) and YRI (green). GWA samples: black crosses. 11 cases and 19 controls with a 2nd principal component score less than 0.072 (grey dashed line) were marked for removal.
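The removal rule in this caption amounts to a cutoff on the second principal component. A minimal sketch of that rule follows, assuming per-sample scores have already been computed elsewhere; the function name, the dictionary layout and the sample IDs are hypothetical, and only the 0.072 cutoff comes from the caption.

```python
from typing import Dict, List


def flag_ancestry_outliers(pc2_scores: Dict[str, float],
                           threshold: float = 0.072) -> List[str]:
    """Return sorted IDs of samples whose 2nd PC score is below the threshold.

    The comparison is strict ("less than"), following the caption's wording.
    """
    return sorted(sample_id
                  for sample_id, score in pc2_scores.items()
                  if score < threshold)


# Hypothetical scores, for illustration only.
example_scores = {"s1": 0.080, "s2": 0.050, "s3": 0.072}
outliers = flag_ancestry_outliers(example_scores)
```

A sample sitting exactly on the threshold is kept under this reading; whether boundary samples should be retained is a choice the caption does not settle.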
Figure 3
Histogram of missing data rate across all individuals passing ‘per-individual’ QC. The dashed vertical line represents the threshold (3%) at which SNPs were removed from further analysis due to an excess failure rate.
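The per-SNP filter described in this caption can be sketched as follows, assuming genotypes are held as a list of rows (one per individual who passed per-individual QC) with None marking a failed call. The data layout, function names and SNP identifiers are assumptions for the example, and treating the 3% threshold as a strict "greater than" is also an assumption.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # assumed encoding: 0/1/2 allele count, None = failed call


def snp_missing_rates(matrix: Sequence[Sequence[Genotype]]) -> List[float]:
    """Return the fraction of individuals with a failed call at each SNP."""
    if not matrix:
        raise ValueError("genotype matrix is empty")
    n_individuals = len(matrix)
    n_snps = len(matrix[0])
    if any(len(row) != n_snps for row in matrix):
        raise ValueError("all rows must cover the same SNPs")
    return [
        sum(1 for row in matrix if row[j] is None) / n_individuals
        for j in range(n_snps)
    ]


def snps_to_remove(matrix: Sequence[Sequence[Genotype]],
                   snp_ids: Sequence[str],
                   max_missing: float = 0.03) -> List[str]:
    """Return IDs of SNPs whose missing data rate exceeds max_missing."""
    rates = snp_missing_rates(matrix)
    if len(snp_ids) != len(rates):
        raise ValueError("snp_ids must match the number of columns")
    return [snp for snp, rate in zip(snp_ids, rates) if rate > max_missing]
```

Running this filter only on individuals who passed the per-individual checks matches the order implied by the caption, so that poorly genotyped samples do not inflate the per-SNP missing rates.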

References

    1. Zondervan KT, Cardon LR. Designing candidate gene and genome-wide case-control association studies. Nat Protoc. 2007;2:2492.
    2. Teo YY, et al. A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics. 2007;23:2741.
    3. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661.
    4. Clayton DG, et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet. 2005;37:1243.
    5. Marchini J, Howie B, Myers SR, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906.
