Bioinformatics. 2010 Jan 15;26(2):242-9.
doi: 10.1093/bioinformatics/btp624. Epub 2009 Nov 11.

Quantifying uncertainty in genotype calls

Benilton S Carvalho et al. Bioinformatics. 2010.

Abstract

Motivation: Genome-wide association studies (GWAS) are used to discover genes underlying complex, heritable disorders for which less powerful study designs have failed in the past. The number of GWAS has skyrocketed recently, with findings reported in top journals and the mainstream media. Microarrays are the genotype calling technology of choice in GWAS as they permit exploration of more than a million single nucleotide polymorphisms (SNPs) simultaneously. The starting point for the statistical analyses used by GWAS to determine association between loci and disease is making genotype calls (AA, AB or BB). However, the raw data, microarray probe intensities, are heavily processed before arriving at these calls. Various sophisticated statistical procedures have been proposed for transforming raw data into genotype calls. We find that variability in microarray output quality across different SNPs, different arrays and different sample batches has a substantial influence on the accuracy of genotype calls made by existing algorithms. Failure to account for these sources of variability can adversely affect the quality of findings reported by the GWAS.

Results: We developed a method based on an enhanced version of the multi-level model used by CRLMM version 1. Two key differences are that we now account for variability across batches and improve the call-specific assessment of each call. The new model permits the development of quality metrics for SNPs, samples and batches of samples. Using three independent datasets, we demonstrate that CRLMM version 2 outperforms CRLMM version 1 and the algorithm provided by Affymetrix, Birdseed. The main advantage of the new approach is that it enables the identification of low-quality SNPs, samples and batches.

Availability: Software implementing the method described in this article is available as free and open source code in the crlmm R/Bioconductor package.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
The intensities of the two alleles are plotted against each other, i.e. I_A versus I_B, for four randomly selected SNPs. The three circles illustrate the distribution of the data for each genotype (AA: green; AB: orange; BB: violet) for the first SNP. Note that these regions are incompatible with the data for the three other SNPs. This figure illustrates that the SNP to SNP variability is much larger than the within SNP variability and that naive genotyping algorithms that define global thresholds are not appropriate.
Fig. 2.
The advantage of modeling M instead of (IA, IB): here, we plot M versus S for the same data as shown in Figure 1. The across SNP variability is smaller for M than for S. However, the probe effect is not completely removed as seen in the SNP in the bottom right panel. Note that for this SNP the cluster centers are substantially shifted.
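The caption does not define M and S. A minimal sketch, assuming the usual definitions of M as the log2 ratio of the two allele intensities and S as their average log2 intensity, shows why M is less sensitive to overall probe brightness than the raw pair. The intensity values are hypothetical, chosen only for illustration.

```python
import math

def log_ratio_and_strength(intensity_a, intensity_b):
    """Return (M, S) for one pair of allele intensities.

    Assumed definitions (not stated in the caption):
    M = log2(I_A) - log2(I_B), S = (log2(I_A) + log2(I_B)) / 2.
    """
    log_a = math.log2(intensity_a)
    log_b = math.log2(intensity_b)
    return log_a - log_b, (log_a + log_b) / 2.0

# Hypothetical (I_A, I_B) values for AA, AB and BB samples at two SNPs.
# The second SNP's probes are four times brighter overall.
dim_snp = [(800.0, 100.0), (300.0, 300.0), (100.0, 800.0)]
bright_snp = [(3200.0, 400.0), (1200.0, 1200.0), (400.0, 3200.0)]

dim_ms = [log_ratio_and_strength(a, b) for a, b in dim_snp]
bright_ms = [log_ratio_and_strength(a, b) for a, b in bright_snp]

for (m_dim, s_dim), (m_bright, s_bright) in zip(dim_ms, bright_ms):
    print(f"M: {m_dim:+.2f} vs {m_bright:+.2f}   S: {s_dim:.2f} vs {s_bright:.2f}")
```

In this toy case a purely multiplicative brightness difference cancels in M and shows up only in S. Real probe effects are not purely multiplicative, which is consistent with the residual shift noted for the bottom right panel.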
Fig. 3.
An example of an SNP with three clear clusters: the calls derived from the algorithm are represented by colors (AA: green; AB: orange and BB: violet). The observation with the red circle around it was incorrectly called BB and, under the normal assumption for the residuals, the posterior was 0.999. With the assumption that the residuals follow a t-distribution, the posterior was penalized and reduced to 0.500.
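The effect described in this caption can be illustrated with a toy one-dimensional mixture. This is a sketch under stated assumptions, not the CRLMM model: the cluster centers, the common scale, the equal priors and the 3 degrees of freedom are all hypothetical. It shows how heavier tails lower the posterior of an observation that lies far from every cluster center.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def student_t_pdf(x, mean, scale, df):
    # Location-scale Student t density, computed on the log scale.
    z = (x - mean) / scale
    log_norm = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return math.exp(log_norm - (df + 1) / 2.0 * math.log1p(z * z / df))

def posteriors(x, centers, density):
    # Equal prior weight for each genotype (an assumption of this sketch).
    likelihoods = {g: density(x, c) for g, c in centers.items()}
    total = sum(likelihoods.values())
    return {g: v / total for g, v in likelihoods.items()}

# Hypothetical cluster centers on the M scale, common scale and t degrees of freedom.
centers = {"AA": 2.0, "AB": 0.0, "BB": -2.0}
SCALE = 0.2
DF = 3

# An observation sitting between the AB and BB clusters, slightly closer to BB.
x = -1.1
normal_post = posteriors(x, centers, lambda v, c: normal_pdf(v, c, SCALE))
t_post = posteriors(x, centers, lambda v, c: student_t_pdf(v, c, SCALE, DF))

print("normal residuals:", round(normal_post["BB"], 3))
print("t residuals:     ", round(t_post["BB"], 3))
```

Under normal residuals the BB call receives a posterior close to 1 even though the point is several standard deviations from the BB center. Under t residuals the same call is reported with visibly less confidence.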
Fig. 4.
Plots of formula image for a given batch. Note that they are correlated. We take advantage of this correlation to predict or improve precision of shifts when not enough training data are available. The ellipses delimit the 95% confidence regions of the estimated distribution. SNPs with points outside these regions are associated with abnormal movements and are flagged as possible outliers. (A) formula image versus formula image. (B) formula image versus formula image. The plot for formula image versus formula image is similar to that shown in (A).
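Flagging points outside a 95% confidence ellipse can be sketched as a squared Mahalanobis distance test against the 95% quantile of a chi-square distribution with 2 degrees of freedom. This assumes the estimated distribution is bivariate normal. The mean, covariance and test points below are hypothetical and are not taken from the article.

```python
import math

# 95% quantile of a chi-square distribution with 2 degrees of freedom.
# For 2 degrees of freedom it has the closed form -2 * ln(0.05), about 5.99.
CHI2_2DF_95 = -2.0 * math.log(0.05)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2D point, given a 2x2 covariance."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Quadratic form with the explicit inverse of the 2x2 matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def outside_95_ellipse(point, mean, cov):
    return mahalanobis_sq(point, mean, cov) > CHI2_2DF_95

# Hypothetical distribution of a pair of positively correlated shifts.
mean = (0.0, 0.0)
cov = ((0.04, 0.03), (0.03, 0.04))

with_correlation = (0.3, 0.3)      # both shifts move together
against_correlation = (0.3, -0.3)  # shifts move in opposite directions

print(outside_95_ellipse(with_correlation, mean, cov))
print(outside_95_ellipse(against_correlation, mean, cov))
```

The two test points have the same magnitude in each coordinate, but only the one that moves against the correlation falls outside the ellipse. The correlation therefore carries information that a per-coordinate threshold would miss.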
Fig. 5.
ADR plots for Datasets A and B. SNPs were stratified by their quality scores and ADR curves were produced for each stratum. The scores are shown to successfully identify SNPs with lower accuracies. The removal of such SNPs significantly increases the method's accuracy.
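Assuming ADR stands for accuracy versus drop rate, one point on such a curve can be computed by discarding the least confident fraction of calls and measuring accuracy on the remainder. The sketch below uses a hypothetical helper name and toy data; it is not the article's evaluation code.

```python
def accuracy_at_drop_rate(confidences, correct, drop_rate):
    """Accuracy of the calls kept after dropping the least confident fraction.

    confidences: per-call confidence scores (higher means more confident)
    correct:     per-call booleans, True when the call matches the reference
    drop_rate:   fraction of calls to discard, between 0 and 1
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    n_drop = int(round(drop_rate * len(order)))
    kept = order[n_drop:]
    if not kept:
        raise ValueError("drop_rate removes every call")
    return sum(1 for i in kept if correct[i]) / len(kept)

# Toy data: the two wrong calls also have the lowest confidence.
confidences = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.60, 0.55, 0.50]
correct = [True, True, True, True, True, True, True, True, False, False]

curve = [(rate, accuracy_at_drop_rate(confidences, correct, rate))
         for rate in (0.0, 0.1, 0.2)]
for rate, accuracy in curve:
    print(f"drop rate {rate:.1f}: accuracy {accuracy:.3f}")
```

A curve that rises quickly as the drop rate increases indicates that the confidence score ranks the wrong calls near the bottom.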
Fig. 6.
Batch quality plots. (A) The concordance with a 5% drop rate is plotted against the percentage of samples flagged by the SNR score. (B) The concordance with a 5% drop rate is plotted against our batch quality score.
Fig. 7.
ADR plots for Datasets A and B. For the first set, CRLMM version 2 outperforms both Birdseed and CRLMM version 1. For the second set, it outperforms the other two methods roughly at a drop rate of 6%. Also note that the accuracy on the second dataset is lower than on the first, indicating significant variation in the quality of the two sets.
Fig. 8.
For Dataset A, calls were stratified by their associated posterior. For each stratum the observed accuracy was computed by comparing to HapMap gold standard calls. CRLMM version 2 is compared with CRLMM version 1, which is clearly too optimistic. The dashed lines represent homozygotes, dotted lines the heterozygotes and the solid lines the overall accuracies.
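The stratified comparison described here amounts to a calibration check: within each posterior stratum, the observed accuracy should be close to the reported posterior. A minimal sketch follows, with hypothetical stratum edges and toy data rather than the HapMap comparison itself.

```python
def calibration_table(posteriors, correct, edges):
    """Per-stratum (low, high, n_calls, mean_posterior, observed_accuracy).

    Strata are half-open intervals [low, high), except the last one,
    which also includes its upper edge. Empty strata are skipped.
    """
    rows = []
    last_edge = edges[-1]
    for low, high in zip(edges[:-1], edges[1:]):
        members = [i for i, p in enumerate(posteriors)
                   if low <= p < high or (high == last_edge and p == high)]
        if not members:
            continue
        mean_posterior = sum(posteriors[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        rows.append((low, high, len(members), mean_posterior, accuracy))
    return rows

# Toy data: the most confident stratum is wrong more often than it claims.
posteriors = [0.999] * 4 + [0.95] * 4 + [0.7] * 2
correct = [True, True, True, False] + [True] * 4 + [True, False]
edges = [0.5, 0.9, 0.99, 1.0]

rows = calibration_table(posteriors, correct, edges)
for low, high, n_calls, mean_posterior, accuracy in rows:
    print(f"[{low}, {high}]  n={n_calls}  "
          f"posterior={mean_posterior:.3f}  accuracy={accuracy:.3f}")
```

A stratum whose mean posterior is well above its observed accuracy is optimistic in the sense used in the caption.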

