PLoS One. 2012;7(12):e51013.
doi: 10.1371/journal.pone.0051013. Epub 2012 Dec 12.

A measure of the signal-to-noise ratio of microarray samples and studies using gene correlations

David Venet et al. PLoS One. 2012.

Abstract

Background: The quality of gene expression data can vary dramatically from platform to platform, study to study, and sample to sample. As reliable statistical analysis rests on reliable data, determining such quality is of the utmost importance. Quality measures to spot problematic samples exist, but they are platform-specific, and cannot be used to compare studies.

Results: As a proxy for quality, we propose a signal-to-noise ratio for microarray data, the "Signal-to-Noise Applied to Gene Expression Experiments", or SNAGEE. SNAGEE is based on the consistency of gene-gene correlations. We applied SNAGEE to a compendium of 80 large datasets on 37 platforms, for a total of 24,380 samples, and assessed the signal-to-noise ratio of studies and samples. This allowed us to discover serious issues with three studies. We show that signal-to-noise ratios of both studies and samples are linked to the statistical significance of the biological results.
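The idea of scoring a study by the consistency of its gene-gene correlations can be illustrated with a minimal sketch. This is not the SNAGEE implementation: it assumes, purely for illustration, that "consistency" is measured as the Pearson correlation between a study's pairwise gene correlations and those of a reference study. The function names, the two-factor simulation and the noise levels are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gene_gene_correlations(expr):
    """expr[g] holds the values of gene g across samples.
    Returns the correlations of all gene pairs (upper triangle)."""
    g = len(expr)
    return [pearson(expr[i], expr[j]) for i in range(g) for j in range(i + 1, g)]


def consistency_score(expr, reference_corrs):
    """Illustrative score (an assumption, not the published formula):
    agreement between a study's gene-gene correlations and a reference."""
    return pearson(gene_gene_correlations(expr), reference_corrs)


def simulate_study(n_samples, noise_sd, rng):
    """Hypothetical study: six genes driven by two latent factors plus noise."""
    loadings = [(1.0, 0.0), (0.8, 0.2), (0.0, 1.0),
                (0.2, 0.9), (-0.7, 0.5), (0.5, -0.6)]
    factors = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_samples)]
    return [[a * f1 + b * f2 + rng.gauss(0, noise_sd) for f1, f2 in factors]
            for a, b in loadings]


rng = random.Random(0)
reference = gene_gene_correlations(simulate_study(200, 0.1, rng))
print("clean study:", consistency_score(simulate_study(60, 0.1, rng), reference))
print("noisy study:", consistency_score(simulate_study(60, 5.0, rng), reference))
```

In this toy setting the noisy study should score lower than the clean one, because measurement noise pulls its gene-gene correlations toward zero and away from the reference pattern.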

Conclusions: We showed that SNAGEE is an effective way to measure data quality for most types of gene expression studies, and that it often outperforms existing techniques. Furthermore, SNAGEE is platform-independent and does not require raw data files. The SNAGEE R package is available in BioConductor.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Link between statistical significance and study SNR in NCI60 datasets.
Statistical significance is estimated as the fraction of differentially expressed genes (%DEG) between a cancer type and the other cancer types, each panel representing a different cancer type. Higher SNR leads to higher %DEG, as expected. Marker symbols distinguish the studies: two on Affymetrix platforms (GSE5720 and GSE5949), one on the NCI dual-channel platform (GSE2003) and one on the Stanford dual-channel platform (GSE7947). c.c. are correlation coefficients.
Figure 2. Effect of increasing levels of noise on study SNR and statistical significance.
Statistical significance was estimated as the fraction of differentially expressed genes (%DEG). Noise was added to a study of the NCI60 on the U133A platform (GSE5720). SNR and %DEG were calculated on the modified study. Increasing noise led to lower SNR and lower %DEG. c.c. are correlation coefficients, which underestimate the strength of the relations because these are not linear.
Figure 3. Effect of biological outliers on study SNR.
The study SNRs were calculated as a function of the number of biological outliers added to a study consisting of homogeneous samples. (A) Outliers are cell lines; the original study consists of tumor and normal tissues. (B) Outliers are ESR− breast cancers; the original study consists of ESR+ breast cancers.
Figure 4. SNR as a function of the number of samples: effect of disattenuation.
Subsets of a very large dataset (expO) were created by randomly selecting a number of samples. The study SNRs of those subsets are shown as a function of their size, with and without disattenuation (distinguished by marker symbol). The error bars are the 95% confidence intervals determined by resampling. The horizontal line is the SNR of the whole data set.
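The caption does not give the disattenuation formula. As a point of reference only, the classical Spearman correction for attenuation divides an observed correlation by the square root of the product of the two measurements' reliabilities. Whether SNAGEE uses exactly this form is an assumption; the function name and the example numbers below are hypothetical.

```python
import math


def disattenuate(observed_r, reliability_x, reliability_y):
    """Classical Spearman correction for attenuation.

    observed_r: correlation measured between two noisy quantities.
    reliability_x, reliability_y: fraction of each quantity's variance
    that is not measurement error, each in (0, 1].
    """
    for rel in (reliability_x, reliability_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return observed_r / math.sqrt(reliability_x * reliability_y)


# Hypothetical numbers: an observed correlation of 0.40 between two
# quantities that each have reliability 0.64.
corrected = disattenuate(0.40, 0.64, 0.64)
print(corrected)
```

A correction of this kind compensates for estimates that are noisier when they are computed from few samples, which is consistent with the caption's comparison of small subsets against the whole data set.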
Figure 5. Comparison of quality metrics with Bayes classifier posterior probabilities.
Samples were classified by a naive Bayes classifier. Poor-quality samples should fit the classification poorly and so have lower posterior probabilities. The average of the ranks of the posterior probabilities of the lowest-quality samples is shown. Quality of the samples was determined with SNAGEE (solid line), MDQC (dashed line), GNUSE (dotted line), NUSE (grey dotted line) and RLE (dash-dot line). Classification was done on breast cancers according to ER status (A: GSE20194 and B: GSE4922); on muscle disease samples according to gender (C) or to the muscle disease type (D), one class vs. the rest. The solid grey line uses the samples assigned the highest SNR (and so likely to be biological outliers) by SNAGEE.
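The evaluation described in this caption can be sketched as follows. The sketch assumes each sample has one quality score (lower means worse) and one posterior probability for its correct class, and that ranks run from 1 for the lowest posterior. These conventions, the function names and the example values are assumptions for illustration, not taken from the paper.

```python
def rank_values(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


def mean_rank_of_lowest_quality(quality, posterior, k):
    """Average posterior-probability rank of the k lowest-quality samples."""
    ranks = rank_values(posterior)
    worst = sorted(range(len(quality)), key=lambda i: quality[i])[:k]
    return sum(ranks[i] for i in worst) / k


# Hypothetical scores for five samples.
quality = [0.9, 0.2, 0.7, 0.1, 0.8]
posterior = [0.99, 0.55, 0.90, 0.60, 0.95]
print(mean_rank_of_lowest_quality(quality, posterior, 2))  # 1.5
```

Under these conventions, a quality metric that works well gives a low mean rank for small k, since the samples it flags are also the ones the classifier fits worst.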
Figure 6. SNR of samples as biological outliers.
The SNRs of the samples within the complete study (x-axis) are compared with their SNRs after removing all similar samples (y-axis). A–M. Muscle diseases, N. Cell lines vs. normal and diseased tissues.
