Biostatistics. 2012 Jul;13(3):539-52.
doi: 10.1093/biostatistics/kxr034. Epub 2011 Nov 17.

Using control genes to correct for unwanted variation in microarray data

Johann A Gagnon-Bartsch et al. Biostatistics. 2012 Jul.

Abstract

Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as ComBat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.
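The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not spell out the full procedure. The sketch assumes that the factor analysis is a plain SVD of the negative control genes (as the figure captions mention) and substitutes ordinary least squares for the Limma-based testing used in the paper. The function name `ruv2_sketch` and its arguments are hypothetical.

```python
import numpy as np


def ruv2_sketch(Y, X, ctl, k):
    """Illustrative two-step adjustment (hypothetical helper, not the paper's code).

    Y   : (m, n) array of log-expression values, m arrays by n genes
    X   : (m, p) array encoding the factor(s) of interest
    ctl : length-n boolean mask marking negative control genes
    k   : number of unwanted factors to estimate
    Returns (beta_hat, W_hat): (p, n) coefficients for X and (m, k) factors.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    ctl = np.asarray(ctl, dtype=bool)

    # Step 1: factor analysis restricted to the control genes.
    # Assumption: the control genes carry no signal from X, so the leading
    # left singular vectors of this submatrix estimate the unwanted factors.
    U, s, _ = np.linalg.svd(Y[:, ctl], full_matrices=False)
    W_hat = U[:, :k] * s[:k]

    # Step 2: regress every gene on X together with the estimated factors.
    # Assumption: ordinary least squares stands in for Limma here.
    design = np.hstack([X, W_hat])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

    # The first p rows are the adjusted effects of the factor of interest.
    return coef[: X.shape[1]], W_hat
```

Calling `ruv2_sketch(Y, X, ctl, k=0)` reduces to an unadjusted regression on `X` alone, which matches the convention in the figure captions that k=0 corresponds to no adjustment.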

Figures

Fig 1.
Gender study RLE plots at different stages of preprocessing. From left to right: No preprocessing; BG/QN done separately for each platform type; BG/QN followed by a final LS across all chips. Coloring: red—site A, HG-U95A; yellow—site A, HG-U95Av2; black—site B, HG-U95A; gray—site B, HG-U95Av2; and cyan—site C, HG-U95Av2. Note: The scale on the y-axis is different for these RLE plots than for all other RLE plots in this paper.
Fig 2.
Gender study p-value histograms and RLE plots before and after adjustment. Histogram breakpoints are at 0.001, 0.01, 0.05 and 0.1, 0.2, 0.3, etc. Data were fully preprocessed (BG + QN + LS). The factors were computed by SVD on the HK genes. P-values were computed using Limma.
Fig 3.
Comparison of performance of different factor analysis methods in the gender study. The number of X/Y genes discovered is plotted as a function of k. Genes were ranked by p-value; results are shown for the number of X/Y genes ranked in the top 20 (plus), top 40 (triangle), and top 60 (circle). k=0 corresponds to no adjustment. Data were preprocessed (BG + NM + LS). Principal components were computed using the HK genes. P-values were computed using Limma.
Fig 4.
Comparison of results for HK genes and Affymetrix spike-in controls in the gender study. For the RLE plots and p-value histograms, k=10. Factors were computed by SVD. P-values were computed using Limma. Note that there are only 33 spike-in controls, so adjustments with k>33 are undefined for the spike-in case. We truncate
