Nucleic Acids Res. 2004 Apr 27;32(8):2323-35.
doi: 10.1093/nar/gkh544. Print 2004.

Assessment of reliability of microarray data and estimation of signal thresholds using mixture modeling

Musa H Asyali et al. Nucleic Acids Res.

Abstract

The DNA microarray is an important tool for studying gene activity, but the resulting data, consisting of thousands of points, are error-prone. A serious limitation of microarray analysis is the unreliability of data generated from low signal intensities. Such data may produce erroneous gene expression ratios and trigger unnecessary validation or post-analysis follow-up work. In this study, we describe an approach based on normal mixture modeling that determines optimal signal intensity thresholds, identifies reliable measurements of the microarray elements and thereby eliminates false expression ratios. We used univariate and bivariate mixture modeling to segregate the microarray data into two classes, a low signal intensity population and a reliable signal intensity population, and applied Bayesian decision theory to find the optimal signal thresholds. The bivariate approach was more accurate than the univariate approach, and both were superior to a conventional method when validated against a reference set of biological data consisting of true and false gene expression data. Eliminating unreliable signal intensities should improve the quality of microarray data, including the reproducibility and reliability of gene expression ratios.
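
The univariate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes log-transformed intensities, a plain EM fit with a crude median-split initialisation, equal misclassification costs, and synthetic data. The names `fit_two_normals` and `bayes_threshold` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def fit_two_normals(x, n_iter=200):
    """EM fit of pi*N(mu1, s1) + (1 - pi)*N(mu2, s2); component 1 is the low-signal one."""
    x = np.asarray(x, dtype=float)
    # Crude initialisation (an assumption of this sketch): split at the median.
    med = np.median(x)
    low, high = x[x <= med], x[x > med]
    pi, mu1, mu2 = 0.5, low.mean(), high.mean()
    s1, s2 = low.std() + 1e-6, high.std() + 1e-6
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        a = pi * normal_pdf(x, mu1, s1)
        b = (1.0 - pi) * normal_pdf(x, mu2, s2)
        r = a / (a + b)
        # M-step: responsibility-weighted updates of weight, means and SDs.
        pi = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1.0 - r) * x) / np.sum(1.0 - r)
        s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / np.sum(r))
        s2 = np.sqrt(np.sum((1.0 - r) * (x - mu2) ** 2) / np.sum(1.0 - r))
    return pi, mu1, s1, mu2, s2

def bayes_threshold(pi, mu1, s1, mu2, s2, tol=1e-10):
    """Point between the two means where the weighted densities are equal."""
    def g(t):
        return pi * normal_pdf(t, mu1, s1) - (1.0 - pi) * normal_pdf(t, mu2, s2)
    lo, hi = mu1, mu2
    if not (g(lo) > 0.0 > g(hi)):
        raise ValueError("no crossing bracketed between the component means")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid   # still on the low-signal side of the crossing
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic log-intensities: 30% low-signal spots, 70% reliable spots.
rng = np.random.default_rng(0)
log_intensity = np.concatenate([rng.normal(5.0, 0.5, 3000),
                                rng.normal(8.0, 1.0, 7000)])
params = fit_two_normals(log_intensity)
threshold_log = bayes_threshold(*params)
threshold_raw = float(np.exp(threshold_log))  # back to the raw-intensity domain
print(params, threshold_log, threshold_raw)
```

Spots whose log-intensity falls below `threshold_log` would be flagged as unreliable under this rule. Bisection is used here only because it is easy to verify; any root finder would do.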


Figures

Figure 1
Univariate normal mixture histograms and threshold plots. Histograms (60 bins) from three different data sets (A–C) show the empirical densities of the Cy3 green channel (top) and the Cy5 red channel (bottom). The estimated weighted normal density components for each channel, i.e. components 1 and 2, or πN(x, µ1, σ1) and (1 – π)N(x, µ2, σ2), are shown by dashed and dotted lines, respectively. The statistical parameters for each component are given in Table 2. The sums of the weighted densities, fg(x) and fr(x) (solid lines in the top and bottom panels), closely mimic the histograms for the Cy3 and Cy5 channels, respectively. The vertical dashed line marks the optimal signal intensity threshold for each channel in the log domain, as explained in the text. The optimal thresholds can be converted to the raw-data domain by taking their exponentials.
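
As a worked illustration of where such a threshold can come from, the following derivation assumes equal misclassification costs (so the threshold is the point where the two weighted components are equal) and natural logarithms; both are assumptions of this sketch, not statements taken from the paper. The symbols π, µ1, σ1, µ2, σ2 are those of the caption; x* and T are introduced here.

```latex
% Threshold x^* where the weighted component densities are equal:
\pi\, N(x^*;\mu_1,\sigma_1) = (1-\pi)\, N(x^*;\mu_2,\sigma_2)

% Taking logarithms of both sides:
\ln\pi - \ln\sigma_1 - \frac{(x^*-\mu_1)^2}{2\sigma_1^2}
  = \ln(1-\pi) - \ln\sigma_2 - \frac{(x^*-\mu_2)^2}{2\sigma_2^2}

% Collecting terms gives a quadratic a\,x^{*2} + b\,x^* + c = 0 with
a = \frac{1}{2\sigma_2^2} - \frac{1}{2\sigma_1^2}, \qquad
b = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}, \qquad
c = \frac{\mu_2^2}{2\sigma_2^2} - \frac{\mu_1^2}{2\sigma_1^2}
    + \ln\frac{\pi\,\sigma_2}{(1-\pi)\,\sigma_1}

% The relevant root is the one lying between \mu_1 and \mu_2.
% Raw-domain threshold, as stated in the caption:
T = e^{x^*}
```

For example, with the purely illustrative values π = 0.3, µ1 = 5, σ1 = 0.5, µ2 = 8, σ2 = 1, the quadratic becomes −1.5x² + 12x − 18.154 = 0, whose root between the means is x* ≈ 5.97.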
Figure 2
Analysis of the goodness of fit between the Cy3 and Cy5 data channels and their two-component normal mixture models, using qq plots. The qq plots from three different data sets (A–C) show the quantiles of the actual Cy3 green channel (left) or Cy5 red channel (right) data against the quantiles of the corresponding fitted mixture model. The dashed line in each plot corresponds to perfect agreement between the compared samples. If the ‘+’ marks, which indicate matching quantiles, lie roughly on a straight line, the two distributions have the same shape up to a possible shift and rescaling.
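
A qq comparison of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the mixture parameters are taken as known, the data are synthetic, and the model quantiles are obtained by numerically inverting the mixture CDF. `mixture_cdf`, `mixture_quantile` and `qq_pairs` are hypothetical helper names.

```python
import math
import numpy as np

def mixture_cdf(t, pi, mu1, s1, mu2, s2):
    """CDF of pi*N(mu1, s1) + (1 - pi)*N(mu2, s2) at the scalar t."""
    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return pi * phi((t - mu1) / s1) + (1.0 - pi) * phi((t - mu2) / s2)

def mixture_quantile(p, pi, mu1, s1, mu2, s2, tol=1e-9):
    """Invert the mixture CDF by bisection (the CDF is monotone increasing)."""
    lo = min(mu1 - 10.0 * s1, mu2 - 10.0 * s2)
    hi = max(mu1 + 10.0 * s1, mu2 + 10.0 * s2)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, pi, mu1, s1, mu2, s2) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def qq_pairs(x, params, n_q=99):
    """Return (model quantiles, data quantiles) at n_q evenly spaced probabilities."""
    probs = np.arange(1, n_q + 1) / (n_q + 1)
    data_q = np.quantile(x, probs)
    model_q = np.array([mixture_quantile(p, *params) for p in probs])
    return model_q, data_q

# Synthetic check: draw from a known mixture and compare with the same model.
rng = np.random.default_rng(1)
n = 20000
true_params = (0.3, 5.0, 0.5, 8.0, 1.0)  # illustrative values only
from_low = rng.random(n) < true_params[0]
sample = np.where(from_low, rng.normal(5.0, 0.5, n), rng.normal(8.0, 1.0, n))
model_q, data_q = qq_pairs(sample, true_params)
max_dev = float(np.max(np.abs(model_q - data_q)))
print("largest quantile deviation:", max_dev)
```

Plotting `data_q` against `model_q` with a reference line y = x gives the kind of display the caption describes; a small `max_dev` corresponds to points hugging that line.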
Figure 3
Bivariate normal mixture scatter plots, classification plots and validation plots. Cy3 (green channel) and Cy5 (red channel) scatter plots for the three data sets (A–C) are shown. The gray ‘+’ marks denote the individual signal intensity data points, whereas the up and down triangles denote true positives and true negatives (as provided in the reference/validation sets), respectively. The large black dots and the ellipses indicate the centers (means) and the variances of bivariate normal mixture components 1 and 2, drawn with dashed and dotted lines, respectively (Table 2). The optimal thresholds for the Cy3 and Cy5 channels obtained using univariate analysis are shown by the vertical and horizontal solid lines, respectively. The dashed vertical and horizontal lines show the thresholds obtained using Fielden’s method for Cy3 and Cy5, respectively. In the univariate analysis, the region of unreliable signal observations is the lower-left rectangular corner of the two-dimensional space, bounded by the thresholds. The thick ellipse shows the optimal decision boundary obtained with the bivariate mixture modeling approach; the region of unreliable observations is the interior of this ellipse.
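
The two decision regions described in the caption can be contrasted in a short sketch. It assumes equal misclassification costs and uses made-up component parameters and thresholds, not the values of Table 2; `log_weighted_density`, `unreliable_bivariate` and `unreliable_univariate` are hypothetical names. A point is called unreliable by the bivariate rule when the weighted density of the low-signal component exceeds that of the reliable component, which yields an elliptical region when the low-signal component is the tighter of the two.

```python
import numpy as np

def log_weighted_density(x, weight, mean, cov):
    """log(weight * N(x; mean, cov)) for each row of x (points in 2-D)."""
    d = x - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", d, inv, d)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return np.log(weight) - 0.5 * (maha + logdet + x.shape[1] * np.log(2.0 * np.pi))

def unreliable_bivariate(x, pi, mean1, cov1, mean2, cov2):
    """True where low-signal component 1 has the larger weighted density."""
    return (log_weighted_density(x, pi, mean1, cov1)
            > log_weighted_density(x, 1.0 - pi, mean2, cov2))

def unreliable_univariate(x, thr_green, thr_red):
    """Lower-left rectangle: both channels below their univariate thresholds."""
    return (x[:, 0] < thr_green) & (x[:, 1] < thr_red)

# Illustrative parameters only (log-domain Cy3, Cy5); not taken from the paper.
pi = 0.3
mean1, cov1 = np.array([5.0, 5.0]), np.array([[0.25, 0.10], [0.10, 0.25]])
mean2, cov2 = np.array([8.0, 8.0]), np.array([[1.00, 0.60], [0.60, 1.00]])

points = np.array([[5.0, 5.0],   # near the low-signal centre
                   [8.0, 8.0],   # near the reliable centre
                   [5.0, 9.0]])  # low Cy3, high Cy5
flag_bi = unreliable_bivariate(points, pi, mean1, cov1, mean2, cov2)
flag_uni = unreliable_univariate(points, thr_green=6.0, thr_red=6.0)
print(flag_bi, flag_uni)
```

Working with log densities avoids underflow for points far from both centres. The rectangle and the ellipse agree on clear-cut points like these and differ near the boundaries, where the bivariate rule takes the correlation between channels into account.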

