Assessment of reliability of microarray data and estimation of signal thresholds using mixture modeling

Musa H Asyali¹, Mohamed M Shoukri, Omer Demirkaya, Khalid S A Khabar

Affiliation

¹ Department of Biostatistics, Epidemiology, and Scientific Computing, King Faisal Specialist Hospital and Research Center, PO Box 3354, MBC-03, Riyadh, 11211, Saudi Arabia. asyali@kfshrc.edu.sa

PMID: 15113873
PMCID: PMC419441
DOI: 10.1093/nar/gkh544

Nucleic Acids Res. 2004 Apr 27;32(8):2323-35. doi: 10.1093/nar/gkh544. Print 2004.

Abstract

DNA microarray is an important tool for the study of gene activities but the resultant data consisting of thousands of points are error-prone. A serious limitation in microarray analysis is the unreliability of the data generated from low signal intensities. Such data may produce erroneous gene expression ratios and cause unnecessary validation or post-analysis follow-up tasks. In this study, we describe an approach based on normal mixture modeling for determining optimal signal intensity thresholds to identify reliable measurements of the microarray elements and subsequently eliminate false expression ratios. We used univariate and bivariate mixture modeling to segregate the microarray data into two classes, low signal intensity and reliable signal intensity populations, and applied Bayesian decision theory to find the optimal signal thresholds. The bivariate analysis approach was found to be more accurate than the univariate approach; both approaches were superior to a conventional method when validated against a reference set of biological data that consisted of true and false gene expression data. Elimination of unreliable signal intensities in microarray data should contribute to the quality of microarray data including reproducibility and reliability of gene expression ratios.
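The univariate procedure summarized above can be sketched in a few lines. This is only an illustration, not the authors' implementation: it assumes the mixture is fitted by expectation-maximization (EM), that intensities are log-transformed first, and that the Bayes rule uses equal misclassification costs, so the threshold is the point between the component means where the two weighted densities are equal. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def fit_two_component_mixture(x, n_iter=300):
    """EM fit of pi*N(mu1, s1) + (1 - pi)*N(mu2, s2); component 1 is the low one."""
    x = np.asarray(x, dtype=float)
    # Crude initialisation (an assumption): split the data at the median.
    med = np.median(x)
    lo, hi = x[x <= med], x[x > med]
    pi, mu1, mu2 = 0.5, lo.mean(), hi.mean()
    s1, s2 = lo.std() + 1e-6, hi.std() + 1e-6
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        w1 = pi * normal_pdf(x, mu1, s1)
        w2 = (1.0 - pi) * normal_pdf(x, mu2, s2)
        r = w1 / (w1 + w2 + 1e-300)
        # M-step: re-estimate the weight, means and standard deviations.
        pi = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1.0 - r) * x) / np.sum(1.0 - r)
        s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / np.sum(r))
        s2 = np.sqrt(np.sum((1.0 - r) * (x - mu2) ** 2) / np.sum(1.0 - r))
    return pi, mu1, s1, mu2, s2

def bayes_threshold(pi, mu1, s1, mu2, s2, n_steps=100):
    """Bisection for the point between the means where the weighted densities cross."""
    def log_ratio(t):
        log_w1 = np.log(pi) - np.log(s1) - 0.5 * ((t - mu1) / s1) ** 2
        log_w2 = np.log(1.0 - pi) - np.log(s2) - 0.5 * ((t - mu2) / s2) ** 2
        return log_w1 - log_w2
    a, b = mu1, mu2
    if log_ratio(a) <= 0 or log_ratio(b) >= 0:
        raise ValueError("weighted densities do not cross between the component means")
    for _ in range(n_steps):
        m = 0.5 * (a + b)
        if log_ratio(m) > 0:
            a = m
        else:
            b = m
    return 0.5 * (a + b)

# Synthetic log-intensities standing in for one channel (not data from the paper).
rng = np.random.default_rng(0)
log_x = np.concatenate([rng.normal(5.0, 0.5, 3000), rng.normal(8.0, 1.0, 7000)])
params = fit_two_component_mixture(log_x)
t_log = bayes_threshold(*params)
t_raw = np.exp(t_log)  # back to the raw-intensity domain
print(f"log-domain threshold: {t_log:.3f}, raw-domain threshold: {t_raw:.1f}")
```

Observations with log-intensity below `t_log` would be flagged as unreliable under these assumptions. A different cost structure in the decision rule would shift the threshold.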

Figures

**Figure 1**
Univariate normal mixture histograms and threshold plots. Histograms (60 bins) of three different data sets (A–C) are shown with the estimated normal density components for the Cy3 green (top) and Cy5 red (bottom) channels. The estimated weighted normal density components for each channel, i.e. components 1 and 2, or πN(x, µ₁, σ₁) and (1 – π)N(x, µ₂, σ₂), are shown by dashed and dotted lines, respectively. The statistical parameters for each component are shown in Table 2. The sum of the weighted densities, f_g(x) and f_r(x) (solid lines in the top and bottom panels), closely mimics the density or histogram for the Cy3 and Cy5 channels, respectively. The vertical dashed line shows the optimal signal intensity threshold for each channel in the log-domain, as explained in the text. The optimal thresholds can be converted to the raw-data domain by taking their exponentials.
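Written out, the quantities in this caption take the following form. The threshold condition is an assumption on our part (a minimum-error Bayes rule with equal misclassification costs); the symbols t* and T are introduced here for illustration and do not appear in the caption.

```latex
% Two-component mixture density for one channel (green shown; red is analogous)
f_g(x) = \pi\, N(x, \mu_1, \sigma_1) + (1 - \pi)\, N(x, \mu_2, \sigma_2),
\qquad
N(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

% Assumed threshold condition: weighted component densities are equal at t^*
\pi\, N(t^*, \mu_1, \sigma_1) = (1 - \pi)\, N(t^*, \mu_2, \sigma_2)

% Taking logarithms gives a quadratic equation in t^*
\ln \pi - \ln \sigma_1 - \frac{(t^* - \mu_1)^2}{2\sigma_1^2}
  = \ln (1 - \pi) - \ln \sigma_2 - \frac{(t^* - \mu_2)^2}{2\sigma_2^2}

% Raw-domain threshold, assuming natural logarithms were used
T = e^{t^*}
```

Of the two roots of the quadratic, the relevant one is the root lying between µ₁ and µ₂.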

**Figure 2**
Analysis of the goodness of the fit between the Cy3 and Cy5 data channels and their two-component normal mixture model, using qq plots. The qq plots from three different datasets (A–C) show quantiles of the actual Cy3 green channel (left) or Cy5 red channel (right) data against the mixture model of each. The dashed line in each plot corresponds to the case of perfect agreement between the compared samples. If the ‘+’ marks, indicating the location of the matching quantiles, roughly lie on the straight line, then the distributions of the samples have the same shape except for a possible shift and rescaling.
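One way to produce the quantile pairs behind such a plot is sketched below. It assumes a two-sample comparison in which the model quantiles are estimated from a random sample drawn from the fitted mixture; the caption does not say how the model quantiles were obtained, so this is an illustration only. The function names and parameter values are hypothetical.

```python
import numpy as np

def sample_mixture(pi, mu1, s1, mu2, s2, n, rng):
    """Draw n values from pi*N(mu1, s1) + (1 - pi)*N(mu2, s2)."""
    from_first = rng.random(n) < pi
    return np.where(from_first, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

def qq_pairs(data, model_sample, n_quantiles=99):
    """Matching quantiles of the data and of a sample from the model."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(data, probs), np.quantile(model_sample, probs)

# Hypothetical fitted parameters and synthetic "observed" log-intensities.
rng = np.random.default_rng(1)
fitted = dict(pi=0.3, mu1=5.0, s1=0.5, mu2=8.0, s2=1.0)
observed = sample_mixture(n=5000, rng=rng, **fitted)
simulated = sample_mixture(n=5000, rng=rng, **fitted)

q_data, q_model = qq_pairs(observed, simulated)
# Points (q_model[i], q_data[i]) close to a straight line indicate a good fit.
print("correlation of matched quantiles:", np.corrcoef(q_data, q_model)[0, 1])
```

Plotting `q_data` against `q_model` with a reference line gives the kind of display the caption describes.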

**Figure 3**
Bivariate normal mixture scatter plots, classification plots and validation plots. Cy3 (green channel) and Cy5 (red channel) scatter plots for the three data sets (A–C) are shown. The gray ‘+’ marks denote the individual signal intensity data points, whereas the up and down triangles denote true positives and true negatives (as provided in the reference/validation sets), respectively. The large black dots and the ellipses indicate the centers (means) and the variances of bivariate normal mixture components 1 and 2, drawn with dashed and dotted lines, respectively (Table 2). The optimal thresholds for the Cy3 and Cy5 channels obtained using univariate analysis are shown by the vertical and horizontal solid lines, respectively. The dashed vertical and horizontal lines show the thresholds obtained using Fielden’s method for Cy3 and Cy5, respectively. In the univariate analysis, the region of unreliable signal observations is the lower-left rectangular corner of the two-dimensional space, bounded by the thresholds. The thick ellipse shows the optimal decision boundary obtained by using the bivariate mixture modeling approach; the region of unreliable observations is the interior of this ellipse.
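The bivariate decision rule can be illustrated as follows. This sketch assumes that a spot is called unreliable when the weighted density of the low-intensity component exceeds that of the reliable component (equal misclassification costs); the resulting boundary is a quadratic curve, which is an ellipse when the low-intensity component has the tighter covariance. All parameter values are hypothetical and are not taken from Table 2.

```python
import numpy as np

def log_weighted_density(points, weight, mean, cov):
    """log(weight * N(points; mean, cov)) for an array of shape (n, d)."""
    points = np.atleast_2d(points).astype(float)
    diff = points - np.asarray(mean, dtype=float)
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of each row from the component mean.
    maha = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    dim = points.shape[1]
    return np.log(weight) - 0.5 * (maha + logdet + dim * np.log(2.0 * np.pi))

def is_unreliable(points, pi, mean1, cov1, mean2, cov2):
    """True where component 1 (low intensity) has the larger weighted density."""
    low = log_weighted_density(points, pi, mean1, cov1)
    high = log_weighted_density(points, 1.0 - pi, mean2, cov2)
    return low > high

# Hypothetical mixture parameters in the (log Cy3, log Cy5) plane.
pi = 0.3
mean1, cov1 = [5.0, 5.0], np.array([[0.3, 0.1], [0.1, 0.3]])
mean2, cov2 = [8.0, 8.0], np.array([[1.5, 1.0], [1.0, 1.5]])

spots = np.array([[5.0, 5.0],   # low in both channels
                  [8.0, 8.0],   # high in both channels
                  [5.0, 9.0]])  # low Cy3 but high Cy5
print(is_unreliable(spots, pi, mean1, cov1, mean2, cov2))
```

Unlike a pair of per-channel thresholds, this rule uses both channels jointly, so a spot that is weak in one channel but strong in the other can still be retained.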
