An approach for clustering gene expression data with error information

Brian Tjaden¹

PMID: 16409635
PMCID: PMC1360687
DOI: 10.1186/1471-2105-7-17

Brian Tjaden. BMC Bioinformatics. 2006 Jan 12;7:17. doi: 10.1186/1471-2105-7-17.

Affiliation

¹ Computer Science Department, Wellesley College, Wellesley, MA 02481, USA. btjaden@wellesley.edu

Abstract

Background: Clustering of gene expression patterns is a well-studied technique for elucidating trends across large numbers of transcripts and for identifying likely co-regulated genes. Even the best clustering methods, however, are unlikely to provide meaningful results if too much of the data is unreliable. With the maturation of microarray technology, a wealth of research on statistical analysis of gene expression data has encouraged researchers to consider error and uncertainty in their microarray experiments, so that experiments are being performed increasingly with repeat spots per gene per chip and with repeat experiments. One of the challenges is to incorporate the measurement error information into downstream analyses of gene expression data, such as traditional clustering techniques.

Results: In this study, a clustering approach is presented which incorporates both gene expression values and error information about the expression measurements. Using repeat expression measurements, the error of each gene expression measurement in each experiment condition is estimated, and this measurement error information is incorporated directly into the clustering algorithm. The algorithm, CORE (Clustering Of Repeat Expression data), is presented and its performance is validated using statistical measures. By using error information about gene expression measurements, the clustering approach is less sensitive to noise in the underlying data and is able to achieve more accurate clusterings. Results are described for both synthetic expression data and real gene expression data from Escherichia coli and Saccharomyces cerevisiae.
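As a rough illustration of estimating per-condition error from repeat measurements, the sketch below computes a mean and a standard error of the mean for each condition of one gene. The abstract does not state which estimator CORE uses, so the standard error of the mean is an assumption here, and the function name and replicate values are hypothetical.

```python
import math
import statistics


def mean_and_standard_error(replicates):
    """Return (mean, standard error of the mean) for one gene in one condition.

    Assumption: error is summarized as sample standard deviation / sqrt(n).
    """
    n = len(replicates)
    if n < 2:
        raise ValueError("need at least two replicate measurements")
    mean = statistics.fmean(replicates)
    sem = statistics.stdev(replicates) / math.sqrt(n)
    return mean, sem


# Hypothetical replicate measurements for one gene across three conditions.
gene_replicates = [
    [1.0, 1.2, 0.8, 1.0],    # modest spread between repeats
    [2.0, 2.0, 2.0, 2.0],    # perfectly reproducible
    [-0.5, 0.5, -1.5, 1.5],  # noisy condition
]

# One (expression value, error) pair per condition.
profile = [mean_and_standard_error(r) for r in gene_replicates]
```

Each gene then carries an expression value and an error estimate per condition, which is the kind of input an error-aware clustering method needs.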

Conclusion: The additional information provided by replicate gene expression measurements is a valuable asset in effective clustering. Gene expression profiles with high errors, as determined from repeat measurements, may be unreliable and may associate with different clusters, whereas gene expression profiles with low errors can be clustered with higher specificity. Results indicate that including error information from repeat gene expression measurements can lead to significant improvements in clustering accuracy.

Figures

**Figure 1**
**Scatter plots of gene expression profiles.** (a) A scatter plot of the expression profiles for 2 genes (with 6 components) with standard errors indicated. (b) A scatter plot of the expression profiles for 2 genes (with 6 components) identical to the expression profiles in (a), but with higher standard errors. The gene pairs in (a) and (b) have identical Euclidean distances, identical correlation coefficients, and identical error-weighted similarity. However, in the CORE clustering algorithm, genes whose expression measurements have higher error (g₃ or g₄) provide less information about which cluster the gene belongs to, and the gene makes less of a contribution toward the calculation of clustering parameters.

**Figure 3**
**Transformed gene expression profiles.** Four gene expression profiles across six experiments are depicted. The CORE algorithm uses two parameters, β_i and γ_i, for each gene to reflect linear transformations of a gene's expression profile. The parameter β represents multiplicative scaling and the parameter γ represents additive translation. In the figure, the expression profile for g_x is a translated version (β = 1, γ = -1) of the profile for g_w, and the profile for g_y is a scaled version (β = 2, γ = 0) of that for g_w. Thus, the three expression profiles, g_w, g_x and g_y, have the same shape, and all three are perfectly correlated, i.e., have a distance of zero from each other in the CORE algorithm. In contrast, the profiles for g_w and g_z have different shapes but are the closest in terms of Euclidean distance.
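The relationship described in this caption can be sketched numerically. The example below uses 1 minus the Pearson correlation as a stand-in for a shape-based distance; this is an assumption for illustration, not the CORE distance itself, and the profile values are hypothetical rather than read from the figure.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def shape_distance(a, b):
    """Stand-in shape distance: zero for perfectly correlated profiles."""
    return 1.0 - pearson(a, b)


# Hypothetical profiles across six experiments.
g_w = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
g_x = [1.0 * v - 1.0 for v in g_w]    # translated: beta = 1, gamma = -1
g_y = [2.0 * v + 0.0 for v in g_w]    # scaled: beta = 2, gamma = 0
g_z = [1.5, 2.5, 2.5, 4.5, 4.5, 5.5]  # different shape, but close to g_w

d_shape_wx = shape_distance(g_w, g_x)
d_shape_wy = shape_distance(g_w, g_y)
d_shape_wz = shape_distance(g_w, g_z)
d_euclid_wx = math.dist(g_w, g_x)
d_euclid_wz = math.dist(g_w, g_z)
```

With these values, g_x and g_y sit at shape distance zero from g_w, while g_z is nearer to g_w in Euclidean terms despite having a different shape.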

**Figure 4**
**CORE algorithm.** The figure provides a description of the CORE algorithm.

**Figure 5**
**ROC curve for synthetic data.** The ROC (receiver operating characteristic) curves show the tradeoffs between sensitivity and specificity (i.e., 1.0 - false positive rate) as the number of clusters is varied with two synthetic data sets. At each point along the curve, the sensitivity and specificity values are calculated as an average over 100 trials of generating synthetic data with a given number of classes and clustering the data with the same number of clusters as classes. (A) For the normally distributed synthetic expression data, the number of clusters is varied between 20 and 200. (B) For the periodic time series synthetic expression data, the number of clusters is varied between 2 and 20. The top curve (CORE) uses estimated standard error information from repeat measurements whereas the bottom curve uses a uniform error model, as described in the text.

**Figure 6**
**Adjusted Rand index for synthetic data and real expression data.** Each curve reflects the average adjusted Rand index R_a of clustering quality as the number of clusters is varied. Each data point on a curve is an average over 100 trials of generating and clustering data. Four clustering variations are considered for each data set: the CORE error model, a uniform error model, a Euclidean distance between pairs of expression profiles, and the error-weighted similarity measure between pairs of expression profiles. (A) The figure depicts the results for normally distributed synthetic data generated from 50 classes. (B) The figure depicts the results from periodic time series synthetic data generated from 4 classes. (C) The figure shows the results of clustering 904 *E. coli* genes belonging to 275 multi-gene operons based on expression data from 55 experiments. (D) Based on expression data from 20 experimental conditions, the figure shows the results of clustering 205 yeast genes which have each been annotated with one of four functional classifications.
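For readers unfamiliar with the adjusted Rand index, a minimal sketch of the standard pair-counting formulation follows. It is assumed, not confirmed by this page, that R_a here refers to that standard definition; the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference partition and a clustering."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two items")

    # Pairs of items placed together in both partitions (contingency cells).
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    # Pairs placed together within each partition separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)


# Toy example: the same grouping under different label names scores 1.0.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every true class evenly scores below zero.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions and close to 0 for agreement at chance level, which is why it suits comparisons across different numbers of clusters.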

**Figure 7**
**Gap statistic for estimating the number of clusters.** The figure shows the results of calculating the gap statistic for different numbers of clusters using the CORE algorithm for each of the four data sets. Each point along the curves represents a comparison between the within-cluster dispersion of the clustered data set, as determined by the CORE algorithm, and the average within-cluster dispersion of B = 100 samples of a clustered complete reference distribution, as determined by CORE. Generation of the reference distribution is described in the text. The error bars in the figure reflect the standard deviation of the gap statistic across the B reference distribution samples. The recommended value for the parameter k is the smallest number of clusters, i, such that the gap statistic at i is greater than or equal to the gap statistic at i+1, less the estimated error of the gap statistic at i+1. (A) For normally distributed synthetic data generated from 50 classes, the gap statistic suggests a parameter value of k = 60. (B) For periodic time series synthetic data generated from 4 classes, the gap statistic suggests a parameter value of k = 4. (C) For 904 *E. coli* genes belonging to 275 multi-gene operons, the gap statistic suggests a parameter value of k = 260. (D) For 205 yeast genes which have each been annotated with one of four functional classifications, the gap statistic suggests a parameter value of k = 5. For each of the four data sets, the gap statistic suggests a value for the parameter k which is close to the number of true classes in the data set.
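The selection rule stated in this caption can be sketched as a short function: pick the smallest candidate i whose gap statistic is at least the gap at the next candidate minus that candidate's estimated error. The function name and the numeric values are hypothetical, and the sketch assumes the candidates are listed in increasing order so that adjacent entries play the roles of i and i+1.

```python
def choose_num_clusters(candidates, gaps, errors):
    """Return the smallest candidate k satisfying the gap-statistic rule.

    candidates: numbers of clusters, in increasing order.
    gaps:       gap statistic for each candidate.
    errors:     estimated error of the gap statistic for each candidate.
    Returns None if no candidate satisfies the rule.
    """
    if not (len(candidates) == len(gaps) == len(errors)):
        raise ValueError("inputs must have the same length")
    for idx in range(len(candidates) - 1):
        # Rule: gap(i) >= gap(i+1) - error(i+1)
        if gaps[idx] >= gaps[idx + 1] - errors[idx + 1]:
            return candidates[idx]
    return None


# Hypothetical gap curve that rises steeply and then flattens.
candidates = [2, 3, 4, 5, 6]
gaps = [0.10, 0.25, 0.60, 0.62, 0.61]
errors = [0.02, 0.02, 0.03, 0.03, 0.03]

suggested_k = choose_num_clusters(candidates, gaps, errors)
```

Subtracting the error term keeps the rule from chasing small increases in the gap that lie within the noise of the reference samples.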
