2006 Jan 12;7:17.
doi: 10.1186/1471-2105-7-17.

An approach for clustering gene expression data with error information

Brian Tjaden. BMC Bioinformatics.

Abstract

Background: Clustering of gene expression patterns is a well-studied technique for elucidating trends across large numbers of transcripts and for identifying likely co-regulated genes. Even the best clustering methods, however, are unlikely to provide meaningful results if too much of the data is unreliable. With the maturation of microarray technology, a wealth of research on statistical analysis of gene expression data has encouraged researchers to consider error and uncertainty in their microarray experiments, so that experiments are being performed increasingly with repeat spots per gene per chip and with repeat experiments. One of the challenges is to incorporate the measurement error information into downstream analyses of gene expression data, such as traditional clustering techniques.

Results: In this study, a clustering approach is presented which incorporates both gene expression values and error information about the expression measurements. Using repeat expression measurements, the error of each gene expression measurement in each experiment condition is estimated, and this measurement error information is incorporated directly into the clustering algorithm. The algorithm, CORE (Clustering Of Repeat Expression data), is presented and its performance is validated using statistical measures. By using error information about gene expression measurements, the clustering approach is less sensitive to noise in the underlying data and is able to achieve more accurate clusterings. Results are described for synthetic expression data as well as for real gene expression data from Escherichia coli and Saccharomyces cerevisiae.
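As a minimal sketch of the error-estimation step, the following computes a mean and a standard error of the mean for each condition from repeat measurements. The abstract does not state which estimator CORE uses, so the standard error of the mean, the function names, and the example values are assumptions made for illustration.

```python
import math
import statistics


def standard_error(repeats):
    """Standard error of the mean of repeat measurements (needs >= 2 values).

    Assumption: the abstract does not specify CORE's estimator; this is the
    usual sample standard deviation divided by sqrt(n).
    """
    if len(repeats) < 2:
        raise ValueError("need at least two repeat measurements")
    return statistics.stdev(repeats) / math.sqrt(len(repeats))


def summarize_gene(repeats_by_condition):
    """Return one (mean, standard error) pair per experimental condition."""
    return [(statistics.mean(r), standard_error(r)) for r in repeats_by_condition]


# Hypothetical gene measured in two conditions with three repeat spots each.
gene = [[100.0, 110.0, 90.0], [300.0, 300.0, 300.0]]
profile = summarize_gene(gene)
```

Each entry of `profile` pairs an expression value with its estimated error, which is the kind of input an error-aware clustering method would consume.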

Conclusion: The additional information provided by replicate gene expression measurements is a valuable asset in effective clustering. Gene expression profiles with high errors, as determined from repeat measurements, may be unreliable and may associate with different clusters, whereas gene expression profiles with low errors can be clustered with higher specificity. Results indicate that including error information from repeat gene expression measurements can lead to significant improvements in clustering accuracy.

Figures

Figure 1
Scatter plots of gene expression profiles. (a) A scatter plot of the expression profiles for 2 genes (with 6 components) with standard errors indicated, (b) A scatter plot of the expression profiles for 2 genes (with 6 components) identical to the expression profiles in (a), but with higher standard errors. The gene pairs in (a) and (b) have identical Euclidean distances, identical correlation coefficients, and identical error-weighted similarity. However, in the CORE clustering algorithm, genes whose expression measurements have higher error (g3 or g4) provide less information about which cluster the gene belongs to, and the gene makes less of a contribution toward the calculation of clustering parameters.
Figure 2
Error-weighted similarity examples. Panels (A) and (B) depict examples in which error-weighted similarity (Eq. (1) in the text) is problematic as a correlation measure. (A) A scatter plot of the expression profiles for two genes g5 and g6 (with 3 components), g5 = (100, 300, 400) and g6 = (100, 300, 400). The plotted expression profiles fall exactly on a straight line; however, the error-weighted similarity $\tilde{\rho}_{5,6}$ for these genes is only 0.79 when $\sigma_5$ = (10, 15, 50) and $\sigma_6$ = (30, 50, 15). (B) A scatter plot of the expression profiles for two genes g7 and g8 (with 3 components), g7 = (100, 300, 400) and g8 = (100, 400, 300). The plotted expression profiles do not fall on a straight line; however, the error-weighted similarity $\tilde{\rho}_{7,8}$ for these genes is 1.0 when $\sigma_7$ = (20, 20, 50) and $\sigma_8$ = (20, 50, 20).
Figure 3
Transformed gene expression profiles. Four gene expression profiles across six experiments are depicted. The CORE algorithm uses two parameters, βi and γi, for each gene to reflect linear transformations of a gene's expression profile. The parameter β represents multiplicative scaling and the parameter γ represents additive translation. In the figure, the expression profile for gx is a translated version (β = 1, γ = -1) of the profile for gw, and the profile for gy is a scaled version (β = 2, γ = 0) of that for gw. Thus, the three expression profiles, gw, gx and gy, have the same shape, and all three are perfectly correlated, i.e., have a distance of zero from each other in the CORE algorithm. In contrast, the profiles for gw and gz have different shapes but are the closest in terms of Euclidean distance.
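The relationship among the four profiles can be illustrated with a short sketch. It assumes that a transformed profile is obtained as β·g + γ and uses the Pearson correlation as a stand-in for shape similarity; the caption does not give CORE's actual distance function, and the profile values below are hypothetical.

```python
import math


def transform(profile, beta, gamma):
    """Apply multiplicative scaling (beta) and additive translation (gamma)."""
    return [beta * x + gamma for x in profile]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical profiles across six experiments (not the values in the figure).
gw = [1.0, 2.0, 4.0, 3.0, 5.0, 2.0]
gx = transform(gw, beta=1, gamma=-1)   # translated copy of gw
gy = transform(gw, beta=2, gamma=0)    # scaled copy of gw
gz = [1.5, 1.5, 3.5, 3.5, 4.5, 2.5]    # different shape, numerically close to gw

for name, g in (("gx", gx), ("gy", gy), ("gz", gz)):
    print(name, round(pearson(gw, g), 3), round(euclidean(gw, g), 3))
```

With these values, gx and gy correlate perfectly with gw even though gz lies nearest to gw in Euclidean distance, mirroring the contrast the caption describes.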
Figure 4
CORE algorithm. The figure provides a description of the CORE algorithm.
Figure 5
ROC curve for synthetic data. The ROC (receiver operating characteristic) curves show the tradeoffs between sensitivity and specificity (i.e., 1.0 - false positive rate) as the number of clusters is varied with two synthetic data sets. At each point along the curve, the sensitivity and specificity values are calculated as an average over 100 trials of generating synthetic data with a given number of classes and clustering the data with the same number of clusters as classes. (A) For the normally distributed synthetic expression data, the number of clusters is varied between 20 and 200. (B) For the periodic time series synthetic expression data, the number of clusters is varied between 2 and 20. The top curve (CORE) uses estimated standard error information from repeat measurements whereas the bottom curve uses a uniform error model, as described in the text.
Figure 6
Adjusted Rand index for synthetic data and real expression data. Each curve reflects the average adjusted Rand index Ra of clustering quality as the number of clusters is varied. Each data point on a curve is an average over 100 trials of generating and clustering data. Four clustering variations are considered for each data set: the CORE error model, a uniform error model, a Euclidean distance between pairs of expression profiles, and the error-weighted similarity measure between pairs of expression profiles. (A) The figure depicts the results for normally distributed synthetic data generated from 50 classes. (B) The figure depicts the results from periodic time series synthetic data generated from 4 classes. (C) The figure shows the results of clustering 904 E. coli genes belonging to 275 multi-gene operons based on expression data from 55 experiments. (D) Based on expression data from 20 experimental conditions, the figure shows the results of clustering 205 yeast genes which have each been annotated with one of four functional classifications.
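For reference, the adjusted Rand index can be computed from the contingency counts of two labelings. The sketch below follows the standard Hubert and Arabie formulation; the function name and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference labeling and a clustering."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (>= 2 items)")
    # Pairs of items placed together in both labelings, in the reference
    # labeling only, and in the predicted clustering only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy example: the same partition under different label names scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index is 1 for identical partitions, has an expected value of 0 for random labelings, and can be negative when agreement is worse than chance.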
Figure 7
Gap statistic for estimating the number of clusters. The figure shows the results of calculating the gap statistic for different numbers of clusters using the CORE algorithm for each of the four data sets. Each point along the curves represents a comparison between the within-cluster dispersion of the clustered data set, as determined by the CORE algorithm, and the average within-cluster dispersion of B = 100 samples of a clustered complete reference distribution, as determined by CORE. Generation of the reference distribution is described in the text. The error bars in the figure reflect the standard deviation of the gap statistic across the B reference distribution samples. The recommended value for the parameter k is the smallest number of clusters, i, such that the gap statistic at i is greater than or equal to the gap statistic at i+1, less the estimated error of the gap statistic at i+1. (A) For normally distributed synthetic data generated from 50 classes, the gap statistic suggests a parameter value of k = 60. (B) For periodic time series synthetic data generated from 4 classes, the gap statistic suggests a parameter value of k = 4. (C) For 904 E. coli genes belonging to 275 multi-gene operons, the gap statistic suggests a parameter value of k = 260. (D) For 205 yeast genes which have each been annotated with one of four functional classifications, the gap statistic suggests a parameter value of k = 5. For each of the four data sets, the gap statistic suggests a value for the parameter k which is close to the number of true classes in the data set.
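The selection rule stated in the caption can be sketched as follows. The function name and the gap and error values are hypothetical; only the rule itself (the smallest i with gap(i) ≥ gap(i+1) minus the error at i+1) comes from the caption.

```python
def choose_num_clusters(ks, gaps, errors):
    """Return the smallest k whose gap is >= the next gap minus the next error.

    ks, gaps and errors are parallel sequences ordered by increasing number
    of clusters. Returns None if no candidate satisfies the rule.
    """
    for idx in range(len(ks) - 1):
        if gaps[idx] >= gaps[idx + 1] - errors[idx + 1]:
            return ks[idx]
    return None


# Hypothetical gap statistics and error estimates for k = 2..6.
ks = [2, 3, 4, 5, 6]
gaps = [0.10, 0.30, 0.60, 0.62, 0.58]
errors = [0.02, 0.02, 0.03, 0.03, 0.03]
recommended_k = choose_num_clusters(ks, gaps, errors)
```

With these values the rule stops at k = 4, the first point at which adding another cluster no longer raises the gap statistic by more than its estimated error.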