BMC Bioinformatics. 2005 May 12;6:115.
doi: 10.1186/1471-2105-6-115.

Visualization methods for statistical analysis of microarray clusters

Matthew A Hibbs et al. BMC Bioinformatics. 2005.

Abstract

Background: The most common method of identifying groups of functionally related genes in microarray data is to apply a clustering algorithm. However, it is impossible to determine which clustering algorithm is most appropriate to apply, and it is difficult to verify the results of any algorithm because there is no gold standard. Appropriate data visualization tools can aid this analysis process, but existing visualization methods do not specifically address this issue.

Results: We present several visualization techniques that incorporate meaningful, noise-robust statistics for analyzing the results of clustering algorithms on microarray data. These include a rank-based visualization method that is more robust to noise, a difference display method to aid assessment of cluster quality and detection of outliers, and a projection of high-dimensional data into a three-dimensional space to examine relationships between clusters. Our methods are interactive and dynamically linked together for comprehensive analysis. Further, our approach applies to both protein and gene expression microarrays, and our architecture is scalable for use on both desktop/laptop screens and large-scale display devices. This methodology is implemented in GeneVAnD (Genomic Visual ANalysis of Datasets) and is available at http://function.princeton.edu/GeneVAnD.

Conclusion: Incorporating relevant statistical information into data visualizations is key for analysis of large biological datasets, particularly because of high levels of noise and the lack of a gold standard for comparisons. We developed several new visualization techniques and demonstrated their effectiveness for evaluating cluster quality and relationships between clusters.

Figures

Figure 1
Example of noise in microarray visualization. Four views of the same data displayed in different ways. (a-c) show a traditional display using different cutoff values. Note that in (a) variation in the highly over- and under-expressed regions cannot be seen due to saturation, while in (c) variation in the highly expressed regions can be seen, but variation near zero cannot. (d) uses our rank-based visualization method. In this rank-based view, the experiment with the lowest expression for each gene is colored black, the experiment with the highest expression is colored white, and the other experiments are interpolated between them in grayscale. Using this method, users can see the overall pattern of variation in the data, which makes it clear that heterogeneity in the traditional view is mostly the result of noise. (Data from [26])
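The rank-based coloring described in this caption can be sketched in a few lines. This is a minimal illustration, not the GeneVAnD implementation: the function names `rank_grayscale` and `rank_view` are hypothetical, gray levels are assumed to be spaced evenly by rank, and ties are broken by experiment order.

```python
def rank_grayscale(row):
    """Gray levels in [0, 1] for one gene (row of expression values).

    The experiment with the lowest value maps to 0.0 (black), the highest
    to 1.0 (white); the rest are spaced evenly by rank (an assumption).
    """
    n = len(row)
    if n < 2:
        return [0.0] * n
    # Experiment indices ordered from lowest to highest expression.
    order = sorted(range(n), key=lambda i: row[i])
    gray = [0.0] * n
    for rank, idx in enumerate(order):
        gray[idx] = rank / (n - 1)
    return gray


def rank_view(matrix):
    """Apply the per-gene rank coloring to every row of a gene x experiment matrix."""
    return [rank_grayscale(row) for row in matrix]
```

Because only the ordering within each gene matters, a gene whose profile is shifted up or down by a constant receives exactly the same gray levels, which is why this view is insensitive to the cutoff choices shown in panels (a-c).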
Figure 2
Rank-based visualization of synthetic data. Synthetic data displayed (a) traditionally and (b) using our rank-based method. This data was generated by creating a single sinusoidal expression profile and for each gene (row) randomly shifting that profile up or down and introducing small amounts of Gaussian random noise throughout. The result is that the genes generally follow the same shape/trend over experiments, but the shapes are shifted up/down from one another. Traditional view (a) masks the similarity between genes, but their relationship is clear in the rank-based view (b).
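A data set of this kind can be generated with a short sketch like the one below. The function name, the number of genes and experiments, the shift and noise standard deviations, and the seed are all illustrative assumptions; the caption does not state the values the authors used.

```python
import math
import random


def make_synthetic(n_genes=20, n_experiments=12, shift_sd=2.0, noise_sd=0.05, seed=0):
    """Return (base_profile, data) where each gene is a shifted, noisy copy of one sinusoid."""
    rng = random.Random(seed)
    # One full period of a sine wave across the experiments.
    base = [math.sin(2 * math.pi * j / n_experiments) for j in range(n_experiments)]
    data = []
    for _ in range(n_genes):
        shift = rng.gauss(0.0, shift_sd)  # per-gene vertical offset
        data.append([b + shift + rng.gauss(0.0, noise_sd) for b in base])
    return base, data


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)
```

Every generated gene stays highly correlated with the base profile even though the absolute values differ widely between rows, which is the situation in which a fixed color scale hides the shared shape.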
Figure 3
Rank-based visualization of time series data. Yeast cell cycle data displayed (a) traditionally and (b) using our rank-based method. In the traditional visualization the top 4 genes (within the purple box) appear to be very different from the rest of the genes in this cluster. However, using the rank-based method it becomes clear that these genes follow the same general pattern as the entire cluster, with initially low expression building up to highest expression in the central time points and then falling to roughly middle values. (Data from [22])
Figure 4
Difference display visualization. Three clusters displayed traditionally on the left and in our difference image visualization on the right. In the difference display, the large top bar on each cluster shows the cluster average, and each gene is displayed as its difference from that average (green indicates expression below the cluster average, red indicates expression above it, and black indicates expression equal to it). Cluster (a) is a coherent cluster of genes and appears very dark because of its homogeneity. Cluster (b) is another dark, uniform cluster, but it also contains one randomly inserted gene, which can be easily identified in our difference display. Cluster (c) contains a random selection of genes, and its randomness is clear from the brightness of the difference display. This difference display allows for quick assessment of overall cluster homogeneity and facilitates quick outlier detection. (Data and clusters a & b from [19])
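The difference display can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the linear color scaling, the saturation `cutoff`, and the `brightness` score used to rank genes are all hypothetical.

```python
def cluster_average(cluster):
    """Mean expression per experiment over all genes (rows) in the cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def difference_display(cluster, cutoff=1.0):
    """Return (average, differences, colors) for a gene x experiment cluster.

    Colors are (r, g, b) tuples in [0, 1]: red above the average, green
    below it, black at the average. Differences beyond +/- cutoff saturate.
    """
    avg = cluster_average(cluster)
    diffs = [[v - a for v, a in zip(gene, avg)] for gene in cluster]

    def color(d):
        s = max(-1.0, min(1.0, d / cutoff))
        return (s, 0.0, 0.0) if s > 0 else (0.0, abs(s), 0.0)

    colors = [[color(d) for d in row] for row in diffs]
    return avg, diffs, colors


def brightness(diff_row):
    """Mean absolute difference from the average; larger values look brighter."""
    return sum(abs(d) for d in diff_row) / len(diff_row)
```

Sorting genes by `brightness` gives a simple numeric counterpart to spotting the bright row in cluster (b): the inserted gene has the largest mean absolute difference from the cluster average.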
Figure 5
Experiment variation display. A cluster displayed traditionally on the left and in our difference image visualization on the right, which also shows the standard deviation within the cluster for each experiment. Black on the standard deviation bar indicates a standard deviation of zero, while white indicates a higher value. Purple arrows point to several experiments in this cluster that show high variance. In general, the high variance among some experiments may indicate that this cluster is unregulated under those conditions. In this example, we can inspect the differences from the cluster average in the high variance experiments and see that for these conditions the upper group of genes (indicated by a red box) is less under-expressed than the lower group of genes (indicated by a green box), which suggests that the cluster could be split into two sub-clusters to reduce this variation. The biological function of these genes is consistent with such a split (see web supplement for details. Data and cluster from [19])
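The standard deviation bar can be computed per experiment as in the sketch below. The caption does not say whether the population or sample standard deviation is used or how values are scaled to gray, so the population formula and scaling by the largest value in the cluster are assumptions, as are the function names.

```python
import math


def experiment_std(cluster):
    """Population standard deviation across genes for each experiment (column)."""
    n = len(cluster)
    stds = []
    for col in zip(*cluster):
        mean = sum(col) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / n))
    return stds


def std_to_gray(stds):
    """Map standard deviations to gray levels: 0.0 (black) for zero, 1.0 (white) for the largest."""
    top = max(stds)
    return [s / top if top > 0 else 0.0 for s in stds]
```

Experiments whose gray level is close to 1.0 are the ones the purple arrows would point to, and they are natural candidates for checking whether the cluster splits into sub-clusters.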
Figure 6
Dendrogram of averages. A dendrogram created from cluster averages with the genes in a cluster displayed below each average. The length of each branch of the tree is proportional to the distance between the averages. We create the hierarchy from the cluster averages, which allows us to show high level relationships between clusters generated by arbitrary clustering algorithms. (Data and clusters from [19])
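Building a hierarchy over cluster averages, independent of the algorithm that produced the clusters, can be sketched as below. The caption does not name a distance metric or linkage rule, so Euclidean distance and average linkage are assumptions, and the function names and nested-tuple tree format are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_tree(averages):
    """Agglomerative tree over cluster averages (needs at least one average).

    Leaves are cluster indices; internal nodes are (left, right, distance)
    tuples, where distance is the merge height used for branch lengths.
    """
    # Each entry pairs a subtree with the leaf indices it contains.
    nodes = [(i, [i]) for i in range(len(averages))]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                pairs = [(a, b) for a in nodes[i][1] for b in nodes[j][1]]
                # Average linkage: mean distance over all leaf pairs.
                d = sum(euclidean(averages[a], averages[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = ((nodes[i][0], nodes[j][0], d), nodes[i][1] + nodes[j][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]
```

The input is one average profile per cluster, for example the output of any k-means or self-organizing map run, so the same tree-building step applies regardless of how the clusters were obtained.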
Figure 7
Principal component projection visualization. A projection of genes from a cell cycle data set into a 3D space defined by user-selected Principal Components. Genes in each cluster are colored by phase (Red-G1, Green-S, Blue-G2, Yellow-M, and Cyan-M/G1). Cluster averages are displayed by larger solid spheres. The much larger transparent spheres show the region within one standard deviation of the average. (a) shows the top ten PCs of this data set and the percent of variance accounted for by each PC. (b) is a projection of cell cycle genes onto a space defined by the 1st and 2nd PCs. The separation is poor because the first PC is highly correlated with noise in this data set. (c) shows the same data projected into a space defined by the 2nd, 3rd, and 4th PCs. These PCs are highlighted in (a) corresponding to the axis colors in (c). Notice that the cell cycle phases are separated in order around the origin, and that G1 and M phase genes are opposite each other, which is consistent with their opposing expression profiles. (Data and clusters from [22]).
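Projecting genes onto user-selected principal components can be sketched with NumPy as below. This is an illustration, not the GeneVAnD code: the function names are hypothetical, PCA is computed by SVD of the column-centered matrix, and the sphere radius is taken as the root-mean-square distance of cluster members from their center, which is one possible reading of "one standard deviation".

```python
import numpy as np


def pca_project(expr, components=(0, 1, 2)):
    """Project genes (rows) onto the chosen principal components (0-based indices).

    Returns (coords, var_ratio): the projected coordinates and the fraction
    of total variance explained by each component, in descending order.
    """
    X = np.asarray(expr, dtype=float)
    Xc = X - X.mean(axis=0)  # center each experiment (column)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = S ** 2 / np.sum(S ** 2)
    coords = Xc @ Vt[list(components)].T
    return coords, var_ratio


def cluster_sphere(coords, members):
    """Center and radius (RMS distance from the center) for one cluster's genes."""
    pts = coords[list(members)]
    center = pts.mean(axis=0)
    radius = np.sqrt(((pts - center) ** 2).sum(axis=1).mean())
    return center, radius
```

Calling `pca_project(expr, components=(1, 2, 3))` corresponds to panel (c), where the first PC is skipped and the 2nd, 3rd, and 4th PCs define the axes; `var_ratio` supplies the percentages shown in panel (a).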
Figure 8
Multiple simultaneous views. A screenshot of GeneVAnD displaying clustered data. The panels shown are the expression level window on the left which can toggle between traditional, difference, and rank-based displays, and the PC projection window on the right. A selected gene is highlighted in blue in all views.
Figure 9
Large scale display. GeneVAnD in use on a large-scale display wall. The high resolution enables display of more information simultaneously, and the large scale creates an environment conducive to collaboration between multiple researchers.

References

    1. Kerr MK, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci U S A. 2001;98:8961–5.
    2. Yeung KY, Haynor DR, Ruzzo WL. Validating clustering for gene expression data. Bioinformatics. 2001;17:309–18.
    3. Mendez MA, Hodar C, Vulpe C, Gonzalez M, Cambiazo V. Discriminant analysis to evaluate clustering of gene expression data. FEBS Lett. 2002;522:24–8.
    4. Datta S, Datta S. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics. 2003;19:459–66.
    5. Munich Information Center for Protein Sequences (MIPS) http://mips.gsf.de/
