Comparative Study
BMC Bioinformatics. 2010 Nov 18:11:567.
doi: 10.1186/1471-2105-11-567.

Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data


Christoph Bartenhagen et al. BMC Bioinformatics. 2010.

Abstract

Background: Visualization of DNA microarray data in two- or three-dimensional spaces is an important exploratory analysis step for detecting quality issues or generating new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data.
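As a point of reference for the linear baseline mentioned above, the following is a minimal Python sketch of a PCA projection onto two components. It is not the authors' implementation: it uses power iteration with deflation in pure Python, and the function names (`center`, `top_component`, `pca_project`) and the toy expression matrix are hypothetical choices made only for illustration.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so every feature (gene) has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def top_component(rows, iters=200, seed=0):
    """Power iteration for the leading eigenvector of X^T X, computed as X^T (X v)."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(r, v)) for r in rows]            # X v
        v = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(w * w for w in v)) or 1.0  # guard against a zero vector
        v = [w / norm for w in v]
    return v


def pca_project(rows, k=2):
    """Return each sample's scores on the first k principal components."""
    residual = center(rows)
    scores = [[] for _ in rows]
    for _ in range(k):
        v = top_component(residual)
        proj = [sum(x * w for x, w in zip(r, v)) for r in residual]
        for s, p in zip(scores, proj):
            s.append(p)
        # Deflation: remove the direction just found before searching for the next one.
        residual = [[x - p * w for x, w in zip(r, v)] for r, p in zip(residual, proj)]
    return scores


# Hypothetical toy matrix: 6 samples x 5 genes, second group shifted in the first 3 genes.
rng = random.Random(1)
samples = [
    [rng.gauss(0.0, 0.1) + (3.0 if i >= 3 and j < 3 else 0.0) for j in range(5)]
    for i in range(6)
]
embedding = pca_project(samples, k=2)
```

For real microarray matrices a library routine based on the singular value decomposition would normally be preferred; power iteration is used here only to keep the sketch free of dependencies.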

Results: A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations, was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two- and three-dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps were less sensitive to noise than the other nonlinear methods.
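The cluster validation part of the benchmark relies on the Davies-Bouldin index (named in the Figure 1 caption). The sketch below implements the commonly used definition of that index in plain Python; the abstract does not say which variant the authors computed, so the formula, the function names and the toy points are assumptions for illustration only.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points, returned as a tuple."""
    d = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(d))


def davies_bouldin(points, labels):
    """Mean over clusters i of max_j (s_i + s_j) / d_ij.

    s_i is the average distance of cluster i's members to its centroid and
    d_ij is the distance between centroids i and j. Requires at least two
    clusters with distinct centroids.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(tuple(p))
    names = sorted(clusters)
    centers = {c: centroid(clusters[c]) for c in names}
    scatter = {
        c: sum(math.dist(p, centers[c]) for p in clusters[c]) / len(clusters[c])
        for c in names
    }
    total = 0.0
    for i in names:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in names
            if j != i
        )
    return total / len(names)


# Hypothetical 2-D embeddings of four labeled samples.
labels = ["a", "a", "b", "b"]
well_separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
close_together = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

db_separated = davies_bouldin(well_separated, labels)  # 0.1
db_close = davies_bouldin(close_together, labels)      # 1.0
```

Lower values indicate more compact, better separated clusters, which is why the well separated toy embedding scores 0.1 and the crowded one scores 1.0.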

Conclusions: Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and thus are favorable alternatives for the visualization of microarray data.


Figures

Figure 1
Benchmark. Our benchmark consists of three independent procedures. (1) Dimension reduction: Every dataset is mapped into a low-dimensional target space. All necessary parameters are determined by leave-one-out cross-validation (loo-cv) with an SVM. (2) Classification: Every dataset is classified by an SVM with a Gaussian kernel during 100 randomization steps. A gradient descent procedure estimates all SVM parameters. (3) Cluster validation: The Davies-Bouldin index measures the distance between labeled clusters within the low-dimensional data.
Figure 2
Visualization. Two-dimensional visualization example of the Haferlach et al. leukemia dataset. The LLE and Isomap embeddings show more distinct clusters than the first two principal components of a PCA.
Figure 3
Randomization accuracies. Support Vector Machine classification accuracies of the Wang et al. breast cancer dataset. The data were randomized 100 times for fixed target space dimensions two (left) and three (center) and for dimensions estimated by leave-one-out cross-validation (right). In the last case, the plot also shows the results for the original high-dimensional data without reduction. Especially in two and three dimensions, all nonlinear methods are superior to PCA.
Figure 4
Cluster validation. Davies-Bouldin indices of the reduced Wang et al. breast cancer dataset for fixed target space dimensions 2, 3, 5 and 10. In most cases, the nonlinear methods produce more distinct clusters than PCA.
Figure 5
Noise evaluation. Randomization accuracies (left) and cluster indices (right) for the Wang et al. breast cancer dataset combined with normally distributed noise with zero mean and different variances. PCA and Diffusion Maps (DM) are the most stable under noise. All other methods lead to varying accuracies and cluster scores.
Figure 6
Simulated data. Leave-one-out cross-validation accuracies of the simulated datasets with increasing differential features. The red line marks the average accuracy of all 100 generated datasets within the 95% quantile. In two-dimensional target spaces, LLE and Isomap (IM) capture more of the underlying structure of the data with far fewer significant features than PCA.
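The leave-one-out cross-validation used for the simulated datasets can be illustrated with a short Python sketch. Several parts are assumptions rather than the authors' setup: the data generator (`simulate`, Gaussian profiles with a hypothetical `shift` on the differential genes) is invented for illustration, no dimension reduction is applied, and a nearest-centroid classifier stands in for the SVM so that the example needs only the standard library.

```python
import random


def simulate(n_per_class, n_genes, n_diff, shift=1.5, seed=0):
    """Two classes of Gaussian profiles; the first n_diff genes differ by `shift` (assumed model)."""
    rng = random.Random(seed)
    data, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(0.0, 1.0) + (shift if label == 1 and g < n_diff else 0.0)
                for g in range(n_genes)
            ]
            data.append(row)
            labels.append(label)
    return data, labels


def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (not an SVM): predict the class whose mean profile is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        center = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, center))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_cv_accuracy(data, labels):
    """Hold out each sample once, train on the rest, return the fraction predicted correctly."""
    hits = 0
    for i in range(len(data)):
        train_x = data[:i] + data[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, data[i]) == labels[i]
    return hits / len(data)


# Accuracy as the number of differential genes grows (0, 5 and 25 out of 50).
accuracies = {
    n_diff: loo_cv_accuracy(*simulate(20, 50, n_diff)) for n_diff in (0, 5, 25)
}
```

With no differential genes the held-out accuracy stays near chance level, and it rises as more genes carry class signal, which is the kind of trend the Figure 6 caption describes for the reduced data.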

