BMC Bioinformatics. 2010 Mar 26:11:158.
doi: 10.1186/1471-2105-11-158.

Integrating gene expression and GO classification for PCA by preclustering

Jorn R De Haan et al. BMC Bioinformatics. 2010.

Abstract

Background: Gene expression data can be analyzed by summarizing groups of individual gene expression profiles based on GO annotation information. The mean expression profile per group can then be used to identify interesting GO categories in relation to the experimental settings. However, the expression profiles present in GO classes are often heterogeneous, i.e., there are several different expression profiles within one class. As a result, important experimental findings can be obscured because the summarizing profile does not seem to be of interest. We propose to tackle this problem by finding homogeneous subclasses within GO categories: preclustering.
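The masking effect described above can be illustrated with a minimal sketch on synthetic data. The profiles, the group sizes, and the splitting rule (sign of the leading eigenvector of the gene-gene correlation matrix) are assumptions chosen for illustration; they stand in for a preclustering step and are not the authors' algorithm.

```python
import numpy as np

# Synthetic "GO class": 5 genes following sin(t) and 5 following -sin(t).
# All values are invented for illustration only.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 12)
up = np.sin(t) + 0.1 * rng.standard_normal((5, t.size))
down = -np.sin(t) + 0.1 * rng.standard_normal((5, t.size))
go_class = np.vstack([up, down])

# Summarizing the whole class: the anti-correlated halves cancel out.
class_mean = go_class.mean(axis=0)


def split_by_correlation(profiles):
    """Toy two-way split (an assumption, not the paper's method):
    genes are grouped by the sign of their entry in the leading
    eigenvector of the gene-gene correlation matrix."""
    corr = np.corrcoef(profiles)
    _, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    leading = eigvecs[:, -1]
    return leading >= 0


labels = split_by_correlation(go_class)
sub_means = [go_class[labels == flag].mean(axis=0) for flag in (True, False)]

print("amplitude of whole-class mean:", np.ptp(class_mean))
print("amplitude of subclass means:  ", [np.ptp(m) for m in sub_means])
```

The whole-class mean is nearly flat, while each subclass mean keeps the full oscillation, which is the situation preclustering is meant to expose.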

Results: Two microarray datasets are analyzed. First, a selection of genes from a well-known Saccharomyces cerevisiae dataset is used. The GO class "cell wall organization and biogenesis" is shown as a specific example. After preclustering, this term can be associated with different phases in the cell cycle, where it could not be associated with a specific phase previously. Second, a dataset of differentiation of human Mesenchymal Stem Cells (MSC) into osteoblasts is used. For this dataset results are shown in which the GO term "skeletal development" is a specific example of a heterogeneous GO class for which better associations can be made after preclustering. The Intra Cluster Correlation (ICC), a measure of cluster tightness, is applied to identify relevant clusters.
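The abstract does not give a formula for the ICC. As a hedged sketch, the function below assumes one common definition of cluster tightness: the mean pairwise Pearson correlation between the expression profiles in a group. The function name and the example values are hypothetical.

```python
import numpy as np


def intra_cluster_correlation(profiles):
    """Mean pairwise Pearson correlation of the rows of `profiles`.

    Assumed definition for illustration; the paper's exact ICC
    formula is not stated in the abstract.
    """
    profiles = np.asarray(profiles, dtype=float)
    if profiles.ndim != 2 or profiles.shape[0] < 2:
        raise ValueError("need at least two profiles (rows)")
    corr = np.corrcoef(profiles)
    # Use each gene pair once: strictly upper triangle of the matrix.
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float(upper.mean())


# Invented example profiles (rows are genes, columns are time points).
tight = [[1, 2, 3, 4], [2, 4, 6, 8]]
mixed = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]

icc_tight = intra_cluster_correlation(tight)
icc_mixed = intra_cluster_correlation(mixed)
print(icc_tight, icc_mixed)
```

Under this definition a homogeneous subgroup scores close to 1, while a class that mixes anti-correlated profiles scores much lower, so the value can be used to rank or filter clusters.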

Conclusions: We show that this method leads to an improved interpretability of results in Principal Component Analysis.
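For readers unfamiliar with the PCA step, the following minimal sketch computes scores and loadings from a matrix of summary profiles (one row per GO category or subgroup, one column per condition) using a singular value decomposition. The matrix values and the function name are invented; the column order DEX, BMP, VIT, UNT only borrows the treatment labels of the MSC dataset, and the authors' preprocessing may differ.

```python
import numpy as np


def pca_scores_loadings(summary, n_components=2):
    """PCA via SVD on a column-centered matrix.

    Rows are categories (they become scores); columns are
    conditions (they become loadings).
    """
    X = np.asarray(summary, dtype=float)
    Xc = X - X.mean(axis=0)  # center each condition
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components].T
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return scores, loadings, explained


# Hypothetical mean profiles for four categories under
# four conditions (columns: DEX, BMP, VIT, UNT).
summary = np.array([
    [2.0, 0.5, 0.1, 0.0],
    [1.8, 0.4, 0.2, 0.1],
    [0.1, 1.9, 0.3, 0.0],
    [0.0, 0.2, 0.1, 0.1],
])

scores, loadings, explained = pca_scores_loadings(summary, n_components=2)
print("scores:\n", scores)
print("loadings:\n", loadings)
print("explained variance fraction:", explained)
```

Categories whose scores point in the same direction as a condition's loading vector are the ones most associated with that condition. Replacing a heterogeneous class by its tighter subgroups changes the rows of the input matrix, not the decomposition itself.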

Figures

Figure 1
Example of heterogeneous expression profiles within a single GO class. Expression profiles for genes annotated with the term GO:0007047 are depicted here for the cdc15 synchronization of the Saccharomyces cerevisiae Cell Cycle dataset [12]. The first panel (A), which overlays all profiles, shows that the profiles vary widely. More homogeneous subgroups, with profiles that are shifted in time or even anti-correlated, can be identified (B, C, D).
Figure 2
Plot showing the relation between the ICC of the grouped genes from original GO classes (x-axis) and the new subgroups (y-axis). Each point represents a class. The GO term GO:0007047 is marked with three asterisks, one for each subclass.
Figure 3
Visual representation of PCA results for the YCC dataset. The PCA results without (A) and with preclustering (B) are shown. Several categories in the preclustered PCA lie farther from the origin, and at a different location, than in the PCA without preclustering. Categories (PCA scores) are shown as points, which can be correlated with phases of the cell cycle (connected by lines). Dark points have an ICC larger than 0.2. The category "cell wall organization and biogenesis" (GO:0007047) is marked with star symbols. Cell phases are indicated by the names (G1, S, G2, M and G1/M) also used in [12].
Figure 4
Representation of the ICC for GO categories evaluated with PCA for the MSC dataset. The ICC values of the whole GO class are on the x-axis and the ICC of the corresponding subgroup(s) are on the y-axis. A group of classes expected to be involved in the MSC dataset is marked with black dots, other classes are marked with grey dots.
Figure 5
Visual representation of PCA results for the MSC dataset, without (A) and with preclustering (B). The dots (scores) represent GO categories or subgroups, and the arrows (loadings) are the treatments with which the categories can be correlated (indicated with DEX, BMP, VIT and UNT). Only a subset of GO classes is depicted, to focus on cell differentiation and osteogenesis. In Figure 5A the 24 original GO classes are shown and in Figure 5B the 79 subclasses. The stars identify the GO term GO:0001501 (skeletal development).
Figure 6
ROC plot illustrating the improved identification of interesting terms after preclustering. The sensitivity and specificity of identifying 24 relevant GO terms were calculated to draw the curves. The curve generated from the preclustered data (grey line) is more sensitive and specific than the curve from the original data without preclustering (black line).

References

    1. Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
    2. Tavazoie S, Hughes J, Campbell M, Cho R, Church G. Systematic determination of genetic network architecture. Nature Genetics. 1999;22:281–285. doi: 10.1038/10343.
    3. Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W. Model-based clustering and data transformations for gene expression data. Bioinformatics. 2001;17:977–987. doi: 10.1093/bioinformatics/17.10.977.
    4. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
    5. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations and open problems. Bioinformatics. 2005;21:3587–3595. doi: 10.1093/bioinformatics/bti565.