2015 Jun 4:16:184.
doi: 10.1186/s12859-015-0614-0.

UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets

Basel Abu-Jamous et al. BMC Bioinformatics.

Abstract

Background: Collective analysis of the growing number of gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by applying them to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets.

Results: Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method can identify the subsets of genes consistently co-expressed in one subset of datasets while being poorly co-expressed in another subset of datasets, and can also identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data with the realistic expression values of real data, and therefore overcomes the problem of faithfulness in synthetic expression data modelling. By applying it to those datasets, we validate UNCLES and compare it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by applying it to a set of 14 real genome-wide yeast datasets, where it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn.

Conclusions: The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

Figures

Fig. 1
The structure of the six synthetic microarray datasets. The cluster C1 (g1 to g75) includes genes consistently co-expressed over all of the six datasets, and the cluster C2 (g76 to g160) includes genes consistently co-expressed only in the positive set of datasets (P1, P2, and P3) while being poorly co-expressed in the negative set of datasets (N1, N2, and N3). The rest of the genome (C0) includes genes poorly co-expressed everywhere
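The ground-truth layout described in this caption can be written down as a small lookup. This is only an illustrative sketch: the function names are hypothetical, and the genome size of 1,200 is an assumption borrowed from one of the genome sizes mentioned in the later captions, not something this caption fixes.

```python
from collections import Counter

# Dataset names as given in the caption.
POSITIVE = ("P1", "P2", "P3")
NEGATIVE = ("N1", "N2", "N3")


def ground_truth_label(gene_index):
    """Return the ground-truth cluster of gene g<gene_index> (1-based)."""
    if 1 <= gene_index <= 75:
        return "C1"  # co-expressed in all six datasets
    if 76 <= gene_index <= 160:
        return "C2"  # co-expressed only in the positive datasets
    return "C0"      # rest of the genome, poorly co-expressed everywhere


def is_coexpressed(label, dataset):
    """Whether genes with this label are expected to be co-expressed in a dataset."""
    if label == "C1":
        return True
    if label == "C2":
        return dataset in POSITIVE
    return False


# Assumed genome size (hypothetical choice for this sketch).
genome_size = 1200
labels = [ground_truth_label(i) for i in range(1, genome_size + 1)]
counts = Counter(labels)
print(counts["C1"], counts["C2"], counts["C0"])
```

A clustering result can then be scored against `labels`, for example by counting how many genes of a recovered cluster carry the label "C2".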
Fig. 2
Synthetic data ground truth clusters C1 and C2 expression profiles. Each plot in this grid of plots shows the normalised expression profiles of the 75 and 85 genes respectively included in the ground truth clusters C1 and C2 in each of the six synthetic datasets. The horizontal axis is the samples axis whose range in each subplot is equal to the number of samples of the corresponding dataset. The vertical axis is the normalised expression value. Note that C1 is consistently co-expressed in all of the six datasets while C2 is only consistently co-expressed in the positive datasets P1, P2, and P3
Fig. 3
Flow chart summary for UNCLES with type B of external specifications
Fig. 4
M-N and F-P scatter plots of the synthetic data clusters C1 and C2 generated by UNCLES and by other methods. The selected clusters in the M-N plots are marked by solid grey circles, and their corresponding points in the F-P plots are marked by solid grey circles as well. The red stars in any of the M-N or F-P plots represent the clusters produced by the UNCLES method while the blue squares in the F-P plots represent the clusters produced by the other methods
Fig. 5
Demonstration of the iterative process of selecting the best four clusters from both types A and B using M-N plots while analysing the synthetic datasets with a GS of 1,200. The union of the scattered black squares and red stars in the M-N plots of the first column represents all of the clusters generated at all of the K values and at all of the δ or (δ+, δ-) values. The big solid blue circle represents the best cluster, i.e. the cluster closest to the top left corner. The red stars represent the clusters which share at least one gene with that best cluster. Moving through the plots from the left to the right, the clusters marked by red stars are removed and the process is repeated iteratively over the remaining clusters. The first four iterations for types A and B are shown in this Figure
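The iterative selection described in this caption can be sketched as a greedy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each cluster is given as a tuple `(x, y, genes)` whose plot coordinates have already been scaled to [0, 1], so that the top left corner is the point (0, 1). The function name `select_best_clusters` and the example data are hypothetical.

```python
from math import hypot


def select_best_clusters(clusters, n_select=4):
    """Greedily pick clusters closest to the top left corner of the plot.

    clusters: iterable of (x, y, genes), with x and y assumed scaled to
    [0, 1] and genes given as a set of gene identifiers.
    """
    remaining = list(clusters)
    selected = []
    while remaining and len(selected) < n_select:
        # Best cluster = smallest Euclidean distance to the corner (0, 1).
        best = min(remaining, key=lambda c: hypot(c[0], 1.0 - c[1]))
        selected.append(best)
        # Remove the best cluster and every cluster sharing a gene with it.
        remaining = [
            c for c in remaining
            if c is not best and not (c[2] & best[2])
        ]
    return selected


# Hypothetical example data: four candidate clusters.
clusters = [
    (0.1, 0.9, frozenset({"g1", "g2", "g3"})),
    (0.2, 0.8, frozenset({"g3", "g4"})),   # shares g3 with the first cluster
    (0.3, 0.7, frozenset({"g5", "g6"})),
    (0.6, 0.95, frozenset({"g7"})),
]
picked = select_best_clusters(clusters, n_select=2)
print([sorted(c[2]) for c in picked])
```

In this example the second candidate is discarded after the first iteration because it overlaps the best cluster, so the second pick comes from the remaining, non-overlapping candidates.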
Fig. 6
M-N and F-P scatter plots of the synthetic data clusters C1 and C2 generated by UNCLES, weighted by datasets’ numbers of samples, and by other methods. The selected clusters in the M-N plots are marked by solid grey circles, and their corresponding points in the F-P plots are marked by solid grey circles as well. The red stars in any of the M-N or F-P plots represent the clusters produced by the UNCLES method while the blue squares in the F-P plots represent the clusters produced by the other methods
Fig. 7
F-P scatter plots of the best C1 and C2 clusters selected by the M-N plots after applying UNCLES types A and B to the synthetic datasets with added noise. Each sub-plot is an F-P scatter plot with ten points representing the best C1 or C2 cluster identified by each of the ten repetitions of the experiment of adding noise to the datasets, clustering by the UNCLES method, and cluster selection by the M-N scatter plots. This experiment has been performed for each of the five different adopted genome-sizes from 1,200 to 7,000. The scale of each F-P plot, which has been omitted for clarity, is from zero to unity for both dimensions
Fig. 8
Synthetic data ground truth clusters C1 and C2 combined expression profiles from all of the six datasets. The vertical dashed lines show the boundaries between the samples belonging to each of the six datasets in their respective order of P1, P2, P3, N1, N2, and N3. C1 shows consistent co-expression over all of the combined 82 samples (data matrix columns), while C2 shows consistent co-expression only over the first 42 samples
Fig. 9
Demonstration of the iterative process of selecting the best four yeast clusters from both types A and B using M-N plots. The union of the scattered black squares and red stars in the M-N plots of the first column represents all of the clusters generated at all of the K values and at all of the δ or (δ+, δ-) values. The big solid blue circle represents the best cluster, i.e. the cluster closest to the top left corner. The red stars represent the clusters which share at least one gene with that best cluster. Moving through the plots from the left to the right, the clusters marked by red stars are removed and the process is repeated iteratively over the remaining clusters. The first four iterations for types A and B are shown in this Figure
Fig. 10
Distances from the top left corners of the M-N plots for the yeast clusters selected at the first six iterations for types A and B
Fig. 11
The normalised genetic expression profiles of the genes included in the selected yeast clusters from both UNCLES types A and B in two S+ and two S- representative datasets
