BMC Bioinformatics. 2015 Jun 4:16:184.

doi: 10.1186/s12859-015-0614-0.

UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets

Basel Abu-Jamous¹, Rui Fa², David J Roberts³ ⁴, Asoke K Nandi⁵ ⁶

Affiliations

¹ Department of Electronic and Computer Engineering, Brunel University London, Uxbridge, Middlesex, UB8 3PH, UK. basel.abujamous@brunel.ac.uk.
² Department of Electronic and Computer Engineering, Brunel University London, Uxbridge, Middlesex, UB8 3PH, UK. rui.fa@brunel.ac.uk.
³ National Health Service Blood and Transplant, Oxford, OX3 9BQ, UK. david.roberts@ndcls.ox.ac.uk.
⁴ Radcliffe Department of Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DU, UK. david.roberts@ndcls.ox.ac.uk.
⁵ Department of Electronic and Computer Engineering, Brunel University London, Uxbridge, Middlesex, UB8 3PH, UK. asoke.nandi@brunel.ac.uk.
⁶ Department of Mathematical Information Technology, University of Jyväskylä, Jyväskylä, Finland. asoke.nandi@brunel.ac.uk.

PMID: 26040489
PMCID: PMC4453228
DOI: 10.1186/s12859-015-0614-0

Basel Abu-Jamous et al. BMC Bioinformatics. 2015.

Abstract

Background: Collective analysis of the growing number of gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify, in a tuneable manner, the subsets of genes which are consistently co-expressed in all of the provided datasets. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets.

Results: Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method can identify the subsets of genes consistently co-expressed in one subset of datasets while being poorly co-expressed in another subset of datasets, and can also identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data with the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, where it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn.

Conclusions: The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

Figures

**Fig. 1**
The structure of the six synthetic microarray datasets. The cluster C1 (*g1 to g75*) includes genes consistently co-expressed over all of the six datasets, and the cluster C2 (*g76 to g160*) includes genes consistently co-expressed only in the positive set of datasets (*P1, P2, and P3*) while being poorly co-expressed in the negative set of datasets (*N1, N2, and N3*). The rest of the genome (C0) includes genes poorly co-expressed everywhere

**Fig. 2**
Synthetic data ground truth clusters C1 and C2 expression profiles. Each plot in this grid of plots shows the normalised expression profiles of the 75 and 85 genes respectively included in the ground truth clusters C1 and C2 in each of the six synthetic datasets. The horizontal axis is the samples axis whose range in each subplot is equal to the number of samples of the corresponding dataset. The vertical axis is the normalised expression value. Note that C1 is consistently co-expressed in all of the six datasets while C2 is only consistently co-expressed in the positive datasets *P1, P2,* and P3

**Fig. 3**
Flow chart summary for UNCLES with type B of external specifications

**Fig. 4**
*M-N* and *F-P* scatter plots of the synthetic data clusters C1 and C2 generated by UNCLES and by other methods. The selected clusters in the M-N plots are marked by solid grey circles, and their corresponding points in the *F-P* plots are marked by solid grey circles as well. The red stars in any of the *M-N* or *F-P* plots represent the clusters produced by the UNCLES method while the blue squares in the *F-P* plots represent the clusters produced by the other methods

**Fig. 5**
Demonstration of the iterative process of selecting the best four clusters from both types A and B using *M-N* plots while analysing the synthetic datasets with a GS of 1,200. The union of the scattered black squares and red stars in the *M-N* plots of the first column represents all of the clusters generated at all of the K values and at all of the δ or (δ+, δ-) values. The big solid blue circle represents the best cluster, i.e. the cluster closest to the top left corner. The red stars represent the clusters which share at least one gene with that best cluster. Moving through the plots from the left to the right, the clusters marked by red stars are removed and the process is repeated iteratively over the remaining clusters. The first four iterations for types A and B are shown in this Figure
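
The iterative selection described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes each cluster is given as a hypothetical `(x, y, genes)` tuple whose M-N plot coordinates are already scaled to the range zero to one, so that the top left corner is the point (0, 1), and it assumes Euclidean distance to that corner. The function name `select_clusters` and the toy data are invented for the example.

```python
import math

def select_clusters(clusters, n_select=4):
    """Greedy selection on an M-N-style scatter plot (illustrative sketch).

    clusters: list of (x, y, genes) tuples, where (x, y) are the cluster's
    plot coordinates (assumed scaled to [0, 1]) and genes is a set of gene IDs.
    """
    remaining = list(clusters)
    selected = []
    while remaining and len(selected) < n_select:
        # Best cluster = the one closest to the top left corner, assumed at (0, 1).
        best = min(remaining, key=lambda c: math.hypot(c[0] - 0.0, c[1] - 1.0))
        selected.append(best)
        # Drop the best cluster and every cluster sharing at least one gene with it.
        remaining = [c for c in remaining
                     if c is not best and c[2].isdisjoint(best[2])]
    return selected

# Toy data (hypothetical coordinates and gene IDs).
toy_clusters = [
    (0.1, 0.9, {"g1", "g2"}),
    (0.2, 0.8, {"g2", "g3"}),  # overlaps the first cluster via g2
    (0.3, 0.7, {"g4"}),
    (0.9, 0.1, {"g5"}),
]
chosen = select_clusters(toy_clusters, n_select=2)
```

With the toy data, the first pick is the cluster at (0.1, 0.9); the cluster at (0.2, 0.8) is then discarded because it shares `g2`, so the second pick is the cluster at (0.3, 0.7).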

**Fig. 6**
*M-N* and *F-P* scatter plots of the synthetic data clusters C1 and C2 generated by UNCLES, weighted by datasets’ numbers of samples, and by other methods. The selected clusters in the *M-N* plots are marked by solid grey circles, and their corresponding points in the *F-P* plots are marked by solid grey circles as well. The red stars in any of the *M-N* or *F-P* plots represent the clusters produced by the UNCLES method while the blue squares in the *F-P* plots represent the clusters produced by the other methods

**Fig. 7**
*F-P* scatter plots of the best C1 and C2 clusters selected by the M-N plots after applying UNCLES types A and B to the synthetic datasets with added noise. Each sub-plot is an *F-P* scatter plot with ten points representing the best C1 or C2 cluster identified by each of the ten repetitions of the experiment of adding noise to the datasets, clustering by the UNCLES method, and cluster selection by the M-N scatter plots. This experiment has been performed for each of the five different adopted genome-sizes from 1,200 to 7,000. The scale of each F-P plot, which has been omitted for clarity, is from zero to unity for both dimensions

**Fig. 8**
Synthetic data ground truth clusters C1 and C2 combined expression profiles from all of the six datasets. The vertical dashed lines show the boundaries between the samples belonging to each of the six datasets in their respective order of *P1, P2, P3, N1, N2,* and *N3*. C1 shows consistent co-expression over all of the combined 82 samples (data matrix columns), while C2 shows consistent co-expression only over the first 42 samples

**Fig. 9**
Demonstration of the iterative process of selecting the best four yeast clusters from both types A and B using *M-N* plots. The union of the scattered black squares and red stars in the *M-N* plots of the first column represents all of the clusters generated at all of the K values and at all of the δ or (δ+, δ-) values. The big solid blue circle represents the best cluster, i.e. the cluster closest to the top left corner. The red stars represent the clusters which share at least one gene with that best cluster. Moving through the plots from the left to the right, the clusters marked by red stars are removed and the process is repeated iteratively over the remaining clusters. The first four iterations for types A and B are shown in this Figure

**Fig. 10**
Distances from the top left corners of the M-N plots for the yeast clusters selected at the first six iterations for types A and B

**Fig. 11**
The normalised genetic expression profiles of the genes included in the selected yeast clusters from both UNCLES types A and B in two S+ and two S- representative datasets
