Bioinformatics. 2012 Dec 15;28(24):3290-7.

doi: 10.1093/bioinformatics/bts595. Epub 2012 Oct 9.

Bayesian correlated clustering to integrate multiple datasets

Paul Kirk¹, Jim E Griffin, Richard S Savage, Zoubin Ghahramani, David L Wild

PMID: 23047558
PMCID: PMC3519452
DOI: 10.1093/bioinformatics/bts595

Paul Kirk et al. Bioinformatics. 2012.

Affiliation

¹ Systems Biology Centre, University of Warwick, Coventry, CV4 7AL, UK.


Abstract

Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct, but often complementary, information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets.
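As a rough illustration of how agreement parameters can couple otherwise independent mixture models, the sketch below computes a joint prior over the component labels of a single observation across several datasets. The abstract does not state the functional form; the sketch assumes the weight of a label combination is the product of the per-dataset mixture weights, multiplied by (1 + phi) for every pair of datasets that share a label. The names `allocation_weight`, `allocation_prior`, `pi` and `phi` are illustrative and do not come from any released MDI software.

```python
from itertools import product


def allocation_weight(labels, pi, phi):
    """Unnormalised prior weight of one observation's labels across K datasets.

    labels: tuple with one component label per dataset
    pi:     pi[k][c] is the mixture weight of component c in dataset k
    phi:    phi[(k, l)] >= 0 is the assumed agreement parameter for datasets k < l
    """
    weight = 1.0
    for k, label in enumerate(labels):
        weight *= pi[k][label]
    for k in range(len(labels)):
        for l in range(k + 1, len(labels)):
            # Assumed form: up-weight combinations in which two datasets agree.
            if labels[k] == labels[l]:
                weight *= 1.0 + phi[(k, l)]
    return weight


def allocation_prior(pi, phi):
    """Normalise the weights over every label combination (small cases only)."""
    n_datasets = len(pi)
    n_components = len(pi[0])
    weights = {
        labels: allocation_weight(labels, pi, phi)
        for labels in product(range(n_components), repeat=n_datasets)
    }
    total = sum(weights.values())
    return {labels: w / total for labels, w in weights.items()}


# Two datasets, two components each, uniform mixture weights.
pi = [[0.5, 0.5], [0.5, 0.5]]
independent = allocation_prior(pi, {(0, 1): 0.0})  # phi = 0: no coupling
coupled = allocation_prior(pi, {(0, 1): 3.0})      # phi = 3: agreement favoured
```

With phi set to 0 the two datasets are clustered independently and each of the four label combinations has prior probability 0.25. With phi set to 3 the two agreeing combinations each rise to 0.4, so the prior probability that the observation shares a label across the two datasets is 0.8.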

Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI's performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip (ChIP-chip) and protein-protein interaction (PPI) data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques, as well as to non-integrative approaches, demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods.

Figures

**Fig. 1.**
Graphical representation of three DMA mixture models. (a) Independent case. (b) The MDI model. In both (a) and (b), each observation in dataset k is generated by one mixture component. The prior probabilities associated with the distinct component allocation variables are given in a vector, which is itself assigned a symmetric Dirichlet prior. The parameter vector for component c in dataset k is assigned a prior. In (b), there are additional parameters, each of which models the dependence between the component allocations of observations in dataset k and those in another dataset.

**Fig. 2.**
(a) The data for the six-dataset synthetic example, separated into seven clusters. (b) A representation of how the cluster labels associated with each gene vary from dataset to dataset. Genes are ordered so that the clustering of Dataset 1 is the one that appears coherent. (c) A table showing the number of genes having the same cluster labels in datasets i and j. (d) A heatmap depiction of the similarity matrix formed by calculating the adjusted Rand index (ARI) between pairs of datasets.
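The two summaries in panels (c) and (d) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: `same_label_count` counts genes whose labels match exactly in two datasets, and `adjusted_rand_index` uses the usual contingency-table definition of the ARI. The function names and the toy label vectors are made up for the example.

```python
from collections import Counter
from math import comb


def same_label_count(labels_i, labels_j):
    """Number of genes carrying the identical cluster label in both datasets."""
    return sum(a == b for a, b in zip(labels_i, labels_j))


def adjusted_rand_index(labels_i, labels_j):
    """Adjusted Rand index between two clusterings of the same genes."""
    n = len(labels_i)
    if n != len(labels_j) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    # Pairs of genes placed together in both, in the first, and in the second.
    together_both = sum(comb(v, 2) for v in Counter(zip(labels_i, labels_j)).values())
    together_i = sum(comb(v, 2) for v in Counter(labels_i).values())
    together_j = sum(comb(v, 2) for v in Counter(labels_j).values())
    expected = together_i * together_j / comb(n, 2)
    maximum = (together_i + together_j) / 2
    if maximum == expected:
        # Degenerate case, e.g. both clusterings put every gene in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy label vectors for six genes in three hypothetical datasets.
dataset_1 = [0, 0, 1, 1, 2, 2]
dataset_2 = [1, 1, 0, 0, 2, 2]  # same partition as dataset_1, labels permuted
dataset_3 = [0, 1, 0, 1, 0, 1]  # unrelated partition

count_12 = same_label_count(dataset_1, dataset_2)
ari_12 = adjusted_rand_index(dataset_1, dataset_2)
ari_13 = adjusted_rand_index(dataset_1, dataset_3)
```

The toy data show why the two summaries differ: `dataset_1` and `dataset_2` define the same partition, so their ARI is 1, yet only two genes carry the same label in both, because the ARI ignores label names while the count in panel (c) does not.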

**Fig. 3.**
(a) Densities fitted to the sampled values of the dataset-agreement parameters. (b) Heatmap representation of the matrix whose (i, j)-entry is the posterior mean value of the agreement parameter for datasets i and j.

**Fig. 4.**
(a) Pairwise fusion probabilities for the 31 genes identified as fused across the ChIP and PPI datasets in the ‘Expression + ChIP + PPI’ example. Colours correspond to fused clusters and the dashed line indicates the fusion threshold. (b) Three-way fusion probabilities for the same 31 genes. Genes that do not exceed the fusion threshold have white bars. (c) The expression profiles for genes identified as fused according to the ChIP and PPI datasets. The coloured lines indicate genes that are also fused across the expression dataset.
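The fusion probabilities in Fig. 4 can be read as posterior probabilities that a gene receives the same component label in every dataset of a given set. A minimal sketch of one way to estimate them from posterior samples is given below. It assumes each sample stores one label vector per dataset, and it uses a threshold of 0.5, which is an assumption made for the example because the caption does not give the value. The names `fusion_probabilities` and `fused_genes`, the dataset keys and the four toy samples are all hypothetical.

```python
def fusion_probabilities(draws, datasets):
    """Fraction of posterior draws in which each gene has one shared label.

    draws:    list of dicts mapping a dataset name to a list of gene labels
    datasets: names of the datasets that must all agree (two or more)
    """
    n_genes = len(draws[0][datasets[0]])
    counts = [0] * n_genes
    for draw in draws:
        for g in range(n_genes):
            first = draw[datasets[0]][g]
            if all(draw[d][g] == first for d in datasets[1:]):
                counts[g] += 1
    return [count / len(draws) for count in counts]


def fused_genes(probabilities, threshold=0.5):
    """Indices of genes whose fusion probability exceeds the assumed threshold."""
    return [g for g, p in enumerate(probabilities) if p > threshold]


# Four toy posterior draws for three genes (illustrative values only).
draws = [
    {"chip": [0, 1, 2], "ppi": [0, 1, 0], "expr": [0, 2, 2]},
    {"chip": [0, 1, 1], "ppi": [0, 2, 1], "expr": [0, 2, 0]},
    {"chip": [1, 0, 2], "ppi": [1, 0, 2], "expr": [1, 1, 2]},
    {"chip": [0, 1, 2], "ppi": [0, 1, 1], "expr": [1, 0, 2]},
]

pairwise = fusion_probabilities(draws, ["chip", "ppi"])
three_way = fusion_probabilities(draws, ["chip", "ppi", "expr"])
```

Because three-way agreement implies pairwise agreement, a gene's three-way fusion probability can never exceed its pairwise one. In the toy draws, genes 0 and 1 pass the assumed threshold for the ChIP and PPI pair, while only gene 0 remains fused once the expression dataset is added.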

Grants and funding

G0902104/MRC_/Medical Research Council/United Kingdom
