Bioinformatics. 2012 Dec 15;28(24):3290-7.
doi: 10.1093/bioinformatics/bts595. Epub 2012 Oct 9.

Bayesian correlated clustering to integrate multiple datasets

Paul Kirk et al. Bioinformatics.

Abstract

Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct, but often complementary, information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets.
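
The abstract does not give the functional form of the agreement parameters, so the following Python sketch is only an illustration under stated assumptions. It assumes that each pair of datasets (k, l) has a non-negative parameter `phi[(k, l)]` that up-weights joint allocations in which an observation receives the same component label in both datasets. The names `joint_allocation_prior`, `agreement_probability`, `pi` and `phi` are hypothetical and are not taken from the source.

```python
from itertools import combinations, product


def joint_allocation_prior(pi, phi):
    """Joint prior over one observation's component labels in K datasets.

    pi  : list of K lists; pi[k][c] is the mixture weight of component c
          in dataset k (all datasets share the same number of components).
    phi : dict mapping (k, l) with k < l to a non-negative agreement
          parameter (assumed form: weight multiplied by 1 + phi when the
          labels in datasets k and l coincide).
    """
    n_datasets = len(pi)
    n_components = len(pi[0])
    weights = {}
    for labels in product(range(n_components), repeat=n_datasets):
        w = 1.0
        # Independent part: product of per-dataset mixture weights.
        for k in range(n_datasets):
            w *= pi[k][labels[k]]
        # Dependence part: reward agreeing labels between dataset pairs.
        for k, l in combinations(range(n_datasets), 2):
            if labels[k] == labels[l]:
                w *= 1.0 + phi.get((k, l), 0.0)
        weights[labels] = w
    total = sum(weights.values())
    return {labels: w / total for labels, w in weights.items()}


def agreement_probability(prior, k, l):
    """Prior probability that datasets k and l assign the same label."""
    return sum(p for labels, p in prior.items() if labels[k] == labels[l])


# Toy example: two datasets, two equally weighted components each.
pi = [[0.5, 0.5], [0.5, 0.5]]
independent = joint_allocation_prior(pi, {})
dependent = joint_allocation_prior(pi, {(0, 1): 3.0})
p_independent = agreement_probability(independent, 0, 1)
p_dependent = agreement_probability(dependent, 0, 1)
```

With every agreement parameter set to zero, the joint prior factorises into independent mixture models. Increasing a parameter shifts prior mass towards allocations on which the two datasets agree, which is the qualitative behaviour the abstract describes.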

Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI's performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons with other unsupervised data integration techniques, as well as with non-integrative approaches, demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods.

Figures

Fig. 1.
Graphical representation of three DMA mixture models. (a) Independent case. (b) The MDI model. In both (a) and (b), formula image denotes the formula image observation in dataset k and is generated by mixture component formula image. The prior probabilities associated with the distinct component allocation variables, formula image, are given in the vector formula image, which is itself assigned a symmetric Dirichlet prior with parameter formula image. The parameter vector, formula image, for component c in dataset k is assigned a formula image prior. In (b), we additionally have formula image parameters, each of which models the dependence between the component allocations of observations in dataset k and formula image
Fig. 2.
(a) The data for the six-dataset synthetic example, separated into seven clusters. (b) A representation of how the cluster labels associated with each gene vary from dataset to dataset. Genes are ordered so that the clustering of Dataset 1 is the one that appears coherent. (c) A table showing the number of genes having the same cluster labels in datasets i and j. (d) A heatmap depiction of the similarity matrix formed by calculating the ARI between pairs of datasets
Fig. 3.
(a) Densities fitted to the sampled values of formula image. (b) Heatmap representation of the matrix with formula image-entry formula image, the posterior mean value for formula image
Fig. 4.
(a) Pairwise fusion probabilities for the 31 genes identified as fused across the ChIP and PPI datasets in the ‘Expression + ChIP + PPI’ example. Colours correspond to fused clusters and the dashed line indicates the fusion threshold. (b) Three-way fusion probabilities for the same 31 genes. Genes that do not exceed the fusion threshold have white bars. (c) The expression profiles for genes identified as fused according to the ChIP and PPI datasets. The coloured lines indicate genes that are also fused across the expression dataset.
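
Panel (d) of Fig. 2 describes a similarity matrix built from the ARI (adjusted Rand index) between pairs of datasets. As a minimal sketch of how such a matrix could be computed from cluster labels, the Python below implements the standard pair-counting definition of the ARI using only the standard library. The function names and the toy labelings are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    # Count item pairs that share a cluster in both, in A only, in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are trivial
    return (both - expected) / (maximum - expected)


def ari_matrix(labelings):
    """Symmetric matrix of pairwise ARI values, one row per dataset."""
    return [[adjusted_rand_index(x, y) for y in labelings] for x in labelings]


# Toy labelings for three hypothetical datasets of four genes each.
dataset_1 = [0, 0, 1, 1]
dataset_2 = [1, 1, 0, 0]  # same partition as dataset_1, labels swapped
dataset_3 = [0, 1, 0, 1]  # a different partition
similarity = ari_matrix([dataset_1, dataset_2, dataset_3])
```

The ARI ignores the label names themselves, so `dataset_1` and `dataset_2` score 1, while partitions that agree less often than chance would predict score below 0.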
