Philos Trans A Math Phys Eng Sci. 2014 Apr 21;372(2016):20130136.
doi: 10.1098/rsta.2013.0136. Print 2014 May 28.

On integrating multi-experiment microarray data

Georgia Tsiliki et al. Philos Trans A Math Phys Eng Sci.

Abstract

With the extensive use of microarray technology as a potential prognostic and diagnostic tool, the comparability and reproducibility of results obtained on different platforms are of interest. Integrating such datasets can yield more informative results that draw on numerous datasets and microarray platforms. We developed a novel integration technique for microarray gene-expression data derived from different studies, for the purpose of two-way Bayesian partition modelling, which estimates co-expression profiles within subsets of genes and between biological samples or experimental conditions. The suggested methodology transforms disparate gene-expression data onto a common probability scale to obtain inter-study-validated gene signatures. We evaluated the performance of our model using artificial data. Finally, we applied our model to six publicly available cancer gene-expression datasets and compared our results with those of well-known integrative microarray data methods. Our study shows that the suggested framework can alleviate the limited sample size problem while achieving high accuracy by integrating multi-experiment data.

Keywords: composite likelihood; gene expression; integrative genomics; partition modelling.

Figures

Figure 1.
Illustration of the model-based integration scheme for multi-platform gene-expression data. The unified data X are considered, consisting of the set of common genes and different biological samples of the same pathology. In step (a), gene-expression values are assigned to biclusters based on the binary indicator matrix R, where, for example, yellow values denote 0 and blue values denote 1. In step (b), we show how blue biclusters in A that are surrounded by yellow biclusters, and vice versa, are considered to form five individual biclusters, where the different colours denote the different normal distributions. In step (c), we select at random the dark blue bicluster to be the C0 bicluster. The remaining biclusters follow different normal distributions, denoted by the different colours.
Figure 2.
Correspondence at the top (CAT) and integration-driven discovery rate (IDR) plots on simulated data. (a) CAT plot showing the proportions of genes in common between the reported and the 500 true differentially expressed genes on simulated data. Results are shown for S1, S2, the dataset integrated using our methodology (S1+S2), and the GMM-IG method. A randomly selected list of 500 genes is also included (black line). (b) The IDR plot for the proportions of genes that were found to be differentially expressed only in the integrated data S1+S2 using our methodology. (c) The IDR plot for the proportions of genes that were found to be differentially expressed only in the integrated data using the GMM-IG framework. In all cases, differentially expressed genes are ranked by corrected Welch t-test p-values. (Online version in colour.)
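The CAT curve in panel (a) can be illustrated with a minimal sketch. It assumes the true differentially expressed genes form an unranked set and that the reported genes are already ordered from most to least significant; the function name `cat_curve` and its arguments are hypothetical and are not taken from the paper.

```python
def cat_curve(reported_ranking, true_genes, max_size=None):
    """Return, for each list size k, the proportion of the top-k reported
    genes that also belong to the set of true genes.

    reported_ranking: gene identifiers ordered from most to least significant
                      (assumed to be ranked already, e.g. by corrected p-value).
    true_genes:       iterable of truly differentially expressed genes
                      (assumed unranked in this sketch).
    max_size:         largest list size to evaluate (defaults to the full list).
    """
    true_set = set(true_genes)
    if max_size is None:
        max_size = len(reported_ranking)

    proportions = []
    hits = 0
    for k, gene in enumerate(reported_ranking[:max_size], start=1):
        if gene in true_set:
            hits += 1
        # Proportion in common among the top k reported genes.
        proportions.append(hits / k)
    return proportions


if __name__ == "__main__":
    ranking = ["g1", "g2", "g3", "g4"]
    truth = {"g1", "g3"}
    print(cat_curve(ranking, truth))
```

Plotting the returned proportions against k gives one line of a CAT plot; a randomly ordered gene list passed through the same function gives a baseline comparable to the black line in panel (a).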
Figure 3.
IDR plot for the proportions of genes that were found to be differentially expressed only in the integrated breast cancer data. IDR was computed for the dataset integrated using our methodology [formula image], relative to the threshold imposed on the corrected Welch t-test p-values. IDR plots for the XPN and GMM-IG transformation methods are also shown. (Online version in colour.)
Figure 4.
IDR plot for the proportions of genes that were found to be differentially expressed only in the integrated lung cancer data. IDR was computed for the dataset integrated using our methodology [formula image], relative to the threshold imposed on the ANOVA p-values. IDR plots for the XPN and GMM-IG transformation methods are also shown. (Online version in colour.)
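The IDR shown in the plots above can be sketched as follows. The captions only state that IDR is the proportion of genes found to be differentially expressed only in the integrated data; dividing by the number of genes significant in the integrated data is an assumption of this sketch, as are the function name, the dictionary inputs, and the `<=` comparison against the threshold.

```python
def integration_driven_discovery_rate(p_integrated, p_individual_studies, threshold):
    """Proportion of genes significant in the integrated data that are not
    significant in any individual study at the same p-value threshold.

    p_integrated:          dict mapping gene -> corrected p-value (integrated data).
    p_individual_studies:  list of dicts, one per study, gene -> corrected p-value.
    threshold:             p-value cut-off applied to every dataset.
    """
    significant_integrated = {g for g, p in p_integrated.items() if p <= threshold}
    if not significant_integrated:
        # No discoveries at this threshold; report 0 rather than dividing by zero.
        return 0.0

    significant_in_any_study = set()
    for study in p_individual_studies:
        significant_in_any_study.update(g for g, p in study.items() if p <= threshold)

    # Genes discovered only after integration.
    only_integrated = significant_integrated - significant_in_any_study
    return len(only_integrated) / len(significant_integrated)


if __name__ == "__main__":
    integrated = {"a": 0.01, "b": 0.02, "c": 0.50, "d": 0.03}
    study_1 = {"a": 0.01, "b": 0.20, "c": 0.60, "d": 0.40}
    study_2 = {"a": 0.30, "b": 0.04, "c": 0.70, "d": 0.20}
    print(integration_driven_discovery_rate(integrated, [study_1, study_2], 0.05))
```

Evaluating the function over a grid of thresholds and plotting the result against the threshold yields a curve of the kind shown in the IDR plots.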
