Bioinformatics. 2015 Jan 15;31(2):209-15.

doi: 10.1093/bioinformatics/btu518. Epub 2014 Sep 29.

Hybrid Bayesian-rank integration approach improves the predictive power of genomic dataset aggregation

Marcus A Badgeley¹, Stuart C Sealfon¹, Maria D Chikina¹

Affiliation

¹ Department of Neurology, Mount Sinai School of Medicine, New York, NY 10029 and Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA.

PMID: 25266226
PMCID: PMC4287939
DOI: 10.1093/bioinformatics/btu518

Marcus A Badgeley et al. Bioinformatics. 2015.

Abstract

Motivation: Modern molecular technologies allow the collection of large amounts of high-throughput data on the functional attributes of genes. Often, multiple technologies and study designs are used to address the same biological question, such as which genes are overexpressed in a specific disease state. Consequently, there is considerable interest in methods that can integrate across datasets to present a unified set of predictions.

Results: An important aspect of data integration is accounting for the fact that datasets may differ in how accurately they capture the biological signal of interest. While many methods exist to address this problem, they always rely either on dataset-internal statistics, which reflect data structure and not necessarily biological relevance, or on external gold standards, which may not always be available. We present a new rank aggregation method for data integration that requires neither external standards nor internal statistics but relies on Bayesian reasoning to assess dataset relevance. We demonstrate that our method outperforms established techniques and significantly improves the predictive power of rank-based aggregations. We show that our method, which does not require an external gold standard, provides reliable estimates of dataset relevance and allows the same set of data to be integrated differently depending on the specific signal of interest.

Availability: The method is implemented in R and is freely available at http://www.pitt.edu/~mchikina/BIRRA/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
We simulate a noisy rank aggregation task with two types of datasets. For 10 ‘signal’ datasets, the values for 50 differentially expressed genes are drawn from an $N(1,1)$ distribution, while the values for 950 background genes are drawn from an $N(0,1)$ distribution. For 30 ‘noise’ datasets, the values for all genes are drawn from the same distribution. We aggregate the data using different rank aggregation methods and compare the results with those obtained with an optimal naive Bayes classifier (i.e. one using the exact conditional distributions). The BIRRA algorithm outperforms the other rank aggregation methods, producing results that fall between those of the established methods and the optimal naive Bayes. AUC values: Mean Ranks 0.82, RRA 0.78, Stuart 0.85, BIRRA 0.91, Naive Bayes 0.99.
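The simulation setup in this caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the noise datasets are assumed to draw every gene from $N(0,1)$, and only the simplest baseline (mean ranks) is shown, scored by AUC against the known differentially expressed genes.

```python
import random

def simulate(n_signal=10, n_noise=30, n_de=50, n_bg=950, seed=0):
    """Return (datasets, labels); each dataset is a list of per-gene values."""
    rng = random.Random(seed)
    labels = [1] * n_de + [0] * n_bg  # the first n_de genes are truly DE
    datasets = []
    for _ in range(n_signal):
        # Signal datasets: DE genes ~ N(1,1), background genes ~ N(0,1)
        datasets.append([rng.gauss(1.0 if lab else 0.0, 1.0) for lab in labels])
    for _ in range(n_noise):
        # Assumption: noise datasets draw every gene from N(0,1)
        datasets.append([rng.gauss(0.0, 1.0) for _ in labels])
    return datasets, labels

def to_ranks(values):
    """Rank 1 = largest value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def mean_rank_scores(datasets):
    """Mean-rank aggregation; the mean rank is negated so higher = better."""
    rank_lists = [to_ranks(d) for d in datasets]
    n = len(datasets)
    return [-sum(rl[g] for rl in rank_lists) / n
            for g in range(len(datasets[0]))]

def auc(scores, labels):
    """AUC from the Mann-Whitney U statistic (no tie correction)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = sum(r for r, i in enumerate(order, start=1) if labels[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

datasets, labels = simulate()
print(round(auc(mean_rank_scores(datasets), labels), 2))
```

The exact AUC depends on the random seed, so it will not necessarily match the value reported in the caption.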

**Fig. 2.**
Evolution of BIRRA-computed Bayes factors for the datasets evaluated in Figure 1. At each iteration, the BIRRA algorithm computes dataset-specific Bayes factors against the current working standard. Bayes factors are plotted in black for the ‘signal’ datasets and in gray for the ‘noise’ datasets. In just a few iterations, BIRRA successfully down-weights the noise datasets.
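The idea of iteratively scoring each dataset against a working standard can be illustrated with the loose sketch below. It is not the published BIRRA implementation: the rank binning, the pseudocounts, the prior fraction of positives, the fixed iteration count and the function names are all assumptions made for illustration. The working standard starts as the mean-rank aggregate; each dataset then gets a per-bin log Bayes factor for membership in the current top set, and genes are re-scored by summing those log Bayes factors.

```python
import math

def ranks_desc(values):
    """0-based ranks, 0 = largest value."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def bayes_rank_aggregate(datasets, prior_frac=0.05, n_bins=20,
                         n_iter=5, pseudo=1.0):
    """Return (gene scores, per-dataset per-bin log Bayes factors)."""
    n_genes = len(datasets[0])
    n_top = max(1, int(prior_frac * n_genes))
    rank_lists = [ranks_desc(d) for d in datasets]
    # Rank bin of every gene in every dataset (bin 0 = top-ranked genes)
    bins = [[r * n_bins // n_genes for r in rl] for rl in rank_lists]
    # Initial working standard: negated rank sum (equivalent to mean rank)
    score = [-sum(rl[g] for rl in rank_lists) for g in range(n_genes)]
    log_bf = []
    for _ in range(n_iter):
        # Current working standard: the top-scoring genes count as positives
        top = set(sorted(range(n_genes), key=lambda g: -score[g])[:n_top])
        log_bf = []
        for b in bins:
            pos = [pseudo] * n_bins  # smoothed bin counts for positives
            neg = [pseudo] * n_bins  # smoothed bin counts for negatives
            for g in range(n_genes):
                (pos if g in top else neg)[b[g]] += 1
            pos_tot, neg_tot = sum(pos), sum(neg)
            # log P(bin | positive) / P(bin | negative) for each bin
            log_bf.append([math.log((pos[k] / pos_tot) / (neg[k] / neg_tot))
                           for k in range(n_bins)])
        # Re-score genes: sum of log Bayes factors across datasets
        score = [sum(log_bf[d][bins[d][g]] for d in range(len(datasets)))
                 for g in range(n_genes)]
    return score, log_bf
```

In this sketch, a dataset whose top-ranked bin carries a large log Bayes factor agrees with the working standard, whereas an uninformative dataset has log Bayes factors near zero in every bin and so contributes little to the final gene scores.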

**Fig. 3.**
Varying the fraction of ‘noise’ datasets, we find that BIRRA provides the largest performance gains when the fraction of uninformative datasets is large, and that it performs as well as other methods when all the datasets carry signal.

**Fig. 4.**
Comparison of rank aggregation methods using a compendium of Parkinson's disease (PD) datasets. While there is no gold standard for gene expression changes associated with PD, we can judge the aggregation results based on how well the aggregated ranking reproduces the result of an independent study. To simulate this, we take a leave-one-out approach in which we aggregate all but one of the studies and use the remaining study as a gold standard (the top 500 genes are considered positive) to evaluate the aggregation results. We plot the resulting AUCs for the different aggregation methods. The horizontal line represents the 99% confidence threshold for the AUC being >0.5. We observe that leave-one-out aggregation is predictive for most of the datasets tested, resulting in significant AUCs. We also observe that in most cases BIRRA produces superior results. In particular, it improves our ability to predict blood expression changes (GSE6613) from the remaining datasets, which use brain tissue.
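The leave-one-out evaluation described in this caption can be sketched as follows. The function names and the toy data are hypothetical, and `aggregate` stands for any aggregation method that returns one score per gene (higher = more strongly supported); a simple mean-rank aggregator is used here only as a stand-in.

```python
import random

def auc(scores, labels):
    """AUC from the Mann-Whitney U statistic (no tie correction)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = sum(r for r, i in enumerate(order, start=1) if labels[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_rank_aggregate(datasets):
    """Stand-in aggregator: negated mean rank, so higher = better."""
    n_genes = len(datasets[0])
    totals = [0.0] * n_genes
    for d in datasets:
        order = sorted(range(n_genes), key=lambda i: -d[i])
        for r, i in enumerate(order, start=1):
            totals[i] += r
    return [-t / len(datasets) for t in totals]

def leave_one_out_auc(datasets, aggregate, top_k=500):
    """For each study, aggregate the others and score against its top_k genes."""
    results = []
    for i, held_out in enumerate(datasets):
        rest = datasets[:i] + datasets[i + 1:]
        agg = aggregate(rest)
        # The held-out study acts as the gold standard: top_k genes are positive
        top = set(sorted(range(len(held_out)),
                         key=lambda g: -held_out[g])[:top_k])
        labels = [1 if g in top else 0 for g in range(len(held_out))]
        results.append(auc(agg, labels))
    return results

# Toy data (hypothetical): four studies sharing one underlying signal
rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(200)]
studies = [[b + rng.gauss(0.0, 0.5) for b in base] for _ in range(4)]
print(leave_one_out_auc(studies, mean_rank_aggregate, top_k=20))
```

Passing the aggregator as a function makes it straightforward to compare several methods on the same held-out studies.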

**Fig. 5.**
Evaluating BIRRA-estimated dataset relevance against dataset statistical properties. We plot the sum of log Bayes factors against the maximum P-value of the top 50 DE genes (A) and the study sample size (B). We find that the BIRRA-estimated quality is unrelated to dataset-intrinsic statistical properties.

**Fig. 6.**
Transcription factor (TF) targets predicted from expression correlation. We aggregate the results of 10 different ES expression datasets to predict TF-target interactions and evaluate the result using a ChIPseq dataset. We find that for most TFs tested, BIRRA aggregation outperformed both the other methods and the best individual dataset.

**Fig. 7.**
Comparing BIRRA estimates with Bayes factors computed using the independent ChIPseq dataset, we find that the dataset weights estimated by BIRRA agree with the datasets' ability to recapitulate the ChIPseq signal. Importantly, datasets varied widely in their ability to recapitulate the ChIPseq interactions for different TFs, and the BIRRA-computed Bayes factors were consistent in each case.
