Database (Oxford). 2018 Jan 1;2018:bay011.
doi: 10.1093/database/bay011.

BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology

Kleanthi Lakiotaki et al. Database (Oxford).

Abstract

The biotechnology revolution generates a plethora of omics data at an exponential pace. Biological data mining therefore demands automatic, 'high quality' curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data that aims to promote and accelerate the reuse of public data. We applied the same preprocessing pipeline within each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets ready for downstream analysis, and automatically annotated them with disease-ontology terms. We also flag datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can be readily used in large-scale experiments and meta-analysis. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome's utility by presenting exploratory data analysis examples. We have also developed the BioDataome R package, available at https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/.
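As an illustration of what flagging datasets that share common samples could involve, the following is a minimal Python sketch, not BioDataome's actual implementation. It assumes each dataset is available as a set of sample identifiers; the dataset and sample names, and the functions `shared_sample_edges` and `connected_components`, are hypothetical. Datasets become nodes, an edge links any two datasets with at least one sample in common, and connected components group datasets that are linked directly or indirectly.

```python
from collections import defaultdict
from itertools import combinations


def shared_sample_edges(datasets):
    """Map each pair of datasets sharing >= 1 sample to the shared sample IDs."""
    # Invert the mapping: for every sample, which datasets contain it?
    owners_by_sample = defaultdict(set)
    for dataset_id, samples in datasets.items():
        for sample_id in samples:
            owners_by_sample[sample_id].add(dataset_id)

    # Every pair of datasets owning the same sample gets an edge.
    edges = defaultdict(set)
    for sample_id, owners in owners_by_sample.items():
        for a, b in combinations(sorted(owners), 2):
            edges[(a, b)].add(sample_id)
    return dict(edges)


def connected_components(nodes, edges):
    """Group datasets that are linked, directly or indirectly, by shared samples."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal from an unvisited dataset.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy input with made-up identifiers (not real BioDataome data).
toy = {
    "DS_A": {"S1", "S2", "S3"},
    "DS_B": {"S3", "S4"},
    "DS_C": {"S4", "S5"},
    "DS_D": {"S6"},
}
edges = shared_sample_edges(toy)
components = connected_components(toy, edges)
degree = {d: sum(d in pair for pair in edges) for d in toy}
```

In this toy input, DS_A and DS_C share no sample directly but fall in the same component through DS_B, which is why a component-level view can be more useful than pairwise checks when selecting independent datasets for meta-analysis.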

Figures

Figure 1.
Disease distribution of BioDataome’s datasets per species and measurement technology. Disease categories correspond to parent disease nodes according to D-O (http://disease-ontology.org/).
Figure 2.
Dataset distribution per disease category on the children nodes of D-O.
Figure 3.
Flowchart of dataset annotation process.
Figure 4.
Network of inter-dataset duplicate samples (top left). Each node represents a dataset, and edges connect datasets that share at least one sample. Node degree (top right) and component size (bottom right) distributions of the sample duplication network. The four largest maximal cliques (bottom). Orange marks clique nodes and blue the remaining datasets of each component.
Figure 5.
Datasets with (dark blue) and without (light blue) common samples for all arrays.
Figure 6.
Violin plots of RFC2 gene expression in chlamydia samples vs. all other samples in the GPL570 array (left plot) and in pleural disease samples vs. all other samples in the GPL570 array (right plot). The p-value in the chlamydia case is almost zero, indicating that the two distributions (green vs. orange) differ significantly. In the pleural disease case, the combined p-value of the two statistical tests (skewness, kurtosis) was 0.82, so the null hypothesis that the two distributions have similar shapes could not be rejected.
Figure 7.
Percentage of genes with statistically significantly different distribution among diseases. Colors indicate disease categories according to the D-O.
Figure 8.
GO enrichment analysis of the two gene sets (C1: genes with statistically significantly different distributions in all diseases; C2: genes with statistically significantly different distributions in at most 20 diseases). GO annotation was based on the Homo sapiens OrgDb object. The color gradient ranges from red to blue: red indicates low adjusted p-values (high enrichment), and blue indicates high adjusted p-values (low enrichment). Dot size corresponds to the count of ‘GeneRatio’.
Figure 9.
Cytokine–cytokine receptor interaction pathway: the KEGG pathway with the highest enrichment score on the gene set of differentiated genes in all diseases. Genes that belong to this gene set are highlighted in red. The pathway was visualized with the pathview R/Bioconductor package (40).
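The Figure 6 caption refers to a combined p-value from two tests (skewness, kurtosis) without saying how the two were combined. One common option is Fisher's method, sketched below in Python purely as an illustration; whether the authors used it is an assumption, and the function name `fisher_combined_p` is hypothetical. The method assumes the individual tests are independent: the statistic X = -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null, and for even degrees of freedom the tail probability has a closed form that needs only the standard library.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Assumption: this is one standard way to combine p-values, not
    necessarily the procedure used for the figures above.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    # half_stat is X / 2, where X = -2 * sum(ln p_i) is Fisher's statistic.
    half_stat = -sum(math.log(p) for p in p_values)

    # Chi-square survival function with 2k degrees of freedom:
    #   P(X' >= X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Two moderately large p-values combine into a non-significant result.
combined = fisher_combined_p([0.5, 0.5])
```

With two tests the formula reduces to p1·p2·(1 − ln(p1·p2)), so a single very small p-value can dominate the combination; that sensitivity is a known property of Fisher's method and one reason other combination rules are sometimes preferred.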

References

    1. Rung J., Brazma A. (2012) Reuse of public genome-wide gene expression data. Nat. Rev. Genet., 14, 89–99.
    2. Ferguson J. (2012) Description and annotation of biomedical data sets. J eSLIB, 1, 51–56.
    3. Hoopen P.T., Amid C., Buttigieg P.L. et al. (2016) Value, but high costs in post-deposition data curation. Database, 2016, 1–10.
    4. McQuilton P., Gonzalez-Beltran A., Rocca-Serra P. et al. (2016) BioSharing: curated and crowd-sourced metadata standards, databases and data policies in the life sciences. Database, 2016, 1–8.
    5. Taminau J., Steenhoff D., Coletta A. et al. (2011) inSilicoDb: an R/Bioconductor package for accessing human Affymetrix expert-curated datasets from GEO. Bioinformatics, 27, 3204–3205.