PeerJ. 2015 Mar 31:3:e791.
doi: 10.7717/peerj.791. eCollection 2015.

A reproducible approach to high-throughput biological data acquisition and integration

Daniela Börnigen et al. PeerJ.

Abstract

Modern biological research requires rapid, complex, and reproducible integration of multiple experimental results generated both internally and externally (e.g., from public repositories). Although large systematic meta-analyses are among the most effective approaches both for clinical biomarker discovery and for computational inference of biomolecular mechanisms, identifying, acquiring, and integrating relevant experimental results from multiple sources for a given study can be time-consuming and error-prone. To enable efficient and reproducible integration of diverse experimental results, we developed a novel approach for standardized acquisition and analysis of high-throughput and heterogeneous biological data. This allowed, first, novel biomolecular network reconstruction in human prostate cancer, which correctly recovered and extended the NFκB signaling pathway. Next, we investigated host-microbiome interactions. In less than an hour of analysis time, the system retrieved data and integrated six germ-free murine intestinal gene expression datasets to identify the genes most influenced by the gut microbiota, which comprised a set of immune-response and carbohydrate metabolism processes. Finally, we constructed integrated functional interaction networks to compare connectivity of peptide secretion pathways in the model organisms Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa.

Keywords: Data acquisition; Data integration; Heterogeneous data; High-throughput data; Meta-analysis; Reproducibility.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. ARepA is an extensible, modular Automated Repository Acquisition system for reproducible biological data acquisition and processing.
ARepA is a framework for reproducible biological data mining and analysis. It can retrieve heterogeneous data from multiple public repositories in a uniform environment and format, currently allowing configurable data access for any organism(s) to the Gene Expression Omnibus (GEO) (Barrett et al., 2011), IntAct (Kerrien et al., 2012), BioGRID (Stark et al., 2011), RegulonDB (Gama-Castro et al., 2011), STRING (Szklarczyk et al., 2011), Bacteriome (Su et al., 2008), and MPIDB (Goll et al., 2008) databases. Using ARepA includes the following steps: (i) user input, (ii) data processing, and (iii) output formatting. The input phase is the only aspect of the ARepA pipeline that requires direct user oversight. The user’s input to ARepA can be as simple as a list of organisms of interest; ARepA then uses this list as a query for recovering interactome network and gene expression data specific to those organisms. Advanced users also have the option of providing custom gene mapping files, metadata, and/or normalization schemes, as well as fine-tuning the list of data sources to be searched. The data processing phase is divided into a series of automated steps in which raw interactome network and gene expression data are downloaded, converted to a common gene-naming scheme, and normalized for between-dataset comparison. During this phase, integrated gene expression data are analyzed for co-expression relationships, which contributes an additional co-expression network to the final network output. All network data are provided for the user as text files, while all expression data and associated metadata are saved as individual text files and as an R data file. The bottom panel illustrates how generated data can be integrated by subsequent network (see prostate cancer and bacterial studies) or expression (see murine differential gene expression) meta-analysis.
For example, network integration is a convenient way to combine multiple datasets of different types and sources, such as co-expression, physical gene interactions, regulatory interactions, or posttranslational modifications, into one functional network.
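The idea of merging several edge lists into one functional network can be sketched as follows. This is a minimal illustration and not ARepA's integration method: the caption does not specify one, so the sketch assumes a simple unweighted average of edge weights, with absent edges counted as zero. The function name `integrate_networks` and the gene labels are hypothetical.

```python
from collections import defaultdict


def integrate_networks(networks):
    """Average edge weights across networks (assumption: plain mean).

    Each network is a dict mapping an undirected gene pair (a, b) to a
    weight in [0, 1]. A pair missing from a network contributes 0.
    """
    totals = defaultdict(float)
    for edges in networks:
        for (a, b), weight in edges.items():
            # Sort the pair so (a, b) and (b, a) refer to the same edge.
            totals[tuple(sorted((a, b)))] += weight
    count = len(networks)
    return {pair: total / count for pair, total in totals.items()}


# Hypothetical toy inputs: one network per evidence type.
coexpression = {("geneA", "geneB"): 0.8, ("geneB", "geneC"): 0.4}
physical = {("geneB", "geneA"): 1.0}
regulatory = {("geneC", "geneB"): 1.0, ("geneA", "geneC"): 0.5}

integrated = integrate_networks([coexpression, physical, regulatory])
for pair, weight in sorted(integrated.items()):
    print(pair, round(weight, 3))
```

Averaging keeps the sketch short; a real integration would likely weight each source by its reliability.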
Figure 2. MEN1 and ACBD6 associated with the NFκB signaling pathway in human prostate cancer.
High-confidence subgraph extracted from a functional network integrating ten prostate cancer-specific gene expression datasets from GEO (Table S1). This subnetwork was generated using a seed gene set of ten genes from the NFκB signaling pathway in BioCarta (blue circles). Nine genes (black circles) were immediately recovered that are also known to be involved in NFκB signaling. Additional genes represent candidates implicated in NFκB involvement during prostate cancer, in particular MEN1 and ACBD6.
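A seed-based subgraph query of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the procedure used for the figure: non-seed genes are ranked by their mean edge weight to the seed set, the top `top_k` are kept, and edges below `min_weight` are dropped. The function name, both parameters, and the toy gene labels are hypothetical.

```python
def seed_subgraph(edges, seeds, top_k=2, min_weight=0.5):
    """Return a subgraph around the seed genes (hypothetical scoring).

    edges: dict mapping a gene pair (a, b) to a weight.
    Non-seed genes are scored by mean edge weight to the seeds,
    counting missing edges as 0.
    """
    seeds = set(seeds)
    scores = {}
    for (a, b), weight in edges.items():
        for gene, other in ((a, b), (b, a)):
            if gene not in seeds and other in seeds:
                scores[gene] = scores.get(gene, 0.0) + weight
    # Rank by mean connection to the seed set; break ties by name.
    ranked = sorted(scores, key=lambda g: (-scores[g] / len(seeds), g))
    keep = seeds | set(ranked[:top_k])
    return {
        pair: weight
        for pair, weight in edges.items()
        if pair[0] in keep and pair[1] in keep and weight >= min_weight
    }


# Hypothetical toy network: s1 and s2 are seeds, x/y/z are candidates.
toy_edges = {
    ("s1", "s2"): 0.9,
    ("s1", "x"): 0.8,
    ("s2", "x"): 0.7,
    ("s1", "y"): 0.2,
    ("y", "z"): 0.9,
    ("s2", "z"): 0.1,
}
subgraph = seed_subgraph(toy_edges, seeds=["s1", "s2"], top_k=1)
print(sorted(subgraph))
```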
Figure 3. Differential expression meta-analysis of germ-free versus conventional mice.
ARepA metadata allowed the identification of six murine gene expression datasets with intestinal tissue from paired germ-free and conventional mice (Table S1). The automatically generated R expression sets were meta-analyzed using R/limma (Smith, 2005) and R/metafor (Viechtbauer, 2010) through a random-effects model, revealing the Ppar-α signaling pathway as one of several differentially regulated gene sets. Panel (A) presents the fold changes for all significantly differentially expressed genes from this pathway in individual datasets, and panels (B) and (C) show the corresponding forest plots for the Ppar-α and Rxr-α genes, which are consistently upregulated in these datasets.
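To illustrate what a random-effects model does when pooling one gene's effect across datasets, here is a minimal sketch. The caption does not say which between-study variance estimator was used, so the sketch assumes the DerSimonian-Laird estimator purely for brevity. The function name and the six input values are hypothetical and are not taken from the study.

```python
import math


def random_effects(effects, variances):
    """Pool per-dataset effects with a DerSimonian-Laird random-effects model.

    effects: per-dataset effect sizes (e.g., log2 fold changes).
    variances: their sampling variances.
    Returns (pooled effect, standard error, between-study variance tau^2).
    """
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    # Fixed-effect estimate, needed for the heterogeneity statistic Q.
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w * w for w in weights) / sum_w
    k = len(effects)
    # Method-of-moments estimate of tau^2, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    return pooled, math.sqrt(1.0 / sum_re), tau2


# Hypothetical log2 fold changes and variances for one gene in six datasets.
log_fc = [0.9, 1.2, 0.7, 1.5, 1.0, 0.8]
var_fc = [0.10, 0.15, 0.08, 0.20, 0.12, 0.09]
pooled, se, tau2 = random_effects(log_fc, var_fc)
print(f"pooled={pooled:.3f}, se={se:.3f}, tau2={tau2:.3f}")
```

A forest plot such as those in panels (B) and (C) draws each dataset's effect with its confidence interval alongside the pooled estimate.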
Figure 4. Integrated molecular networks for comparative microbial functional genomics.
ARepA allowed the retrieval of standardized gene expression and interaction data for three microbial species based on a shared gene identifier to assess functional differences in conserved and non-conserved secretion pathways. High-confidence subgraphs were extracted from species-specific integrated functional networks around genes from species-specific secretion pathways to identify highly functionally related gene clusters within each individual system. These subgraphs represent gene clusters of Sec and Tat genes in B. subtilis (A), sec, tat, and Type II genes in E. coli (B), and sec, tat, Type II, Type III, and Type VI genes in P. aeruginosa (C). From each of these species-specific molecular networks we recovered highly functionally related gene clusters and conserved and non-conserved components from the peptide secretion system.
Figure 5. Analysis and processing steps available for datasets from each data source.
The main steps of ARepA are divided into four components: (1) Configuration and data integration: optional user-provided information can be merged with default data/metadata from the repositories. This allows, for example, integration of expert-curated metadata with automatically annotated metadata. (2) Custom data processing, including the default and customizable gene mapping and metadata annotation, as well as processes for file format detection and conversion. (3) Data normalization: gene identifiers are standardized, gene expression levels are normalized (e.g., log-transformed), missing values are imputed using k-nearest neighbors, and duplicate entries are merged. (4) Data export: data file formats are normalized to tab-delimited text, and co-expression networks in text and binary formats are constructed. Gene expression datasets and automatically generated documentation are further compiled into an R data file.
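The normalization step (3) can be sketched on a toy expression matrix as below. This is an illustrative reading of the caption, not ARepA's code: it assumes a log2(x + 1) transform, nearest-neighbor imputation that averages the k most similar rows (root-mean-square distance over shared columns), and averaging of rows that share a gene identifier. All function names and values are hypothetical.

```python
import math


def log_transform(rows):
    """Apply log2(x + 1) to every observed value; None marks a missing value."""
    return [(g, [None if v is None else math.log2(v + 1.0) for v in vals])
            for g, vals in rows]


def _distance(r1, r2):
    """Root-mean-square distance over columns observed in both rows."""
    shared = [(a, b) for a, b in zip(r1, r2) if a is not None and b is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))


def knn_impute(rows, k=2):
    """Fill each missing value with the mean of the k nearest rows' values."""
    result = []
    for i, (gene, vals) in enumerate(rows):
        filled = list(vals)
        for j, value in enumerate(vals):
            if value is not None:
                continue
            donors = sorted(
                (_distance(vals, other), other[j])
                for m, (_, other) in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [x for d, x in donors[:k] if d < math.inf]
            if nearest:
                filled[j] = sum(nearest) / len(nearest)
        result.append((gene, filled))
    return result


def merge_duplicates(rows):
    """Average rows sharing a gene identifier (assumes no missing values remain)."""
    grouped = {}
    for gene, vals in rows:
        grouped.setdefault(gene, []).append(vals)
    return {g: [sum(col) / len(col) for col in zip(*group)]
            for g, group in grouped.items()}


# Hypothetical raw intensities: g1 appears twice, g2 has one missing value.
raw = [
    ("g1", [1.0, 3.0, 7.0]),
    ("g2", [1.0, None, 7.0]),
    ("g1", [3.0, 3.0, 7.0]),
    ("g3", [15.0, 15.0, 15.0]),
]
normalized = merge_duplicates(knn_impute(log_transform(raw), k=2))
print(normalized)
```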

References

    1. Affymetrix. Statistical algorithms description document. Santa Clara: Affymetrix Inc; 2002.
    2. Aitken JD, Gewirtz AT. Gut microbiota in 2012: toward understanding and manipulating the gut microbiota. Nature Reviews Gastroenterology and Hepatology. 2013;10:72–74. doi: 10.1038/nrgastro.2012.252.
    3. Aoyama T, Peters JM, Iritani N, Nakajima T, Furihata K, Hashimoto T, Gonzalez FJ. Altered constitutive expression of fatty acid-metabolizing enzymes in mice lacking the peroxisome proliferator-activated receptor alpha (PPARalpha). Journal of Biological Chemistry. 1998;273:5678–5684. doi: 10.1074/jbc.273.10.5678.
    4. Backhed F, Ding H, Wang T, Hooper LV, Koh GY, Nagy A, Semenkovich CF, Gordon JI. The gut microbiota as an environmental factor that regulates fat storage. Proceedings of the National Academy of Sciences of the United States of America. 2004;101:15718–15723. doi: 10.1073/pnas.0407076101.
    5. Baggerly KA, Coombes KR. Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics. 2009;3:1309–1334. doi: 10.1214/09-AOAS291.