PLoS One. 2017 Jun 27;12(6):e0179613.
doi: 10.1371/journal.pone.0179613. eCollection 2017.

A data analysis framework for biomedical big data: Application on mesoderm differentiation of human pluripotent stem cells

Benjamin Ulfenborg et al. PLoS One. 2017.

Abstract

The development of high-throughput biomolecular technologies has resulted in the generation of vast amounts of omics data at an unprecedented rate. This is transforming biomedical research into a big data discipline, where the main challenges relate to the analysis and interpretation of data into new biological knowledge. The aim of this study was to develop a framework for biomedical big data analytics, and to apply it to the analysis of transcriptomics time series data from early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. To this end, transcriptome profiling by microarray was performed on differentiating human pluripotent stem cells sampled on eleven consecutive days. The gene expression data was analyzed using the five-stage analysis framework proposed in this study, including data preparation, exploratory data analysis, confirmatory analysis, biological knowledge discovery, and visualization of the results. Clustering analysis revealed several distinct expression profiles during differentiation. Genes with an early transient response were strongly related to embryonic and mesendoderm development, for example CER1 and NODAL. Pluripotency genes, such as NANOG and SOX2, exhibited substantial downregulation shortly after onset of differentiation. Rapid induction of genes related to metal ion response, cardiac tissue development, and muscle contraction was observed around days five and six. Several transcription factors were identified as potential regulators of these processes, e.g. POU1F1, TCF4 and TBP for muscle contraction genes. Pathway analysis revealed temporal activity of several signaling pathways, for example the inhibition of WNT signaling on day 2 and its reactivation on day 4. This study provides a comprehensive characterization of biological events and key regulators of the early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. The proposed analysis framework can be used to structure data analysis in future research, both in stem cell differentiation and, more generally, in biomedical big data analytics.

Conflict of interest statement

Competing Interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: CA, KÅ, and CXA are employed by Takara Bio Europe AB. PS is employed by AstraZeneca, Gothenburg, Sweden. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

Fig 1. Data analysis framework.
The figure illustrates the general analysis framework proposed in the study. The five stages of the framework are shown to the left and the steps within each stage are indicated by boxes. The specific methodology applied in the present study is shown to the right.
Stage I: Data preparation. The raw microarray dataset was normalized with RMA, and genes with low expression or small variation in expression were removed.
Stage II: Exploratory analysis. The pre-processed dataset of 1,108 genes and 11 time points was subjected to k-means clustering with k = 10 and Pearson correlation as the distance measure (see section Stage I: data preparation for more details).
Stage III: Confirmatory analysis. Enrichment analysis was carried out for the genes in each k-means cluster to identify enriched Gene Ontology terms and transcription factors. Pathway analysis was performed with SPIA to infer pathway activity at each time point.
Stage IV: Visualization of results. Biological processes significantly enriched among the genes in each cluster were visualized with TreeMaps. In addition, the impact of gene expression changes on pathway activity was clarified with pathway maps.
Stage V: Knowledge discovery. The biological findings were incorporated into a roadmap that captures the main biological events and regulators of early hPSC differentiation towards the mesoderm lineage.
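As a rough illustration of the Stage II step described in the Fig 1 caption, the following Python sketch runs k-means with 1 - Pearson correlation as the distance. It is not the authors' code: the function names, the toy profiles, the iteration cap and the deterministic starting centroids (`init_indices`) are assumptions made for this example, and the study itself used k = 10 on 1,108 genes.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # flat profiles carry no shape information
    return cov / (sx * sy)


def kmeans_correlation(profiles, init_indices, max_iter=50):
    """K-means where distance = 1 - Pearson correlation.

    init_indices (hypothetical parameter) picks the starting centroids so the
    example is reproducible; real runs usually start from random centroids.
    """
    centroids = [list(profiles[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for p in profiles:
            distances = [1 - pearson(p, c) for c in centroids]
            new_labels.append(distances.index(min(distances)))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy expression profiles (invented values): three rising, three falling.
toy_profiles = [
    [1, 2, 3, 4], [10, 20, 30, 40], [5, 6, 7, 9],
    [4, 3, 2, 1], [40, 30, 20, 10], [9, 7, 6, 5],
]
labels = kmeans_correlation(toy_profiles, init_indices=(0, 3))
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Genes are grouped by the direction of their time course rather than by absolute expression level, which is the behavior the caption attributes to correlation-based clustering.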
Fig 2. Cluster profiles of k-means clusters.
(A) Pearson correlation and (B) Euclidean distance as the distance measure. Each line represents a gene, and color is used to distinguish individual genes in dense clusters. Euclidean clusters appear denser, with genes grouped together based on height along the Y-axis (gene expression). The clustering based on Pearson correlation appears sparser, with genes grouped together based on the shape of their expression profiles over time.
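The contrast between the two distance measures in Fig 2 can be seen on a small example. The sketch below is illustrative only: the three profiles are invented, and the function names are assumptions rather than anything taken from the study.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance; sensitive to absolute expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation; sensitive to profile shape, not level."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)  # assumes neither profile is constant


low_rising = [1, 2, 3, 4]        # invented profile
high_rising = [11, 12, 13, 14]   # same shape, higher level
low_falling = [4, 3, 2, 1]       # same level, opposite shape

# Euclidean distance pairs the two low-level genes,
# Pearson distance pairs the two rising genes.
print(euclidean_distance(low_rising, high_rising))  # 20.0
print(euclidean_distance(low_rising, low_falling))
print(pearson_distance(low_rising, high_rising))
print(pearson_distance(low_rising, low_falling))
```

Under Euclidean distance the rising and falling low-level genes are closest, whereas under Pearson distance the two rising genes are closest despite their different expression levels.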
Fig 3. Temporal expression of selected mesoderm and cardiac markers.
Expression profiles of known marker genes have been organized into six groups based on when peaks in expression are observed. Expression values have been normalized to the maximum expression value for each marker.
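The per-marker normalization and peak-based grouping mentioned in the Fig 3 caption could be sketched as follows. This is a minimal example under stated assumptions: expression values are taken to be positive, the marker names and numbers are invented, and `normalize_to_max` and `peak_day` are hypothetical helper names.

```python
def normalize_to_max(profile):
    """Scale a profile so that its maximum value becomes 1.0."""
    peak = max(profile)
    if peak <= 0:
        raise ValueError("expected at least one positive expression value")
    return [value / peak for value in profile]


def peak_day(profile):
    """Index of the time point with the highest expression (first one on ties)."""
    return profile.index(max(profile))


# Invented expression values over four time points (day 0 to day 3).
markers = {
    "marker_a": [2.0, 8.0, 4.0, 1.0],
    "marker_b": [1.0, 2.0, 5.0, 10.0],
}

for name, profile in markers.items():
    print(name, normalize_to_max(profile), "peaks at day", peak_day(profile))
```

Dividing by the maximum puts markers with very different absolute expression on a common 0 to 1 scale, so that the timing of their peaks can be compared directly.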
Fig 4. TreeMap visualization of significantly enriched biological processes.
(A) Cluster 1, (B) cluster 4 and (C) cluster 5. Boxes represent terms and the size of the boxes reflects the significance of the corresponding p-value. Terms are grouped into overarching terms, which are visualized in different colors. The majority of terms for clusters 1, 4 and 5 are related to embryonic development.
Fig 5. TreeMap visualization of significantly enriched biological processes.
(A) Cluster 7, (B) cluster 9 and (C) cluster 10. Boxes represent terms and the size of boxes reflects the significance of the corresponding p-value. Terms are grouped into overarching terms, which are visualized in different colors. Many terms for clusters 7, 9 and 10 are related to muscle development and ion transport.
Fig 6. Relative pathway perturbation profiles.
Each line represents a pathway identified as significant in the SPIA analysis. The Y-axis shows the tA score from SPIA, or total accumulated pathway perturbation. This is calculated as the sum of all pathway fold changes following propagation of fold changes based on pathway topology. A large positive number indicates that many genes in the pathway are upregulated, while a negative number indicates the opposite. Zero indicates that the pathway has the same activity as the baseline condition, i.e. day 0.
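To make the idea of propagating fold changes along pathway topology more concrete, here is a simplified Python sketch of a perturbation-accumulation score in the spirit of the tA score. It is an assumption-laden toy, not the SPIA implementation: it only handles acyclic pathways processed in a given upstream-to-downstream order, the three-gene pathway and its log fold changes are invented, and the function name is hypothetical.

```python
def total_accumulation(log_fc, edges, order):
    """Sum of accumulated perturbations over all genes in a toy pathway.

    log_fc: gene -> log fold change versus baseline
    edges:  (source, target, beta) with beta = +1 (activation) or -1 (inhibition)
    order:  genes listed so that every source precedes its targets (acyclic only)
    """
    # Number of downstream genes per source, used to split its perturbation.
    n_down = {}
    for source, _target, _beta in edges:
        n_down[source] = n_down.get(source, 0) + 1

    perturbation = {}
    for gene in order:
        inherited = sum(
            beta * perturbation[source] / n_down[source]
            for source, target, beta in edges
            if target == gene
        )
        # Own fold change plus what is passed down from upstream genes.
        perturbation[gene] = log_fc[gene] + inherited

    # Accumulation = perturbation beyond the gene's own fold change.
    return sum(perturbation[gene] - log_fc[gene] for gene in order)


# Invented toy pathway: A activates B, B inhibits C.
toy_log_fc = {"A": 2.0, "B": 1.0, "C": -0.5}
toy_edges = [("A", "B", 1), ("B", "C", -1)]
score = total_accumulation(toy_log_fc, toy_edges, order=["A", "B", "C"])
print(score)  # -1.0
```

In this toy case B inherits +2.0 from A, and C inherits -3.0 from B through the inhibitory edge, so the accumulated perturbation is 2.0 - 3.0 = -1.0. If all fold changes are zero, as at the baseline time point, the score is zero.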
Fig 7. The canonical WNT signaling pathway in KEGG visualized with Pathview.
Genes in the pathway are represented with rectangles. Gene expression is indicated with a color code, where red represents high expression and gray represents expression close to 0. (A) Expression at day 1, (B) expression at day 2, (C) expression at day 3, (D) expression at day 4. This pathway visualization reveals the activation of the WNT inhibitors CER1, DKK1 and DKK4 on days 2 and 3 (indicated in red boxes).
Fig 8. Gene-transcription factor interaction network.
Genes in the WNT signaling pathway are represented by green circles and predicted transcription factors by blue octagons. When analyzed together with global transcriptome data, this network can shed light on the regulatory mechanisms behind pathway perturbation. Red edges indicate that genes activate the WNT pathway, while blue edges denote inhibitory genes.
Fig 9. A roadmap for early differentiation of hPSCs towards the mesoderm lineage and cardiac specification.
Time points are shown along the black horizontal line. Enrichment analysis results for clusters of genes are shown below, where each colored bar contains one or two representative terms from the TreeMap of a cluster (C1 denotes cluster 1). Transcription factors enriched for each cluster are shown in bold, and examples of genes in the cluster are given below the transcription factors. Vertical arrows beside the colored bars indicate changes in the expression level of genes at different time points. For example, genes in cluster 4 are activated at day 1 and inactivated by day 4. Changes in signaling pathway activity are shown at the top, where arrows indicate the time points at which different pathways are inhibited and activated.
