Visualization, benchmarking and characterization of nested single-cell heterogeneity as dynamic forest mixtures

Benedict Anchang¹, Raul Mendez-Giraldez¹, Xiaojiang Xu², Trevor K Archer³, Qing Chen³, Guang Hu³, Sylvia K Plevritis⁴, Alison Anne Motsinger-Reif¹, Jian-Liang Li²

Affiliations

¹ Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Stanford, California, USA.
² Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, Stanford, California, USA.
³ Epigenetics & Stem Cell Biology Laboratory/Chromatin & Gene Expression Group, National Institute of Environmental Health Sciences, Stanford, California, USA.
⁴ Department of Biomedical Data Science, Center for Cancer Systems Biology, Stanford University, Stanford, California, USA.

PMID: 35192692
PMCID: PMC8921621
DOI: 10.1093/bib/bbac017

Brief Bioinform. 2022 Mar 10;23(2):bbac017.

Abstract

A major topic of debate in developmental biology centers on whether development is continuous, discontinuous, or a mixture of both. Pseudo-time trajectory models, optimal for visualizing cellular progression, model cell transitions as continuous state manifolds. They do not explicitly model real-time, complex, heterogeneous systems and are challenging to benchmark against temporal models. We present a data-driven framework that addresses these limitations, taking temporal single-cell data collected at discrete time points as input and producing a mixture of dependent minimum spanning trees (MSTs) as output, denoted as dynamic spanning forest mixtures (DSFMix). DSFMix uses decision-tree models to select genes that account for variations in multimodality, skewness and time. The genes are subsequently used to build the forest using tree agglomerative hierarchical clustering and dynamic branch cutting. We first motivate the use of forest-based algorithms compared to single-tree approaches for visualizing and characterizing developmental processes. We next benchmark DSFMix against pseudo-time and temporal approaches in terms of feature selection, time correlation, and network similarity. Finally, we demonstrate how DSFMix can be used to visualize, compare and characterize complex relationships during biological processes such as epithelial-mesenchymal transition, spermatogenesis, stem cell pluripotency, early transcriptional response to hormones and immune response to coronavirus disease. Our results indicate that the expression of genes during normal development exhibits a high proportion of non-uniformly distributed profiles that are mostly right-skewed and multimodal, the latter being a characteristic of major steady states during development. Our study also identifies and validates gene signatures driving complex dynamic processes during somatic or germline differentiation.

Keywords: cell differentiation; forest mixtures; minimum spanning tree; multimodality; nested models; single-cell trajectory analysis.

Published by Oxford University Press 2022.

Figures

**Figure 1**
DSFMix takes as input time-course or staged single-cell data and outputs a dynamic spanning forest (DSF). **(A)** Three-dimensional single-cell time-course input data for DSFMix comprising two time points, or stages of development. The green cells in the first time point (t1) occupy a different region in the second time point (t2) and differentiate into light green, red, and yellow cells. **(B)** A binary decision tree and feature selection process representing a shape analysis to generate an optimal lineage marker set for enrichment analysis. The shape analysis step uses a predefined FDR to select variable markers based on the shapes of their marginal distributions. The step produces markers whose expression across cells is multimodal, unimodal but symmetrical, left-skewed, or right-skewed. **(C)** The enrichment analysis feature step selects markers for cluster specificity, time specificity and cluster-time interaction specificity using a boosted random forest regression model with binary and multinomial outcomes. **(D)** Minimum spanning tree derived using SPADE. **(E)** Tree agglomerative hierarchical clustering (TAHC) uses geodesic distances and Spearman correlation between all node pairs to produce a sorted dendrogram that represents the merging process of all nodes from the input tree. **(F)** A dynamic minimum spanning forest produced by minimizing distances between and within the TAHC clusters. The clusters are derived from a dynamic and iterative branch-cutting method based on the structure of the underlying dendrogram.
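
The tree-to-forest idea in panels D to F can be illustrated with a minimal sketch. It is not the DSFMix implementation: it builds a Euclidean minimum spanning tree with Kruskal's algorithm and then splits it by dropping long edges, a deliberately simpler stand-in for the TAHC dendrogram and dynamic branch cutting described in the caption. The function names, the `max_edge_length` threshold and the example cluster medians are all hypothetical.

```python
import math
from itertools import combinations


def minimum_spanning_tree(points):
    """Kruskal's algorithm on the complete Euclidean graph over `points`."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((weight, i, j))
    return tree


def cut_into_forest(n_nodes, tree, max_edge_length):
    """Drop tree edges longer than the threshold (an assumed, simplified
    cutting rule) and return the remaining connected components."""
    neighbours = {i: set() for i in range(n_nodes)}
    for weight, i, j in tree:
        if weight <= max_edge_length:
            neighbours[i].add(j)
            neighbours[j].add(i)
    seen, forest = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        forest.append(sorted(component))
    return forest


# Hypothetical cluster medians in a two-marker space: two separated groups.
medians = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
mst = minimum_spanning_tree(medians)
forest = cut_into_forest(len(medians), mst, max_edge_length=2.0)
print(forest)
```

With these toy medians the single long edge bridging the two groups is removed, leaving two subtrees. A fixed length threshold is only the simplest possible cutting rule; the caption describes a cut driven by the dendrogram structure instead.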

**Figure 2**
Analysis of the DSFMix feature selection step applied to various developmental processes. **(A)** Histograms (excluding zeros) showing examples of genes associated with maximum marginal spread and shape for spermdata (top), ipscdata (middle), and hormonedata (bottom). **(B)** Barplots showing the normalized marginal expression of genes during each biological process. A very high proportion of non-uniform distribution shapes, mostly right-skewed, is observed. **(C)** Distribution of enrichment genes associated with multimodality, unimodal and symmetrical, left skewness and right skewness for spermdata, ipscdata and emtdata. The multimodal genes are enriched the most during development.
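
As a rough illustration of how a marginal expression profile might be labelled left-skewed, right-skewed or symmetrical, the sketch below computes the moment-based sample skewness of the non-zero values of one gene. It is an assumption-laden simplification, not the DSFMix shape analysis: the `threshold` of 0.5 and the function names are hypothetical, zeros are dropped by analogy with the histograms in panel A, and neither the multimodality test nor the FDR control mentioned in Figure 1 is included.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 using population moments."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    if m2 == 0:
        return 0.0  # constant profile: treat as having no skew
    return m3 / m2 ** 1.5


def skew_label(expression, threshold=0.5):
    """Label one gene's profile; `threshold` is an arbitrary illustrative cut-off."""
    nonzero = [v for v in expression if v != 0]  # assumed: zeros are excluded
    if len(nonzero) < 3:
        return "undetermined"
    g1 = sample_skewness(nonzero)
    if g1 > threshold:
        return "right-skewed"
    if g1 < -threshold:
        return "left-skewed"
    return "symmetric"


# Hypothetical expression values for three genes across a handful of cells.
print(skew_label([0, 1, 1, 1, 2, 2, 3, 10]))   # long upper tail
print(skew_label([0, 10, 10, 10, 9, 9, 8, 1]))  # long lower tail
print(skew_label([1, 2, 3, 4, 5]))              # balanced around the mean
```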

**Figure 3**
DSFMix benchmarking analysis with respect to time and network similarity. **(A–C)** Heatmaps representing hierarchical clustering of the pairwise distance matrices derived from Spearman correlations between the pseudo-time models tSpace (1), Monocle3 (2), PAGA (5) and Slingshot (6), the observed time (3) and the DSFMix (4) predicted time ordering for ipscdata, spermdata and emtdata, respectively. The highest similarity is observed between the DSFMix (4) predicted time and the observed time trend (3). **(D)** Box plots representing correlation trends with estimates (top) between the observed time trend and pseudo-time or predicted time for each method applied to the spermdata. Large variations in correlation are shown, with the highest correlation (0.67) associated with DSFMix. **(E)** Visualizations of underlying EMT trajectories on projected data by all 6 methods. Evidence of EMT-MET plasticity in terms of 2 independent trees is captured clearly by DSFMix. In panel E (iv), the size of the nodes within each subtree is proportional to the number of cells in that node, whereas the length of the edges reflects the Euclidean distance of the median expression. **(F)** Correlation network analysis comparing recently developed discrete temporal methods: (iii) CStreet, (iv) Tempora, (v) SCUBA, and the derived directed DSFMix network (ii) with the reference EMT network (i). Quantitative correlation measures in Table (vi) show that all methods except Tempora demonstrate a significant correlation structure with the reference at the 0.05 level using the Quadratic Assignment Procedure (QAP), with CStreet capturing the highest correlation.
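
The time-correlation comparison in panels A to D rests on Spearman correlation between an observed time label and a predicted ordering. A minimal sketch of that statistic follows, written in plain Python with average ranks for ties, since cells collected at the same time point share a label. The collection days and predicted values are hypothetical, and the function names are illustrative only.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: collection day per cell vs. a predicted ordering.
observed_day = [0, 0, 5, 5, 12, 12]
predicted_order = [0.1, 0.2, 0.5, 0.4, 0.9, 0.8]
rho = spearman(observed_day, predicted_order)
print(round(rho, 3))
```

Rank-based correlation is a natural fit here because pseudo-time values from different methods live on different scales, so only their ordering is comparable with the observed time.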

**Figure 4**
DSFMix analysis of spermatogenesis. **(A)** Forest comprising six MSTs colored by median staging times spanning approximately 80 days. Cells undergoing early spermatogenesis are colored in blue while late sperm formation is colored in red. **(B)** DSF plots highlighting the most dynamic genes that span individual subtrees during spermatogenesis. **(C and D)** Forest plots highlighting key genes that are regulated during branching stages of spermatogenesis. **(E)** Heatmap of similarity test statistics between all pairwise trees (1–6) in A based on a two-sample multivariate weighted edge count test. Blue represents identical trees, and red represents statistically significant dissimilar trees. Trees (2) and (3) have greater similarity with the major tree (1) compared to the rest. In all panels, the size of each node within each subtree is proportional to the number of cells in that node, whereas the length of the edges reflects the Euclidean distance of the mean expression between the 2 connected nodes.

**Figure 5**
DSFMix analysis on chemically induced pluripotent stem cell (CiPSC) reprogramming. **(A)** Heatmaps of gene signatures showing switch-like programs associated with the transition of MEFs to intermediate extraembryonic endoderm (XEN-like) cells at day 5 as well as embryonic stem cell (ESC) formation. **(B)** SPADE single tree highlighting uniqueness of CrxOS expression in one of its three terminal branches during iPSC reprogramming. **(C)** DSFMix forest comprising seven trees colored by the 12 timepoints corresponding to 3 stages, spanning ~21 days. Cells from stage I after induction were collected at days 5 and 12; cells at stage II were collected at days 8 and 12; and cells at stage III were collected at days 3, 6, 8, 10, 15, and 21. DSF subtrees capturing several linear dynamic lineages spanning complete Fibroblast-XEN-ESC lineages (1,2,5), Fibroblasts including two-cell (2C) embryonic-like cells (6), intermediate XEN-like cells (7), and early pluripotency (3,4) lineages. **(D)** DSF plot highlighting significant dynamic genes spanning individual subtrees during iPSC. **(E and F)** DSF plots highlighting key genes whose expression changes dynamically over time during CiPSC. **(G)** Heatmap of similarity related P-values between tree pairs in C. Trees (2) and (3) are significantly different from the other trees while trees (4) and (7) show strong similarity.
