Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Feb 5:13:RP97424.
doi: 10.7554/eLife.97424.

A model-based factorization method for scRNA data unveils bifurcating transcriptional modules underlying cell fate determination

Affiliations

A model-based factorization method for scRNA data unveils bifurcating transcriptional modules underlying cell fate determination

Jun Ren et al. Elife. .

Abstract

Manifold-learning is particularly useful to resolve the complex cellular state space from single-cell RNA sequences. While current manifold-learning methods provide insights into cell fate by inferring graph-based trajectory at cell level, challenges remain to retrieve interpretable biology underlying the diverse cellular states. Here, we described MGPfactXMBD, a model-based manifold-learning framework and capable to factorize complex development trajectories into independent bifurcation processes of gene sets, and thus enables trajectory inference based on relevant features. MGPfactXMBD offers a more nuanced understanding of the biological processes underlying cellular trajectories with potential determinants. When bench-tested across 239 datasets, MGPfactXMBD showed advantages in major quantity-control metrics, such as branch division accuracy and trajectory topology, outperforming most established methods. In real datasets, MGPfactXMBD recovered the critical pathways and cell types in microglia development with experimentally valid regulons and markers. Furthermore, MGPfactXMBD discovered evolutionary trajectories of tumor-associated CD8+ T cells and yielded new subtypes of CD8+ T cells with gene expression signatures significantly predictive of the responses to immune checkpoint inhibitor in independent cohorts. In summary, MGPfactXMBD offers a manifold-learning framework in scRNA-seq data which enables feature selection for specific biological processes and contributing to advance our understanding of biological determination of cell fate.

Keywords: bifurcation process; computational biology; factorization; human; manifold-learning; mixtures of Gaussian processes; mouse; scRNA-seq; systems biology.

PubMed Disclaimer

Conflict of interest statement

JR, YZ, YH, JY, HF, XL, JG, XS, QL No competing interests declared

Figures

Figure 1.
Figure 1.. Overview of MGPfact workflow.
The complete workflow comprises two major stages: minimum unbiased representative points (MURP) downsampling with preprocessed data and trajectory reconstruction. In the stage of trajectory reconstruction, the single-cell RNA sequencing (scRNA-seq) data were first factorized into independent bifurcation processes based on mixtures of Gaussian processes, which were then merged into a consensus trajectory.
Figure 2.
Figure 2.. Trajectory inference (TI) performance of state-of-the-art methods in 239 test datasets.
(a) Overall scores; (b) F1branches; (c) HIM; (d) cordist ; (e) wcorfeatures. All results are color-coded based on the trajectory types, with the black line representing the mean value. The ‘Overall’ assessment is calculated as the geometric mean of all four metrics.
Figure 2—figure supplement 1.
Figure 2—figure supplement 1.. The distribution of trajectory types among training set and test set.
The 339 datasets were randomly split into two groups, serving as the training set and test set, respectively. The training set consists of 100 datasets, while the testing set includes 239 datasets.
Figure 2—figure supplement 2.
Figure 2—figure supplement 2.. Robustness testing of the number of independent trajectories on 100 training datasets.
With L=3 set as the default, we tested the impact of different L values (1, 2, 4, and 5) on the prediction results. In all box plots, the asterisk represents the mean value, while the whiskers extend to the farthest data points within 1.5 times the interquartile range. Significance is denoted as follows: not annotated indicates non-significant; * P<0.05; ** P<0.01; *** P<0.001; two-sided paired Student’s T-tests.
Figure 2—figure supplement 3.
Figure 2—figure supplement 3.. MURP downsampling enhances the performance of MGPfact.
Trajectory inference was conducted by randomly selecting 20, 40, 60, 80, and 100 cells from the original dataset. The results were mapped back to the original data using a KNN graph structure to obtain the final predictions, which were then compared with those obtained through MURP downsampling. In all box plots, the asterisk represents the mean value, while the whiskers extend to the farthest data points within 1.5 times the interquartile range. Significance is denoted as follows: not annotated indicates non-significant; * P<0.05; ** P<0.01; *** P<0.001; two-sided paired Student’s T-tests.
Figure 2—figure supplement 4.
Figure 2—figure supplement 4.. Robustness analysis of consensus trees.
Random sampling is performed on the original data at different proportions of 60%, 70%, 80%, and 90%. The MGPfact consensus trajectory predictions are compared with that based on the original, unsampled data using HIM. In all box plots, the asterisk represents the mean value, while the whiskers extend to the farthest data points within 1.5 times the interquartile range.
Figure 3.
Figure 3.. Trajectory inference (TI) performance of state-of-the-art methods in 68 test datasets of real cell population.
(a) Overall scores; (b) F1branches; (c) HIM; (d) cordist; (e) wcorfeatures. All results are color-coded based on the trajectory types, with the black line representing the mean value for ranking all methods. The ‘Overall’ assessment is calculated as the geometric mean of all four metrics.
Figure 3—figure supplement 1.
Figure 3—figure supplement 1.. Trajectories identified by different methods on 3 real-world datasets, with reference structures being linear (a) dataset-id=real/silver/germline-human-female_li, bifurcation (b) dataset-id=real/silver/fibroblast-reprogramming_treutlein, and multifurcation (c) dataset-id=real/silver/oligodendrocyte-differentiation-subclusters_marques.
The first column of each row presents the actual cell development trajectory (‘ground truth’), with black lines representing the trajectory structure and the colored dots indicating cells at different stages. These elements highlight the main paths and key bifurcation points of cell differentiation as a reference for evaluating prediction results. Subsequent columns show the trajectory reconstruction results of MGPfact and other methods. The overall score for each method reflects its accuracy in trajectory inference relative to the gold standard, which is represented in the top-left figure.
Figure 4.
Figure 4.. MGPfact reconstructed the developmental trajectory of microglia, recovering known determinants of microglia fate.
(a-c) The inferred independent bifurcation processes with respect to the unique cell types (color-coded) of microglia development, where phase 0 corresponds to the state before bifurcation; and phases 1 and 2 correspond to the states post-bifurcation. Each colored dot represents a metacell of unique cell type defined by MURP. The most highly weighted regulons in each trajectory were labeled by the corresponding transcription factors (left panels). The HWG of each bifurcation process include a set of highly weighted genes, of which the expression levels differ significantly among phases 1, 2, and 3 (right panels). (d) The most highly weighted regulons influencing the three developmental trajectories of microglia. (e) The expression levels of the transcription factors of highly weighted regulons in each trajectory significantly differ among different phases. (f) The consensus developmental trajectory by merging the three bifurcation processes. Point 0 denotes the initial of differentiation, whereas the notion of ‘n-m’ denotes the m-th branch from the branching point n. Each colored circle represents a landmark (MURP) of the trajectory, showing the fraction of cell types. The transcription factors of highly weighted regulons in each bifurcation process were used to label each branching point. Particularly, PAM-T1 and PAM-T2 are the two newly defined subtypes of PAM. (g) Selected differently expressed genes between PAM-T1 and PAM-T2 (|logfc|>0.25, adjusted P-value <0.1) are shown by colored-dots corresponding to the mean expression levels in either cell type. The IDs validated marker genes for PAM are labeled in green. In all box plots, the horizontal line represents the median value, and the whisker extends to the furthest data point within 1.5 times the interquartile range. Significance is denoted as: ns, not significant, * P<0.05, ** P<0.01, *** P<0.001, **** P<0.0001, Wilcoxon rank-sum test.
Figure 5.
Figure 5.. Highly weighted genes (HWG) of the bifurcation processes of CD8+ T cells serve as reliable indicators for clinical outcome and immune checkpoint inhibitor (ICI) treatment response.
(a) Gene expression signatures (GES) corresponding to HWG in CD8+ T cells trajectory 1 and 2 in non-small cell lung cancer (NSCLC) predict overall survival of the TCGA-LUAD cohort. (b) GES corresponding to HWG in CD8+ T cells trajectory 1 in colorectal cancer (CRC) predict overall survival of the TCGA-COAD cohort. (c) ROC curve showing the weighted mean of HWG in Trajectories 1 and 2 in NSCLC significantly associated with ICI response across three independent studies. (d) ROC curve showing the weighted mean of HWG in trajectories 1 and 2 in CRC significantly associated with ICI response across four independent studies.
Figure 5—figure supplement 1.
Figure 5—figure supplement 1.. Differentiation process and determinants of CD8+ T cells from non-small cell lung cancer (NSCLC) environment.
(a) The independent trajectories, where phase 0 represents the juncture before bifurcation, while phases 1 and 2 represent the diverging branches post-bifurcation. (b) Proportions of cell types in different phases of independent trajectories; (c) The regulon weight of top regulons of two trajectories. (d) The differential expression of top transcription factors among different phases for each trajectory. In all box plots, the horizontal line represents the median value, and the whisker extends to the furthest data point within 1.5 times the interquartile range. Significance denoted as: *p<0.05, ****p<0.0001, two-sided unpaired Student’s t-test.
Figure 5—figure supplement 2.
Figure 5—figure supplement 2.. Differentiation process and determinants of CD8+ T cells from colorectal cancer (CRC) environment.
(a) The independent trajectories, where phase 0 represents the juncture before bifurcation, while phases 1 and 2 represent the diverging branches post-bifurcation. (b) Proportions of cell types in different phases of independent trajectories; (c) The regulon weight of top regulons of two trajectories. (d) The differential expression of top transcription factors among different phases for each trajectory. (e) Survival analysis of the patients from TCGA-COAD cohort. The patients are stratified by utilizing the branch event within trajectory 2 from MGPfact analysis. p-values are calculated using multivariate Cox regression, with HR representing the hazard ratio. (f) ROC curve for immune checkpoint inhibitor (ICI) treatment response associated with highly weighted genes related to Trajectories 2 in CRC across four independent studies. In all box plots, the horizontal line represents the median value, and the whisker extends to the furthest data point within 1.5 times the interquartile range. Significance denoted as: ns, not significant, *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001, two-sided unpaired Student’s t-test.
Figure 5—figure supplement 3.
Figure 5—figure supplement 3.. In non-small cell lung cancer (NSCLC), the weighted mean of highly weighted genes (HWG) from independent differentiation trajectories differs between phases (a, c) and immune checkpoint inhibitor (ICI) treatment groups.
(a-b) The HWG obtained from the first branching event. (c-d) The HWG obtained from the first branching event. In all box plots, the horizontal line represents the median value, and the whisker extends to the furthest data point within 1.5 times the interquartile range. Significance denoted as: ns, not significant, *p<0.05, two-sided unpaired Student’s t-test.
Figure 5—figure supplement 4.
Figure 5—figure supplement 4.. In colorectal cancer (CRC), the weighted mean of highly weighted genes (HWG) from independent differentiation trajectories differs between phases (a, c) and immune checkpoint inhibitor (ICI) treatment groups.
(a-b) The HWG obtained from the first branching event. (c-d) The HWG obtained from the first branching event. In all box plots, the horizontal line represents the median value, and the whisker extends to the furthest data point within 1.5 times the interquartile range. Significance denoted as: ns, not significant, *p<0.05, ****p<0.0001, two-sided unpaired Student’s t-test.
Figure 6.
Figure 6.. MGPfact serves as an effective approach for characterization of new cellular subtypes.
(a) The consensus trajectory of tumor-associated CD8+ T cells in non-small cell lung cancer (NSCLC) identified CD8-ZNF683-T1 and CD8-ZNF683-T2 as two subtypes of CD8-ZNF683, which are influenced by TBX21. (b) Selected differently expressed genes between CD8-ZNF683-T1 and CD8-ZNF683-T2 (|logfc|>0.25, adjusted p-value <0.1). (c) High expression of CD8-ZNF683-T1 signatures predicts good overall survival in the TCGA LUAD cohort (Methods). p-values were calculated through multivariate Cox regression analysis, and HR represents hazard ratio. (d) The consensus trajectory of tumor-associated CD8+ T cells in colorectal cancer (CRC) identified CD8-GZMK-T1 and CD8-GZMK-T2 as two subtypes of CD8-GZMK. (e) Selected differently expressed genes between CD8-GZMK-T1 and CD8-GZMK-T2 (|logfc|>0.25, adjusted p-value <0.1). (f) ROC curve showing high expression of CD8-GZMK-T1 signature associated with ICI treatment response in three independent studies. The consensus trajectory is formed by merging three bifurcation processes. Each colored circle represents a landmark (minimum unbiased representative points, MURP), indicating the cell type.
Figure 6—figure supplement 1.
Figure 6—figure supplement 1.. The CD8-GZMK-T1 scores between immune checkpoint inhibitor (ICI) treatment response groups (Methods).
In all box plots, the horizontal line represents the median value, and the whisker extends to the furthest data point within 1.5 times the interquartile range. Significance denoted as: ns, not significant, *p<0.05, two-sided unpaired Student’s t-test.
Author response image 1.
Author response image 1.. Comparison of MGPfact with scTour and TIGON in trajectory inference performance across 239 test datasets.
(a) Overall scores; (b) F1branches; (c) HIM; (d) cordist; (e) wcorfeatures. All results are color-coded based on the trajectory types, with the black line representing the mean value. The “Overall” assessment is calculated as the geometric mean of all four metrics.
Author response image 2.
Author response image 2.. Robustness testing of the number of MURP PCA components on 100 training datasets.
With the number of principal components (PCs) set to 3 by default; we tested the impact of different number of components (1-10) on the prediction results. In all box plots, the asterisk represents the mean value, while the whiskers extend to the farthest data points within 1.5 times the interquartile range. Significance is denoted as follows: not annotated indicates non-significant; * P < 0.05; ** P < 0.01; *** P < 0.001; two-sided paired Student’s T-tests.

Update of

  • doi: 10.1101/2024.04.02.587768
  • doi: 10.7554/eLife.97424.1
  • doi: 10.7554/eLife.97424.2

References

    1. Aerts S, Quan XJ, Claeys A, Naval Sanchez M, Tate P, Yan J, Hassan BA. Robust target gene discovery through transcriptome perturbations and genome-wide enhancer predictions in Drosophila uncovers a regulatory basis for sensory specification. PLOS Biology. 2010;8:e1000435. doi: 10.1371/journal.pbio.1000435. - DOI - PMC - PubMed
    1. Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, van den Oord J, Atak ZK, Wouters J, Aerts S. SCENIC: single-cell regulatory network inference and clustering. Nature Methods. 2017;14:1083–1086. doi: 10.1038/nmeth.4463. - DOI - PMC - PubMed
    1. Anderson SR, Roberts JM, Ghena N, Irvin EA, Schwakopf J, Cooperstein IB, Bosco A, Vetter ML. Neuronal apoptosis drives remodeling states of microglia and shifts in survival pathway dependence. eLife. 2022;11:e76564. doi: 10.7554/eLife.76564. - DOI - PMC - PubMed
    1. Auslander N, Zhang G, Lee JS, Frederick DT, Miao B, Moll T, Tian T, Wei Z, Madan S, Sullivan RJ, Boland G, Flaherty K, Herlyn M, Ruppin E. Robust prediction of response to immune checkpoint blockade therapy in metastatic melanoma. Nature Medicine. 2018;24:1545–1549. doi: 10.1038/s41591-018-0157-9. - DOI - PMC - PubMed
    1. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, Ginhoux F, Newell EW. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology. 2019;37:38–44. doi: 10.1038/nbt.4314. - DOI - PubMed

Substances

Associated data

LinkOut - more resources