BMC Bioinformatics. 2024 Dec 23;25(1):388.
doi: 10.1186/s12859-024-05988-z.

scEGOT: single-cell trajectory inference framework based on entropic Gaussian mixture optimal transport

Toshiaki Yachimura et al. BMC Bioinformatics.

Abstract

Background: Time-series scRNA-seq data have opened the door to elucidating cell differentiation, and in this context optimal transport theory has attracted much attention. However, critical issues remain in interpretability and computational cost.

Results: We present scEGOT, a comprehensive framework for single-cell trajectory inference, formulated as a generative model with high interpretability and low computational cost. Applied to the human primordial germ cell-like cell (PGCLC) induction system, scEGOT identified the PGCLC progenitor population and the bifurcation time of segregation. Our analysis shows that TFAP2A alone is insufficient for identifying PGCLC progenitors; NKX1-2 is also required. In addition, MESP1 and GATA6 are crucial for PGCLC/somatic cell segregation.

Conclusions: These findings shed light on the mechanism that segregates PGCLC from somatic lineages. Notably, scEGOT is not limited to scRNA-seq: it extends to general single-cell data such as scATAC-seq, and hence has the potential to revolutionize our understanding of such datasets and, in turn, of developmental biology.

Keywords: Epigenetic landscape; Gaussian mixture model; Optimal transport; Single-cell biology; Trajectory inference.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: The experiments involving hPGCLCs induced from hiPSCs were approved by the Institutional Review Board of Kyoto University and were performed in accordance with the guidelines of the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) of Japan. Consent for publication: This article does not include individual data. Competing interests: The authors declare no conflict of interest.

Figures

Fig. 1
Sketch of the framework of scEGOT. scEGOT is a trajectory inference method based on entropic Gaussian mixture optimal transport (EGOT) that extracts various local and global structures of cell differentiation from scRNA-seq data. scEGOT takes time-series scRNA-seq data as input and outputs the following six cell-differentiation structures: (i) cell state graphs: transitions between cell populations over time; (ii) cell velocity: velocity of cell differentiation in gene expression space; (iii) interpolation: generation of pseudo-scRNA-seq data at intermediate time points; (iv) animation: visualization of gene expression dynamics; (v) gene regulatory network (GRN): regulatory relationships between genes during transitions; and (vi) Waddington’s landscape: cell potency and a global view of the cell differentiation pathway
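The core idea named in the caption, entropic optimal transport between Gaussian mixtures, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Gaussian mixture parameters at two time points are already fitted, takes the squared 2-Wasserstein (Bures-Wasserstein) distance between Gaussian components as the transport cost, and solves the entropic problem with plain Sinkhorn iterations. All names, toy parameters, and the regularization strength `eps` are hypothetical.

```python
import numpy as np

def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def bures_wasserstein_sq(m0, c0, m1, c1):
    """Squared 2-Wasserstein distance between N(m0, c0) and N(m1, c1)."""
    root = psd_sqrt(c0)
    cross = psd_sqrt(root @ c1 @ root)
    return float(np.sum((m0 - m1) ** 2) + np.trace(c0 + c1 - 2.0 * cross))

def sinkhorn(a, b, cost, eps=1.0, n_iter=500):
    """Entropic OT plan with marginals a (rows) and b (columns)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)   # rescale to match column marginals
        u = a / (kernel @ v)     # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

# Toy mixtures at two time points (assumed already fitted; values are made up).
weights_t0 = np.array([0.5, 0.5])
means_t0 = np.array([[0.0, 0.0], [2.0, 0.0]])
covs_t0 = np.array([np.eye(2), np.eye(2)])
weights_t1 = np.array([0.3, 0.7])
means_t1 = np.array([[0.0, 1.0], [2.0, 1.0]])
covs_t1 = np.array([np.eye(2), 2.0 * np.eye(2)])

# Cost between every pair of Gaussian components.
cost = np.array([[bures_wasserstein_sq(means_t0[k], covs_t0[k],
                                       means_t1[l], covs_t1[l])
                  for l in range(len(weights_t1))]
                 for k in range(len(weights_t0))])

plan = sinkhorn(weights_t0, weights_t1, cost, eps=1.0)

# Row-normalized plan: fraction of each source cluster sent to each target
# cluster (one possible reading of a cluster-to-cluster "transport rate").
rates = plan / weights_t0[:, None]
print(np.round(plan, 4))
print(np.round(rates, 4))
```

Because the coupling is computed between mixture components rather than between individual cells, the plan has only as many entries as there are cluster pairs, which is what keeps this kind of formulation cheap and easy to read as a cell state graph.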
Fig. 2
Illustration of the connection between EGOT (Eq. 2) and continuous optimal transport (Eq. 4)
Fig. 3
Application of scEGOT to the human PGCLC induction system dataset and identification of differentiation pathways. A, B PCA plots of the PGCLC induction dataset. The cells are colored according to A experimental day and B cell type. The gray points in B are cells at the middle stages (days 0.5–1.5). C Verification of scEGOT interpolation: comparison between the reference distribution (day 1) and the scEGOT interpolation distribution generated from the datasets at days 0.5 and 1.5, on PCA coordinates. D Box plot of the silhouette scores over 100 trials of the scEGOT interpolation versus the source (day 0.5), reference (day 1), and target (day 1.5). E, F Cell state graphs on PCA coordinates and in a hierarchical layout. The colors of the edges denote the transport rates (wk,l/πik). G, H Volcano plots for the day 0.5 clusters (day 0.5–1 and day 0.5–2) and the day 1 clusters (day 1–1 and day 1–2). The horizontal and vertical axes show, respectively, the log2 fold change (where the fold change indicates the rate of variation of gene expression) and the negative common logarithm of the p-values calculated from the independent-samples t-test. The annotated points show cluster-specific genes with G |log2(fold change)| > 0.6 and -log10(p-value) > 150 and H |log2(fold change)| > 0.8 and -log10(p-value) > 25
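The volcano-plot quantities described in panels G and H can be computed along the following lines. This is a sketch under stated assumptions, not the authors' code: the fold change is taken here as the ratio of mean expression between two clusters (the caption does not give the exact definition), a small hypothetical pseudocount avoids division by zero, and the data are synthetic. The thresholds are the ones quoted for panel H.

```python
import numpy as np
from scipy import stats

def volcano_stats(expr_a, expr_b, pseudocount=1e-9):
    """Per-gene log2 fold change and -log10 p-value.

    expr_a, expr_b: arrays of shape (cells, genes) for two clusters.
    The p-values come from an independent-samples t-test per gene.
    """
    mean_a = expr_a.mean(axis=0)
    mean_b = expr_b.mean(axis=0)
    log2_fc = np.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    _, pvals = stats.ttest_ind(expr_a, expr_b, axis=0)
    # Clamp p-values that underflow to zero so the logarithm stays finite.
    neg_log10_p = -np.log10(np.maximum(pvals, np.finfo(float).tiny))
    return log2_fc, neg_log10_p

# Synthetic example: gene 0 is about twofold higher in cluster A.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[4.0, 2.0, 2.0], scale=0.2, size=(200, 3))
cluster_b = rng.normal(loc=[2.0, 2.0, 2.0], scale=0.2, size=(200, 3))

log2_fc, neg_log10_p = volcano_stats(cluster_a, cluster_b)

# Thresholds quoted for panel H of the caption.
selected = (np.abs(log2_fc) > 0.8) & (neg_log10_p > 25)
print(selected)
```

If the expression matrix is already log-transformed, a difference of means may be the more appropriate fold-change measure; the ratio used here is only one reasonable choice.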
Fig. 4
Comparison of velocities between scEGOT (cell velocity) and scVelo (RNA velocity). A Stream plot of cell velocity generated by scEGOT. B Stream plot of RNA velocity generated by scVelo (stochastic mode). C Left: Percentage of genes for which velocities can be computed by scEGOT and scVelo. Right: Histogram of the coverage of scVelo for mean expression levels. The red dashed line is the total coverage of scVelo (92.3%). D Cell velocity for scATAC-seq data of mouse innate immune cells at three time points (days 0, 1, and 28) [53]
Fig. 5
Input scRNA-seq data (gray columns) and interpolated data (white columns) on the top two principal components. The first row shows the contour plots of the Gaussian mixture distributions. The second row denotes the cell populations of the real scRNA-seq data (gray columns) and those generated by the interpolated Gaussian mixture distributions (white columns). The third to sixth rows show the gene expression values of NKX1-2, TFAP2A, TFAP2C, and PRDM1 in the cell populations
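One way to generate pseudo-cells at an intermediate time from two Gaussian mixtures is sketched below. It assumes the standard displacement (McCann) interpolation between Gaussian components and a given coupling matrix between clusters; whether scEGOT uses exactly this construction is not stated in the caption, so treat it as an illustration only. The coupling `plan` and all parameters are hypothetical.

```python
import numpy as np

def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def interpolate_gaussian(m0, c0, m1, c1, t):
    """Displacement interpolation between N(m0, c0) and N(m1, c1) at time t."""
    root = psd_sqrt(c0)
    inv_root = np.linalg.inv(root)
    # Linear optimal transport map that pushes N(0, c0) onto N(0, c1).
    ot_map = inv_root @ psd_sqrt(root @ c1 @ root) @ inv_root
    blend = (1.0 - t) * np.eye(len(m0)) + t * ot_map
    return (1.0 - t) * m0 + t * m1, blend @ c0 @ blend.T

def sample_interpolated_mixture(plan, means0, covs0, means1, covs1, t, n, rng):
    """Draw n pseudo-cells; component (k, l) has weight plan[k, l]."""
    pairs = rng.choice(plan.size, size=n, p=plan.ravel() / plan.sum())
    cells = np.empty((n, means0.shape[1]))
    for i, pair in enumerate(pairs):
        k, l = divmod(int(pair), plan.shape[1])
        mean, cov = interpolate_gaussian(means0[k], covs0[k],
                                         means1[l], covs1[l], t)
        cells[i] = rng.multivariate_normal(mean, cov)
    return cells

# Hypothetical cluster parameters and coupling between two time points.
means0 = np.array([[0.0, 0.0], [2.0, 0.0]])
covs0 = np.array([np.eye(2), np.eye(2)])
means1 = np.array([[0.0, 1.0], [2.0, 1.0]])
covs1 = np.array([np.eye(2), 4.0 * np.eye(2)])
plan = np.array([[0.3, 0.2], [0.0, 0.5]])

rng = np.random.default_rng(0)
cells = sample_interpolated_mixture(plan, means0, covs0, means1, covs1,
                                    t=0.5, n=100, rng=rng)
print(cells.shape)
```

At t = 0 and t = 1 each interpolated component reduces to its source and target Gaussian, so the sampled population moves continuously from one measured time point to the next.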
Fig. 6
Reconstruction of Waddington’s landscape and inference of the GRNs for the human PGCLC induction system. A Gene regulatory networks of the human PGCLC induction system generated from scRNA-seq data at days 0–0.5, days 0.5–1, days 1–1.5, and days 1.5–2. B–E Reconstruction of Waddington’s landscape of human PGCLC induction data. The x-, y-, and z-axes denote the PC1 and PC2 coordinates and the Waddington potential, respectively. The visualization was prepared using CellMapViewer: https://github.com/yusuke-imoto-lab/CellMapViewer. The colors indicate B the magnitude of the potential, C NKX1-2, D MESP1, and E GATA6 expression values

References

    1. Waddington CH. The strategy of the genes: a discussion of some aspects of theoretical biology. Crows Nest: Allen & Unwin; 1957.
    2. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.
    3. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.
    4. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21(1):1–35.
    5. Teschendorff AE, Feinberg AP. Statistical mechanics meets single-cell biology. Nat Rev Genet. 2021;22(7):459–76.
