Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Jan;22(1):68-81.
doi: 10.1038/s41592-024-02378-4. Epub 2024 Sep 19.

Gene-level alignment of single-cell trajectories

Affiliations

Gene-level alignment of single-cell trajectories

Dinithi Sumanaweera et al. Nat Methods. 2025 Jan.

Abstract

Single-cell data analysis can infer dynamic changes in cell populations, for example across time, space or in response to perturbation, thus deriving pseudotime trajectories. Current approaches comparing trajectories often use dynamic programming but are limited by assumptions such as the existence of a definitive match. Here we describe Genes2Genes, a Bayesian information-theoretic dynamic programming framework for aligning single-cell trajectories. It is able to capture sequential matches and mismatches of individual genes between a reference and query trajectory, highlighting distinct clusters of alignment patterns. Across both real world and simulated datasets, it accurately inferred alignments and demonstrated its utility in disease cell-state trajectory analysis. In a proof-of-concept application, Genes2Genes revealed that T cells differentiated in vitro match an immature in vivo state while lacking expression of genes associated with TNF signaling. This demonstrates that precise trajectory alignment can pinpoint divergence from the in vivo system, thus guiding the optimization of in vitro culture conditions.

PubMed Disclaimer

Conflict of interest statement

Competing interests: S.A.T. is a scientific advisory board member of ForeSite Labs, OMass Therapeutics, QIAGEN, a co-founder and equity holder of TransitionBio and EnsoCell Therapeutics, a nonexecutive director of 10x Genomics and a part-time employee of GlaxoSmithKline. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Computational alignment of single-cell transcriptomic trajectories.
a, Schematic of the concept of single-cell trajectory alignment. The input is single-cell transcriptomic data of a reference and a query that change dynamically (left), for example, in vivo cell development and in vitro cell differentiation, control and drug-treated cells in response to perturbation, responses to vaccination or pathogen challenge in healthy and diseased individuals. Aligning the reference and query can capture matches and mismatches (middle), supporting further downstream analysis (right). b, Different alignment states and their theoretical origins. Dynamic time warping and biological sequence alignment complement each other,, when capturing matches (including warps) and indels (left). An alignment (nonlinear mapping) between time points of the discretized reference (R) and query (Q) trajectories shown in a (middle). Between a reference time point Rj and query time point Qi, there may exist five different states of alignment: 1-1 match (M), warps (1-to-many expansion (V) or many-to-1 compression (W) match) and mismatch (insertion (I)/deletion (D) denoting a significant difference in one system compared to the other) (right). c, Example alignment path across a pairwise time point matrix between R and Q trajectories. Diagonal lines (green) refer to matches; vertical lines refer to either insertions (red) or expansion warps (green); horizontal lines refer to deletions (red) or compression warps (green). Any matrix cell (i,j) denotes the pairing of two Rj and Qi time points. d, An example gene alignment generated by the Genes2Genes framework. Interpolated log1p-normalized (per-cell total raw transcript count normalized to 10,000 and log1p-transformed) expression (y axis) between reference (green) and query (blue) against their pseudotime (x axis) (left). The bold lines represent mean expression trends and faded data points are 50 random samples from the estimated expression distribution at each time point. Black dashed lines visualize matches (including warps) between time points. Corresponding nonlinear mapping between R and Q time points shown in the left (right). Corresponding five-state alignment string where subsequences over [M,V,W] and [I,D] denote matched regions and mismatched regions, respectively (bottom). Illustrations in ac were created using BioRender (https://biorender.com).
Fig. 2
Fig. 2. Overview of the Genes2Genes alignment framework and workflow for comparing single-cell transcriptomic trajectories.
Given log1p-normalized cell-by-gene expression matrices of a reference (R) and query (Q) and their pseudotime estimates, G2G infers individual alignments for all genes of interest. It first interpolates data by extending mean-based interpolation in Alpert et al. (2018) to distributional interpolation and then runs Gotoh’s DP algorithm adapted for all the five alignment states (M,W,V,I,D) defined in Fig. 1b. All reported alignments are then clustered and used to deliver statistics on the overall alignment between R and Q, supporting further downstream analyses. The DP algorithm utilizes a match cost function defined under MML inference framework (top left). Given a hypothesis (model) and data, MML defines the total message length of encoding them for lossless compression along an imaginary message transmission. G2G defines two hypotheses: (1) Φ: Rj and Qi time points mismatch and (2) A: Rj and Qi time points match. Under Φ, the message length is the sum of independent encoding lengths of their interpolated expression data and corresponding Gaussian distributions. Under A, the message length is the joint encoding length of their interpolated expression data under a single Gaussian distribution (either of Rj or Qi). The match cost is computed as the difference of A and Φ per-datum encoding lengths. The DP algorithm incorporates a symmetric five-state machine which can generate a string over the alphabet, Ω = [M, W, V, D, I] describing the optimal sequential alignment states (Fig. 1b) between R and Q time points (middle left). Each arrow represents a state transition. Arrows with the same hatch mark implies equal probability of state transition. G2G computes a pairwise Levenshtein distance matrix across all five-state alignment strings to cluster genes of similar alignment pattern (bottom left). Example output of five-state alignment strings for all genes (top right). Example clustermap showcasing the clustering structure of alignments resulted from agglomerative hierarchical clustering (bottom right). The color represents the Levenshtein distance. Illustrations were created using BioRender (https://biorender.com).
Fig. 3
Fig. 3. Genes2Genes outperforms the current state-of-the-art of trajectory alignment.
a, Differences in the algorithms and outputs of CellAlign, TrAGEDy and G2G. CellAlign runs DTW, defining the state space Ω = [M, W, V] (Fig. 1b). TrAGEDy performs DTW post hoc processing, while G2G unifies DTW and gap modeling. Both of them define the state space Ω = [M, W, V, I, D] (Fig. 1b). b, Comparing features across CellAlign, TrAGEDy and G2G. c, A Gaussian process-based simulator is used to generate 3,500 simulated pairs of reference and query gene trajectories for benchmarking G2G against CellAlign and TrAGEDy, testing under three main classes of alignment patterns: matching, divergence and convergence. divergence and convergence are subcategorized based on their approximate time of bifurcation (early, mid and late), resulting in seven total patterns (each with 500 alignments). d, The three-state, cell-level alignment generated by CellAlign for each pattern (under 15 equispaced time points). e, The five-state, cell-level alignments generated by both modes of TrAGEDy (referred to as TrAGEDyMINIMUM and TrAGEDyNULL) and G2G. f, Percentages of accurate alignments by TrAGEDy and G2G across all patterns (left). Clustergram of the pairwise Levenshtein distance matrix across all G2G alignments, separating the distinct patterns using agglomerative hierarchical clustering (right). g, Comparing hierarchical clustering of the gene alignments generated by CellAlign, TrAGEDy and G2G; x axis is the number of clusters (representing varying clustering resolutions) in log scale; y axis is the mis-clustering rate (outlier percentage across all clusters). h, Cell-level alignment of two simulated trajectories with no shared process, with three example gene alignments generated by TrAGEDy and G2G. Five-state alignment strings from each method (left) and expression plots (right) of the three example genes. Column 1 shows interpolated gene expression (y axis) against pseudotime (x axis). The bold lines represent mean expression trends, while the faded data points are 50 random samples from the estimated expression distribution at each time point as generated under G2G. Columns 2–3 show the actual log1p-normalized expression (y axis) against pseudotime (x axis). Each point represents a cell. Illustrations in ac were created using BioRender (https://biorender.com). All interpolations and alignment statistics were generated using our G2G framework.
Fig. 4
Fig. 4. Genes2Genes captures matches and mismatches at gene-level resolution.
a, G2G alignment on a published time-course dataset, of murine bone-marrow-derived dendritic cells stimulated with PAM (reference) or LPS (query). b, Aggregate alignment over the alignments of 99 ‘core antiviral’ genes (top). Stacked barplots represent reference and query cell compositions across 14 equispaced pseudotime points, colored by post-stimulation sampling time; boxed segments represent mismatches; black lines represent matches. Pairwise time point matrix between reference and query (bottom). Color represents total gene count showing a match between corresponding time points. White line represents the average alignment path. c, Gene expression of three representative core antiviral genes (IRF7, STAT2 and IFIT1) in query (blue) and reference (green). Interpolated log1p-normalized (per-cell total raw transcript counts normalized to 10,000 and log1p-transformed) expression (y axis) against pseudotime (x axis) (left). Bold lines represent mean expression trends and faded data points indicate 50 random samples from the estimated expression distribution at each time point. Black dashed lines represent time point matches (captured by the alignment string below). Actual log1p-normalized expression (y axis) against pseudotime (x axis) (right). Each point represents a cell. Red circles highlight early cells (‘precocious expressers’) with high expression. d, Same plots as b for 89 ‘peaked inflammatory’ genes, clustered following their alignments (Extended Data Fig. 2). Dashed, colored lines represent example cluster-specific alignment paths. e, Same plots as c for representative genes (CXCL2, PLK2, CXCL1 and CD44) from each cluster shown in d. f, Alignment similarity (y axis) against log2 fold change of mean expression (x axis) for peaked inflammatory genes (middle). Color represents alignment similarity. Surrounding plots show interpolated log1p-normalized expression (y axis) against pseudotime (x axis) on the left and the gene expression violin plot on the right, for four selected genes (SGMS2, CCRL2, TNF and C5AR1). Green and blue violin plots include n = 179 PAM-stimulated and n = 290 LPS-stimulated cells, respectively. Violin shows expression distribution across cells as a kernel density estimation. The box inside each violin shows the interquartile range (25–75% quantiles, with a point indicating median). The illustration in a was created using BioRender (https://biorender.com). All interpolations and statistics were generated using our G2G framework.
Fig. 5
Fig. 5. Genes2Genes compares cell differentiation trajectories between healthy lung and disease lung in idiopathic pulmonary fibrosis.
a, Schematic of the healthy and IPF cell differentiation trajectories of focus, that is, differentiation of alveolar type 2 (AT2) cells into alveolar type 1 (AT1) cells in the healthy lung (reference) versus ABCs in the IPF lung (query). b, Aggregate alignment over the alignments of all highly variable genes (HVGs) (top). Stacked barplots represent reference and query cell-type compositions across 13 equispaced pseudotime points; boxed segments represent mismatches; black lines represent matches. The pairwise time point matrix between healthy and IPF pseudotime (bottom). Color represents total gene count showing a match between corresponding healthy and IPF time points. White line represents the average alignment path. c, Aggregate alignment over the alignments of 88 ABC marker genes (Supplementary Fig. 3) plotted as in b, with the aggregate alignment schematic on top, and the pairwise time point matrix in the middle. Gene expression plots for three example ABC marker genes (KRT17, MMP7 and FN1) between IPF (blue) and healthy (green) data along pseudotime, plotting interpolated log1p-normalized (per-cell total raw transcript counts normalized to 10,000 and log1p-transformed) expression (y axis) against pseudotime (x axis) (bottom). Bold lines represent mean expression trends; faded data points are 50 random samples from the estimated expression distribution at each time point. Black dashed lines represent matches between time points. d, Aggregate alignment path (white) for all EMT pathway genes, plotted on the pairwise time point matrix between healthy and IPF as in b, with the schematic on the right (top right). Heatmap of the smoothened (interpolated) and z-normalized mean log1p gene expression of genes in the EMT pathway along pseudotime (bottom right). e, Gene expression of CAMK1D between IPF (blue) and healthy (green) along pseudotime. Interpolated log1p-normalized expression (y axis) against pseudotime (x axis) as in c (top). Actual log1p-normalized gene expression versus pseudotime plots (bottom). The illustration in a was created using BioRender (https://biorender.com). All interpolations and statistics were generated using our G2G framework.
Fig. 6
Fig. 6. Genes2Genes aligns in vivo, in vitro human T cell development.
a, Schematic illustration of T cell development in the human thymus. b, Aggregate alignment over the alignments of 1,371 TFs between in vitro organoid (ATOs) and in vivo human T cell developmental trajectories, shown in the pairwise time point matrix between organoid and reference. Color represents total gene count showing a match between corresponding time points. White lines represent the average alignment path. Stacked barplots represent reference (top) and query (left) cell-type compositions across 14 equispaced pseudotime points. c, Aggregate alignment over all TFs in the pluripotency signaling pathway, plotted on the pairwise time point matrix (top left) as in b; schematic of this mapping between reference and organoid cell-type compositions across pseudotime; boxed segment represents the mismatched ATO pluripotency stage; black lines represent matches. Interpolated log1p-normalized (per-cell total raw transcript counts normalized to 10,000 and log1p-transformed) expression (y axis) against pseudotime (x axis) for selected genes (bottom left). Heatmap of the smoothened (interpolated) and z-normalized mean gene expression along pseudotime (bottom right). d, Same plots as c for all TFs in the TNF signaling via NF-κB pathway. The boxed segment in the right-top plot represents the mismatched last stage in vivo T cell maturation. e, Schematic illustration of potential targets for further optimization of in vitro T cell differentiation toward either type 1 innate T cells or conventional CD8+ T cells. f, Schematic illustration of the comparison between SP T cells from the wild-type ATOs and the TNF-treated ATOs against in vivo type 1 innate T cells. SP T cells from ATOs after TNF treatment show more maturity toward in vivo type 1 innate T cells. g, Heatmap of mean log1p-normalized gene expression of TFs within the TNF signaling via NF-κB pathway (same gene list as in d) in reference (in vivo type 1 innate T cells), SP T cells from wild-type ATOs and SP T cells from TNF-treated ATOs (ATOTNF). Illustrations in a,e,f were created using BioRender (https://biorender.com). All interpolations and statistics were generated using our G2G framework.
Extended Data Fig. 1
Extended Data Fig. 1. Simulating the bifurcation of reference and query trajectories using change point kernels.
(Note: A change point kernel defines shifts and changes in covariance between discrete time points in a time series that describes a particular Gaussian process. In the context of a single-cell pseudotime trajectory, each discrete time point corresponds to a single cell. The change point kernel can be represented by a pairwise covariance matrix between those time points, visualized using heatmaps). a, Change point kernel heatmaps for each approximate bifurcation point (change point) [0.25,0.5,0.75]. b, The same change point kernels binarized based on the 0.01 covariance threshold (top), c, The average covariance plotted for each i×i sub square matrix from i=0 to i= change point, showing that the branching effect can approximately start before the specified change point. d, Expected bifurcation region is taken from the point where we begin to see > 0.01 covariance in the change point kernel, until the particular change point. e, Illustration of the main regions of match and mismatch expected in trajectory alignment under Divergence class (left) and Convergence class (right). A Divergence alignment is described by a start-match region followed by an end-mismatch region, whereas a Convergence alignment is described by a start-mismatch region followed by an end-match region. Illustrations in d-e were created using BioRender (https://biorender.com).
Extended Data Fig. 2
Extended Data Fig. 2. Simulation Data Experiment 1.
a, Distributions of start-match lengths (following a false mismatch if any), end-mismatch lengths (prior to a false match if any), start-mismatch (false mismatch) lengths (number of false mismatches starting from time point 0), and the number of intermediate false mismatches within the match regions, in the 1500 Divergence alignments across the three bifurcation subgroups (that is, under approximate bifurcation point [0.25,0.5,0.75]; each subgroup has 500 alignments), generated by Genes2Genes, TrAGEDyMINIMUM, and TrAGEDyNULL. 15 equispaced time points over pseudotime [0,1] were used for distribution interpolation and alignment. Colored boxes (in blue, orange, and green) in the two leftmost columns display possible ranges of expected match lengths corresponding to the three different, approximate bifurcation points: [0.25,0.5,0.75], respectively. Each violin plot shows the length distribution across all n = 500 alignments in each group as a kernel density estimation. Inside the violin is a box showing the interquartile range (covering the 25% and 75% quantiles with a point indicating median). b, Similar statistics as in a, reported for the 1500 Convergence alignments across the three bifurcation subgroups, generated by Genes2Genes, TrAGEDyMINIMUM, and TrAGEDyNULL. Distributions of end-match lengths (prior to a false mismatch if there is any), start-mismatch lengths (following a false match if there is any), end-mismatch lengths (number of false mismatches until time point 1), and the number of intermediate false mismatches within the match regions. c, Cluster diagnostic plots for the hierarchical agglomerative clustering of the 3500 alignments across all pattern classes (including 500 Matching alignments, 1500 Divergence alignments, 1500 Convergence alignments), in terms of the mean Silhouette score when varying the Levenshtein distance threshold (or the number of clusters). The highest number of clusters represent the number of all unique 5-state alignment strings (that is 113 strings). Bold highlighted circles mark the local optimal mean Silhouette score which gives 15 optimal clusters for the genes at 0.22 distance threshold. d, Mis-clustering rates of the CellAlign k-means clustering outputs for all 3500 alignments, versus the number of clusters (k) ranging from k = 7 to k = 50.
Extended Data Fig. 3
Extended Data Fig. 3. Simulation Data Experiment 2.
a, Experiment 2 uses 769 genes in the mouse pancreas development (Beta lineage) scRNA-seq dataset to generate perturbed pairs of alignment from the expected Matching alignments. Perturbation scenario 1 deletes the start region from the reference trajectory, whereas perturbation scenario 2 changes the start region of the query trajectory. b, Alignment similarity distributions for varying sizes of perturbation (perturbed percentage of the 50 pseudotime interpolation points) under perturbation scenario 1, resulted from gene-level alignment using TrAGEDyMINIMUM, TrAGEDyNULL, and Genes2Genes. Each point in the violin plot represents a gene (total number of genes n = 769). In each plot, the observed average alignment similarity across different perturbation sizes is shown by the green line. Blue line shows the expected alignment similarity across different perturbation sizes. Each violin plot shows the distribution of alignment similarities across all gene alignments in each group as a kernel density estimation. Inside the violin is a box showing the interquartile range (covering the 25% and 75% quantiles with a point indicating median). c, The alignment similarity distributions for varying sizes of perturbation under perturbation scenario 2 similar to b. There are two expected lines: maximum (in blue) and minimum (in red). The maximum mismatch length is expected when both reference and query time points form insertions and deletions, making the maximum expected length size*2. The minimum mismatch length is expected when only the changed reference time points are mismatched as insertions, while the corresponding query time points are matched to the non-perturbed reference time points (illustrated in e). d, Overall smoothened (interpolated) and z-normalized mean gene expression along pseudotime (across 50 equispaced interpolation time points) for all genes in the dataset. e, Example illustrations of the two types of trajectory alignment that gives the minimum and maximum expected mismatch lengths under the perturbation scenario 2, where a start portion of a particular size in the query trajectory (in blue) is changed with respect to the reference trajectory (in green). Illustrations in a,e were created using BioRender (https://biorender.com).
Extended Data Fig. 4
Extended Data Fig. 4. The PAM vs. LPS alignment using Genes2Genes.
a, Density plot of the alignment similarity distribution (that is distribution of the percentage of matches/warps across all the alignment outputs) for the 99 genes in the ‘core antiviral module’. b, Top: Cluster diagnostic plots for the hierarchical agglomerative clustering of those 99 gene alignments in terms of the mean Silhouette score when varying the Levenshtein distance threshold (or the number of clusters). The highest number of clusters represent the number of all unique 5-state alignment strings (that is 48 strings). Bold highlighted circles mark the local optimal mean Silhouette score which gives two optimal clusters for the genes at 0.4 distance threshold. Bottom: Each plot titled by “Cluster-x | n” is the pairwise matrix of reference time points (columns) and query time points (rows), visualizing alignment paths for the total of n genes in cluster x. c, Density plot of the alignment similarity distribution (that is distribution of the percentage of matches/warps across all the alignment outputs) for the 89 genes in the ‘peaked inflammatory module’. d, Cluster diagnostic plots for the hierarchical agglomerative clustering of those 89 gene alignments, reported similarly to b. The identified optimal clustering structure has 7 clusters (at distance threshold=0.37 corresponding to the local optimal mean Silhouette score, highlighted by bold circles). e, The clustermap of the pairwise Levenshtein distance matrix of all 89 gene alignments, which illustrates the identified 7 clusters.
Extended Data Fig. 5
Extended Data Fig. 5. The healthy versus Idiopathic Pulmonary Fibrosis (IPF) disease case study alignment clustering outputs.
a, Density plot of the alignment similarity distribution (that is distribution of the percentage of matches/warps across all the alignment outputs) for the 994 highly variable genes in the dataset. b, Cluster diagnostic plots for the hierarchical agglomerative clustering of those 994 gene alignments in terms of the mean Silhouette score when varying the Levenshtein distance threshold (or the number of clusters). The highest number of clusters represent the number of all unique 5-state alignment strings (that is 325 strings). Bold highlighted circles mark the local optimal mean Silhouette score which gives nine optimal clusters for the genes at 0.48 distance threshold. c, The identified clustering structure. Left: Each plot titled by “Cluster-x | n” is the pairwise matrix of reference and query time points, visualizing alignment paths for all the genes (one alignment per gene and a total of n genes in the cluster) in a cluster x. Right: The clustermap of the pairwise Levenshtein distance matrix of all 994 gene alignments.
Extended Data Fig. 6
Extended Data Fig. 6. In vivo, in vitro human T cell development data integration and pseudotime inference.
a, Top: schematic showing the experimental setup of T cell differentiation from iPSCs in ATOs. Bottom: barplot of cell type composition in ATO at different time points during differentiation. b, UMAP visualization of different cell types in the ATO dataset (low-level annotation, number of cells n = 31,483), with more refined annotation in Extended Data Fig. 7a. c, Workflow of integrating in vitro (that is ATO) and in vivo (that is pan fetal reference from Suo et al.) human T cell development data and pseudotime inference using GPLVM. d, Main: UMAP visualization of integrated in vivo and in vitro human T cell development data, colored by the cell types. Right insert: the same UMAP visualization colored by the data source. e, Stripplot of the inferred pseudotime (x axis) against different cell types (y axis), colored by the cell types, of in vivo pan fetal reference data (top) and in vitro organoid data (bottom). Illustrations in a,c were created using BioRender (https://biorender.com).
Extended Data Fig. 7
Extended Data Fig. 7. Analysis of artificial thymic organoid scRNA-seq data.
a, UMAP visualization of different cell types in the ATO (refined annotation). iPSC: induced pluripotent stem cell, HSC_MPP: hematopoietic stem cell, and multipotent progenitor, LMPP_MLP: lymphoid-primed multipotent progenitor and multi lymphoid progenitor, DC: dendritic cell, CMP: common myeloid progenitor, GMP: granulocyte and monocyte progenitor, MK: megakaryocyte, MEP: megakaryocyte erythroid progenitor, YS_ERY: yolk sac-like erythrocyte, EARLY_ERY: early erythrocyte, MID_ERY: mid-stage erythrocyte, DN(EARLY) T: early double negative T cell, DN T: double negative T cell, DP(P) T: proliferating double positive T cell, DP(Q) T: quiescent double positive T cell, SP T: single positive T cell, NK: natural killer cell, ILC: innate lymphoid cell. b, Predicted annotations from logistic regression model with CellTypist using the developing human immune atlas as the training dataset, overlaid on the same UMAP plot as in a.
Extended Data Fig. 8
Extended Data Fig. 8. Annotation of artificial thymic organoid scRNA-seq data.
For each subset lineage embedding generated through scVI, we show UMAP embeddings of cells colored by annotated cell populations and dot plots of mean expression (log-normalized counts, dot color) and fraction of expressing cells (dot size) of marker genes (columns) used for cell population annotation (rows). a, Annotation of non-hematopoietic cells. b, Annotation of T/ILC/NK lineage cells. c, Annotation of other hematopoietic cells that are not in b.
Extended Data Fig. 9
Extended Data Fig. 9. Pan fetal reference vs artificial thymic organoid alignment clustering outputs.
a, Density plot of the alignment similarity distribution (that is distribution of the percentage of matches/warps across all the alignment outputs) for all 1371 transcription factors. b, Cluster diagnostic plots for the hierarchical agglomerative clustering of those 1371 TF alignments in terms of the mean Silhouette score when varying the Levenshtein distance threshold (or the number of clusters). The highest number of clusters represent the number of all unique 5-state alignment strings (that is 355 strings). Bold highlighted circles mark the local optimal mean Silhouette scores which give 22 optimal clusters for the genes at 0.45 distance threshold (low resolution), and 136 clusters at 0.18 distance threshold (high resolution). c, The identified clustering structure. Left: Each plot titled by “Cluster-x | n” is the pairwise matrix of reference and query time points, visualizing alignment paths for all the genes (one alignment per gene and a total of n genes in the cluster) in a cluster x. Right: The clustermap of the pairwise Levenshtein distance matrix of all TF alignments. Bottom: Identified interesting clusters (that is Cluster 2 representing early mismatched TFs, Cluster 0 representing middle mismatched TFs, Cluster 5 & 10 representing almost 100% mismatched TFs), with their aggregate alignments as 5-state strings.
Extended Data Fig. 10
Extended Data Fig. 10. Further downstream analysis of in vivo, in vitro human T cell development alignment.
a, Genes2Genes aggregate alignment for all 196 genes in the TNFα signaling via NF-κB pathway, between in vitro organoid (ATO) and in vivo reference. Left: pairwise time point matrix between reference and organoid. Color represents total gene count showing a match between corresponding time points. White line represents the average alignment path. Right: Schematic aggregate mapping between pseudotime points. Stacked bar-plots represent reference and organoid cell-type compositions across 14 equispaced pseudotime points; Black lines represent matches. b, Aggregate alignment similar to a, between in vivo Type 1 Innate T cell reference (T1) and ATO across 1220 TFs (left), and CD8+ Reference and ATO across 1219 TFs. c, Alignment differences between in vivo conventional CD8+ T lineage versus organoid, and T1 lineage versus organoid. Middle: alignment similarity difference (y axis) against log2 fold change of mean expression between CD8 + T and T1 cells (x axis). Color reflects the absolute value of alignment similarity difference. Surrounding plots: the interpolated log1p-normalized (that is per-cell total sum of the raw transcript counts normalized to 10,000 and log1p-transformed) expression (y axis) against pseudotime (x axis) showing the alignment between T1 lineage (green) and ATO (blue) (top), and the alignment between CD8 + T lineage (orange) and ATO (blue) (bottom), for four selected genes. Bold lines represent mean expression trends; Faded data points are 50 random samples from the estimated expression distribution at each time point. Black dashed lines represent matches. d, Heatmap of mean log1p-normalized gene expression of all 196 genes within TNF signaling pathway in in vivo type 1 innate T cells, ATO SP T cells, and TNF-treated-ATO SP T cells. e, Gene-level alignments (similar to plots in c) for five, single positive (SP) T cell maturity markers (IL7R, KLF2, S1PR1, SELL, FOXO1) between T1 and ATO. f, Mean log1p-normalized gene expression of those marker genes compared across reference type 1 innate T cells, ATO SP T cells, and TNF-treated-ATO SP T cells. Illustrations in d-e were created using BioRender (https://biorender.com). All interpolations and statistics were generated using our Genes2Genes framework.

References

    1. Schier, A. F. Single-cell biology: beyond the sum of its parts. Nat. Methods17, 17–20 (2020). - PubMed
    1. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol.37, 547–554 (2019). - PubMed
    1. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol.32, 381–386 (2014). - PMC - PubMed
    1. Bellman, R. The theory of dynamic programming. Bull. Am. Math. Soc.60, 503–515 (1954).
    1. Vintsyuk, T. K. Speech discrimination by dynamic programming. Cybernetics4, 52–57 (1972).

LinkOut - more resources