. 2023 Feb;25(2):337-350.

doi: 10.1038/s41556-022-01072-x. Epub 2023 Feb 2.

Biologically informed deep learning to query gene programs in single-cell atlases

Mohammad Lotfollahi^#^{1

2}, Sergei Rybakov^#^{1

3}, Karin Hrovatin^{1

4}, Soroor Hediyeh-Zadeh^{1

5}, Carlos Talavera-López^{1

6}, Alexander V Misharin⁷, Fabian J Theis^{8

9

10

11}

Affiliations

¹ Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany.
² Wellcome Sanger Institute, Cambridge, UK.
³ Department of Mathematics, Technical University of Munich, Munich, Germany.
⁴ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.
⁵ Bioinformatics Division, WEHI, Melbourne, Victoria, Australia.
⁶ Division of Infectious Diseases and Tropical Medicine, Ludwig-Maximilian-Universität Klinikum, Munich, Germany.
⁷ Division of Pulmonary and Critical Care Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.
⁸ Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany. fabian.theis@helmholtz-muenchen.de.
⁹ Wellcome Sanger Institute, Cambridge, UK. fabian.theis@helmholtz-muenchen.de.
¹⁰ Department of Mathematics, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-muenchen.de.
¹¹ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-muenchen.de.

^# Contributed equally.

PMID: 36732632
PMCID: PMC9928587
DOI: 10.1038/s41556-022-01072-x

Biologically informed deep learning to query gene programs in single-cell atlases

Mohammad Lotfollahi et al. Nat Cell Biol. 2023 Feb.

. 2023 Feb;25(2):337-350.

doi: 10.1038/s41556-022-01072-x. Epub 2023 Feb 2.

Authors

Mohammad Lotfollahi^#^{1

2}, Sergei Rybakov^#^{1

3}, Karin Hrovatin^{1

4}, Soroor Hediyeh-Zadeh^{1

5}, Carlos Talavera-López^{1

6}, Alexander V Misharin⁷, Fabian J Theis^{8

9

10

11}

Affiliations

¹ Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany.
² Wellcome Sanger Institute, Cambridge, UK.
³ Department of Mathematics, Technical University of Munich, Munich, Germany.
⁴ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.
⁵ Bioinformatics Division, WEHI, Melbourne, Victoria, Australia.
⁶ Division of Infectious Diseases and Tropical Medicine, Ludwig-Maximilian-Universität Klinikum, Munich, Germany.
⁷ Division of Pulmonary and Critical Care Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.
⁸ Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany. fabian.theis@helmholtz-muenchen.de.
⁹ Wellcome Sanger Institute, Cambridge, UK. fabian.theis@helmholtz-muenchen.de.
¹⁰ Department of Mathematics, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-muenchen.de.
¹¹ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-muenchen.de.

^# Contributed equally.

PMID: 36732632
PMCID: PMC9928587
DOI: 10.1038/s41556-022-01072-x

Abstract

The increasing availability of large-scale single-cell atlases has enabled the detailed description of cell states. In parallel, advances in deep learning allow rapid analysis of newly generated query datasets by mapping them into reference atlases. However, existing data transformations learned to map query data are not easily explainable using biologically known concepts such as genes or pathways. Here we propose expiMap, a biologically informed deep-learning architecture that enables single-cell reference mapping. ExpiMap learns to map cells into biologically understandable components representing known 'gene programs'. The activity of each cell for a gene program is learned while simultaneously refining them and learning de novo programs. We show that expiMap compares favourably to existing methods while bringing an additional layer of interpretability to integrative single-cell analysis. Furthermore, we demonstrate its applicability to analyse single-cell perturbation responses in different tissues and species and resolve responses of patients who have coronavirus disease 2019 to different treatments across cell types.

PubMed Disclaimer

Conflict of interest statement

F.J.T. consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd and Omniscope Ltd, and has ownership interest in Dermagnostix GmbH and Cellarity. The other authors declare no competing interests.

Figures

**Fig. 1. Biologically informed reference mapping using expiMap.**
a,b, Domain knowledge from databases, articles and expert knowledge (a) is used to construct a binary matrix of GPs (b). c, The model is trained on reference data, received gene expression and study labels for each cell to encode a set of latent variables representing GPs. The GPs are pruned and enriched by the model using a group lasso and gene-level sparsity regularization, respectively, and fed into a linear decoder. The GP matrix is then used to program the neural network architecture by wiring the model parameters of the decoder to learn a specific GP for each latent dimension. d, The reference model is expanded and fine-tuned upon mapping query data using architecture surgery, whereas new learnable latent GPs are added and trained with the query data. The decoder architecture equals c with the difference that only highlighted weights of newly added GPs are trainable in the encoder and decoder. To make sure these newly learned unconstrained GPs do not overlap with reference GPs, we employ statistical independence constraints.

**Fig. 2. ExpiMap resolves GPs after IFN-β perturbation.**
a, UMAP representation of the query control and IFN-β-stimulated cells from eight patients (n = 13,576 cells) mapped onto a healthy immune reference from four different studies (n = 32,484 cells) using expiMap. Colours demonstrate study (left), harmonized cell type (middle) and data source (right). HSPCs, haematopoietic stem and progenitor cells. b, Differential GP analysis results between query IFN-β and control cells from the query and reference. The x axis shows the ranking of GPs; the y axis denotes the significance (absolute log-Bayes factor) of each GP. c, Visualization of both the reference and query data in the context of the top two most significant expiMap latent GPs in b. Each dot shows the latent GP score of each cell. d, Visualization of the query and reference in various GPs, delineating cell types or perturbation states for B cells and CD14⁺/16⁺ monocytes. e, The activity of the most differentially active GP terms in CD14⁺ monocytes after IFN-β stimulation. Each violin plot demonstrates the distribution of latent GP values across different cell types. The dashed square highlights GPs characterizing the myeloid-specific response to IFN-β. Source data

**Fig. 3. Domain awareness improves performance in downstream tasks.**
a, UMAP representation of integrated healthy immune reference with query interferon IFN-β data from eight patients for expiMap and existing reference mapping methods. Colours denote the data source and cell type. The dotted circle highlights query control monocytes that scArches + scVI failed to integrate into the control reference. b, Comparison of integration accuracy for mapping control query cells (excluding IFN-β cells) onto healthy atlases across different models. The metrics measure batch correction and bioconservation. The dotted line is the overall score calculated on the basis of the mean of all metrics. c, expiMap retains the expressiveness of an unconstrained reference model, as shown by the comparison of reference building performance through benchmarking in five different tissues, including PBMCs (n = 161,764, n_batches = 8), heart (n = 18,641, n_batches = 4), lung (n = 65,662, n_batches = 19), colon (n = 34,772, n_batches = 12) and liver (n = 113,063, n_batches = 14) across three different methods. The y axis is the average score of the nine metrics detailed in b. PC regression, principal component regression. Source data

**Fig. 4. Learning new GPs from query data.**
a, Distribution of single-cell latent representation values across newly learned GPs across different query data cell types for query IFN-β-treated cells and control cells. b,c, Comparison of overlap of the most influential genes dominating the variance in newly learned constrained B-cell nodes (b) and unconstrained nodes (c) with genes in existing related GPs and top genes obtained from the differential testing analysis. The terms ‘MYELOIDS_DEG’ and ‘B_CELLS_DEG’ refer to genes obtained from one versus all Wilcoxon rank-sum tests in the query control cells for each population, respectively. The myeloid population consists of CD14⁺ monocytes, CD16⁺ monocytes and DC populations. ‘INF_VS_CTRL_DEG’ denotes differentially expressed genes comparing IFN-β-treated and control cells. The existing GPs for c are those with maximal overlap with at least 12 genes with newly learned GPs. d–f, Visualization of newly learned GPs (for cells from the reference and query datasets with cell types present in the query dataset) discriminating specific cell types and states from the rest, such as B cells and myeloids with the effect of IFN removed (d) or B cells with the effect of IFN preserved (e,f). g–i, UMAP of expiMap’s latent space for the query dataset coloured by node 3 latent representation values (g), TMSB4X gene expression counts (h) and cell types (i). The dotted circle highlights DCs. Source data

**Fig. 5. expiMap analysis highlights the importance of the annexin gene family communication pathway during moderate and severe COVID-19.**
a, Illustration of the integrated datasets from PBMCs of healthy controls, patients with severe COVID treated with tocilizumab, and patients in the remission stages, and in vitro IFN-stimulated PBMCs. Figure made with BioRender. b, Integrated manifold using expiMap showing combined healthy PBMCs (n = 32,484), two query datasets including two patients with COVID-19 (n = 18,752) and the IFN-β dataset (n = 13,576) (ref. ). c, Detailed cell type annotation of the integrated PBMC datasets. Red circles highlight cells not merged with the healthy PBMC cell atlas. ModDC, monocyte-derived dendritic cells; CD14⁺ Mo, CD14⁺ monocytes; CD16⁺ Mo, CD16⁺ monocytes; pDC, plasmacytoid dendritic cells; pB, plasma B cells. d,e, Distribution plots for differential GP activities were obtained using expiMap for CD8⁺ T cells and CD14⁺ monocytes, highlighting the antiviral transcriptional programs for *RIG-I*/*MDA5* and *GPCRs* in each population. ILS, interleukins. Scatter plots are latent GPs representations of highlighted GPs for each cell type. f, Annexin communication pathways in different stages of COVID. In the severe stage (P1D1), CD14⁺ and CD16⁺ monocytes participate in a dynamic communication activity via annexins with NK and CD8⁺ T cells. This circuit converges to focused signalling to CD16⁺ monocytes during COVID remission (P1D5). In P2, CD14⁺ monocytes receive focused annexin signalling from NK, CD8⁺ and CD4⁺ T cells in the severe stage (P2D1), and later converge to signalling to CD14⁺ monocytes from the same lymphoid effectors during remission (P2D5). Source data

**Fig. 6. Reference mapping of pancreatic islet cells using expiMap.**
a, Pancreatic islet cell analysis. The expiMap model was trained on heterogeneous non-T2D mouse pancreatic islet cells from different datasets. A dataset containing healthy and treated T2D-model cells was mapped to this reference. expiMap was trained with GPs from PanglaoDB to evaluate cell type annotation and scores from Reactome to determine metabolic differences between healthy and T2D-model beta cells. b, expiMap-integrated UMAP coloured by dataset shows three reference Pancreas datasets (45,178 cells) and one query dataset (36,899 cells). c, Healthy and T2D-model beta cells from the reference and query separate on UMAP. d, The expiMap score for immune B cells highlights a subpopulation of cells previously annotated under the umbrella term of immune cells. The score for acinar cells helps annotate the small acinar cell type cluster, which was not annotated in automatic cell type transfer owing to low classifier certainty. e, Low redundancy of the top differential Reactome pathways between healthy and T2D-model query beta cells. Genes (columns) associated with each GP (rows) are marked in white; the absence of a gene in a GP is indicated by dark colour. The displayed matrix was clustered both by genes and GPs. f, The immune interaction GP scores in insulin-treated T2D-model beta cells from the query are bimodally distributed. g,h, Beta-cell scores of selected GPs differentially active in T2D-model beta cells. Comparison of UPR and protein synthesis and processing GP activites (g) and comparison of N-linked glycosylation and immune GP activity (h). Legend is shown in c. On the first subplot, circles mark the T2D-model population with relatively high scores in UPR and mRNA metabolism compared with healthy control from the query, whereas other non-T2D cells from the reference show high mRNA metabolism without a high UPR score. The circle indicates the T2D-model population with extreme UPR and asparagine N-linked glycosylation scores. ref: reference datasets, other samples are from the query; STZ: streptozotocin T2D model; STZ_GLP-1: STZ treated with GLP-1; STZ_estrogen: STZ treated with oestrogen; STZ_GLP-1_estrogen: STZ treated with GLP-1-oestrogen conjugate; STZ_insulin: STZ treated with insulin; STZ_GLP-1_estrogen+insulin: STZ treated with GLP-1-oestrogen conjugate and insulin; control: healthy control. Source data

**Extended Data Fig. 1. Differential GP analysis results for cell types from the integrated query and reference PBMCs with five datasets.**
Differential GP analysis results between cell types (one vs all test) for cell types in the query data. The x-axis is the ranking of GPs; the y-axis denotes the significance (absolute log-Bayes factor) of each GP. Source data

**Extended Data Fig. 2. Comparison of expiMap Bayes Factors with GSEA –log_10(FDR).**
For comparison, we show Bayes factors from expiMap and FDR from fry. **(a)** Overall stimulated vs control tests (all cell types are pooled together). **(b)** Results for CD14⁺ monocytes stimulated vs control tests. **(c)** B cells vs the rest of the cell types. **(d)** CD16⁺ monocytes vs the rest of the cell types, **(e)** CD14⁺ monocytes vs the rest of the cell types. The x-axis shows the negative logarithm of the false discovery rate (-log10FDR) from fry; the y-axis shows the absolute value of the logarithm of the Bayes Factor (|loge(Bayes Factor)|) from the expiMap test. The size of the circles is proportional to the size of the gene set. We observed an overall agreement between expiMap and conventional GSEA results; however, expiMap detected more specific gene programs in some comparisons, while being computationally efficient with regard to runtime as the differential gene expression and enrichment testing steps no longer have to be repeated for every individual comparison. Source data

**Extended Data Fig. 3. Analysis of GPs characteristics and their relationship to genes.**
**(a)** Distribution of correlation between mean expression value and gene importance score for each gene. X-axis: number of top-scored genes, and the Y-axis denotes the Pearson correlation of mean log-normalized expression for top n genes denoted in x-axis in each GP with their importance scores (each dot represents an active GP, n = 247). **(b-d)** - Scatter plots demonstrating the relation between mean gene expression and gene importance scores from expiMap for GPs, selected from the highlighted group in **(a)**. Example of a GP with a high positive correlation **(b)**, a GP with a correlation near zero **(c)**, and a GP with a negative correlation **(d)**. The X-axis shows the mean log-normalized expression of genes, y - gene importance score in a GP. Correlations are shown for the top 20 genes by gene importance scores. Each dot is a gene. **(e)** Box plot for the entropies of normalized importance scores of the top 50 genes for each active GP (n = 247) divided by the maximal entropy (of uniform distribution). The normalized entropy scale is from 0 (absolutely concentrated) to 1 (uniformly spread weights). Box plot statistics: lower quartile = 0.89, upper quartile = 0.94, median = 0.916, lower whisker = 0.83, upper whisker = 0.998, min = 0.75, max = 0.998. **(f)** Histogram of importance score for top 50 genes in MHC II ANTIGEN PRESENTATION, this GP has normalized entropy 0.799. **(g)** Histogram of importance score for top 50 genes in SIGNALING BY GPCR, this GP has normalized entropy of 0.988. Source data

**Extended Data Fig. 4. Results for expiMap model trained with downsampled query dataset (IFN-β dataset).**
**(a)** UMAP plot for the latent spaces of reference and downsampled query across different query dataset sizes. **(b)** Average of the integration metrics assessing the integration of reference and control query for all query sizes. **(c)** Results of the expiMap Bayes tests between control (reference + query) and IFN-β stimulated (in query). Source data

**Extended Data Fig. 5. Results for expiMap model trained on IFN-β dataset alone.**
**(a)** UMAP plot of expiMap latent space control and IFN-β stimulated cells from eight patients (n = 13,576 cells), used before as a query dataset. Colors demonstrate cell type (left), and condition (right). **(b)** Differential GP analysis results between IFN-β stimulated and control cells. The x-axis shows the ranking of GPs; the y-axis denotes the significance (absolute log-Bayes factor) of each GP. **(c)** Scatter plot of the scores of the top two most significant expiMap GPs in **(b)**. Each dot shows the latent score of each cell. **(d)** Visualization of the scores for various GPs, delineating cell types or perturbation states for B cells and CD14 + /16+ monocytes. **(e)** Differential GP analysis results for CD14 + Monocytes only between IFN-β stimulated and control CD14+ monocyte cells, for both IFN-β dataset only and reference mapping. Source data

**Extended Data Fig. 6. Benchmarking the reference building and assessing subsampling effects on data integration.**
**(a)** Comparison of the reference building performance by benchmarking across five different tissues, PBMCs (n = 161,764), heart (n = 18,641), lung (n = 65,662), colon (n = 34,772), and liver (n = 113,063), and four different methods. **(b)** Subsampling effect on data integration. The overall integration accuracy for different subsamples of PBMCs (n = 161,764) data across different models. The x-axis denotes the proportion of the data used for training each model; the y-axis is the overall average score across nine integration metrics measuring both biological preservation and batch removal, as introduced in Fig. 3b. Source data

**Extended Data Fig. 7. Learning new GPs for query data and benchmarking expiMap for enriching predefined GPs.**
**(a)** Comparison of the top influential genes dominating the variance in node 1 with genes from the top GPs for CD14 + /16+ monocytes from Fig. 2d. (b) Distribution of the number of overlapping genes between top 50 influential genes of new unconstrained nodes and the reference GPs for Fig. 4. Y-axis - number of shared genes between the top 50 genes of the unconstrained new nodes indicated in the x-axis and the reference GPs (n = 276). Each point is the number of shared genes with one reference GP. **(c-e)** Quantifying separations in Fig. 4. **(c)** Results of the differential expiMap test between IFN-β stimulated cells and control cells and B cells and the rest **(d)**. **(e)** Results of the differential expiMap test between Myeloid cells and the rest. In **(c-e)** x-axis - is the rank of the GP, y-axis - is the absolute value of the log Bayes score. **(f-g)** Benchmarking expiMap for enriching predefined GPs. The expiMap model was trained on the PBMCs dataset from Kang et al. (n = 13,576), with [CYTOKINE_SIGNALING_IN_IMMUNE_S’, ‘INTERFERON_ALPHA_BETA_SIGNALING’, ‘ANTIVIRAL_MECHANISM_BY_IFN_STI’, ‘INTERFERON_GAMMA_SIGNALING’, ‘IMMUNE_SYSTEM’] removed from GPs obtained from the Reactome database and only ‘INTERFERON_SIGNALING’ was kept for training. The x-axis shows the number of deleted top genes in the ‘INTERFERON_SIGNALING’ program, while the y-axis shows the percentage of those genes added to the top 30 genes in the ‘INTERFERON_SIGNALING’ program after training. The colors show the different values of L1 sparsity for each experiment. **(f)** The x-axis is the same as in **(g);** the y-axis demonstrates the percentage of the original interferon-related genes among the top 30 genes in the ‘INTERFERON_SIGNALING’ program after training. When the y-axis value is smaller than 1.0, it means that a 1–y percentage of false-positive genes was added to the GP. Source data

**Extended Data Fig. 8. Transcriptional activity of cellular communication circuits occurring in severe COVID.**
Transcriptional activity of genes associated with cell-cell communication pathways and interferon-gamma *(IFNG)* in different cell types and conditions. The cell-cell communication pathways represented are annexins (*ANXA1, FPR1)*, THBS *(THBS1, CD36, CD47)*, ICAM *(ICAM1, ICAM2, ITGAL, SPN, ITGAX, ITGAM, ITGB2)*, LCK, and CCL *(CCL3, CCL4, CCL5)*. Source data

**Extended Data Fig. 9. Comparison of integration results across different integration methods.**
**(a)** UMAPs of integrated embeddings obtained with different integration methods, **(b)** comparison of integration quality across methods, **(c)** PAGA of integrated beta cells indicates that the connection of reference cells with query control cells is the strongest, the connection of T2D-model query cells treated with insulin is moderate, and the connection with other T2D-model cells is the weakest. **(d)** expiMap term scores in beta cells correspond to the known loss of beta cell identity, dedifferentiation, and transdifferentiation in diabetes. Left, loss of beta cell identity (x-axis) vs dedifferentiation-related (y-axis) expiMap terms; right, loss of beta cell identity (x-axis) vs transdifferentiation-related (y-axis) expiMap terms. Source data

**Extended Data Fig. 10. Developmental dataset from mouse endocrinogenesis.**
**(a)** UMAP plot of the latent space of expiMap for mouse endocrinogenesis dataset (n = 25,919) when mapping embryonic day (E) 12.5 and E13.5 to reference containing population from E14.5 and E15.5 colored by cell types. **(b)** UMAP plot of the latent space colored by the latent scores corresponding to the Reactome GP Cell Cycle (b), Developmental Biology (c), and regulation of Beta cell development (d). Developmental Biology and Regulation of Beta cell development GPs were inferred from differential GP analysis across cell types. **(e-g)** Scatter plots of different latent GP scores highlight the separability of different populations. Source data

See this image and copyright information in PMC

References

1. Bartosovic M, Kabbe M, Castelo-Branco G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nat. Biotechnol. 2021;39:825–835. doi: 10.1038/s41587-021-00869-9. - DOI - PMC - PubMed
1. Stoeckius M, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods. 2017;14:865–868. doi: 10.1038/nmeth.4380. - DOI - PMC - PubMed
1. Stoeckius, M. et al. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 19, 224 (2018). - PMC - PubMed
1. Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021). - PMC - PubMed
1. Lotfollahi M, Wolf FA, Theis FJ. scGen predicts single-cell perturbation responses. Nat. Methods. 2019;16:715–721. doi: 10.1038/s41592-019-0494-8. - DOI - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
Medical
- MedlinePlus Health Information

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Biologically informed deep learning to query gene programs in single-cell atlases

Affiliations

Biologically informed deep learning to query gene programs in single-cell atlases

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Medical