Nat Commun. 2024 Sep 5;15(1):7762.

doi: 10.1038/s41467-024-51382-x.

scConfluence: single-cell diagonal integration with regularized Inverse Optimal Transport on weakly connected features

Jules Samaran¹, Gabriel Peyré², Laura Cantini³

Affiliations

¹ Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, Paris, France.
² CNRS and DMA de l'Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure, Université PSL, Paris, France.
³ Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, Paris, France. laura.cantini@pasteur.fr.

PMID: 39237488
PMCID: PMC11377776
DOI: 10.1038/s41467-024-51382-x

Jules Samaran et al. Nat Commun. 2024.

Abstract

The abundance of unpaired multimodal single-cell data has motivated a growing body of research into the development of diagonal integration methods. However, the state of the art suffers from the loss of biological information due to feature conversion and struggles with modality-specific populations. To overcome these crucial limitations, we introduce scConfluence, a method for single-cell diagonal integration. scConfluence combines uncoupled autoencoders on the complete set of features with regularized Inverse Optimal Transport on weakly connected features. We extensively benchmark scConfluence in several single-cell integration scenarios, proving that it outperforms the state of the art. We then demonstrate the biological relevance of scConfluence in three applications. We predict spatial patterns for Scgn, Synpr and Olah in scRNA-smFISH integration. We improve the classification of B cells and Monocytes in highly heterogeneous scRNA-scATAC-CyTOF integration. Finally, we reveal the joint contribution of Fezf2 and apical dendrite morphology in Intra Telencephalic neurons, based on morphological images and scRNA.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. The scConfluence framework for diagonal integration.**
a Schematic representation of the framework, simplified to two modalities ($M = 2$). The original data matrices $X^{(1)}$ and $X^{(2)}$ are input to their respective autoencoders, while the converted feature matrices $Y^{(1)}$ and $Y^{(2)}$ (shorthand for $Y^{(1,2)}$ and $Y^{(2,1)}$) are used to compute an Optimal Transport plan across the two modalities. The IOT loss $L_{\mathrm{IOT}}$, computed from the transport plan, and the regularization loss $L_{\mathrm{reg}}$ together constitute the rIOT constraint and are used to enforce the alignment of modalities in the shared latent space. b Two example outputs of scConfluence: cell embeddings can be visualized using 2D projections and clustered to discover new cell subpopulations, and they can also be used to impute features across modalities.
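The idea in panel (a), a transport plan computed on converted features and then used to weight distances in the latent space, can be illustrated with a minimal sketch. It assumes a balanced entropic plan obtained with Sinkhorn iterations, uniform cell weights, and squared Euclidean costs; the published method may use a different (for example unbalanced) formulation, and the regularization term is not shown. All function names, the `eps` value, and the toy matrices are hypothetical.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cost_matrix(A, B):
    """Pairwise squared distances between the rows of A and the rows of B."""
    return [[sq_dist(a, b) for b in B] for a in A]


def sinkhorn_plan(C, eps=1.0, n_iter=200):
    """Entropic OT plan for cost C with uniform marginals (assumption).

    Note: very large costs relative to eps can underflow exp() to zero.
    """
    n, m = len(C), len(C[0])
    a = [1.0 / n] * n  # uniform weights on modality-1 cells
    b = [1.0 / m] * m  # uniform weights on modality-2 cells
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_weighted_cost(P, C_latent):
    """Inner product <P, C_latent>: latent distances weighted by the plan."""
    return sum(
        P[i][j] * C_latent[i][j]
        for i in range(len(P))
        for j in range(len(P[0]))
    )


# Toy converted features Y1, Y2 (used for the plan) and latent codes Z1, Z2.
Y1 = [[0.0, 0.0], [5.0, 5.0]]
Y2 = [[0.1, 0.0], [5.0, 4.9]]
Z1 = [[0.0], [1.0]]
Z2 = [[0.0], [1.0]]

P = sinkhorn_plan(cost_matrix(Y1, Y2))
print(transport_weighted_cost(P, cost_matrix(Z1, Z2)))
```

In this toy case the plan concentrates on the pairs of cells whose converted features agree, so the weighted latent cost is small when those same pairs are close in the latent space and large when they are not.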

**Fig. 2. Benchmarking cell embeddings in unbalanced cell lines.**
a Schematic representation of the benchmarking process. Four scenarios are here considered: removing half of K562 scRNA cells, removing all K562 scRNA cells, and removing completely K562 scRNA cells and HCT scATAC cells; b Purity, Transfer accuracy, Connectivity, and Fraction Of Samples Closer Than the True Match (FOSCTTM) scores are here reported for the six benchmarked methods (scConfluence, Seurat, Liger, MultiMAP, Uniport, and scGLUE) on the four controlled settings derived from the cell lines data as described in (a). Since purity, transfer accuracy, and connectivity scores are based on nearest neighbors graphs, the plots report their behavior for various sizes of neighborhood (x-axis). Error bars in the plots specify the standard deviation across n = 5 random initialization seeds for each method and they are centered on the median result. Inside bar plots, small dark stars represent individual seed results. Source data are provided as a Source Data file; c The six columns of this panel provide UMAP visualizations for the six benchmarked methods (scConfluence, Seurat, Liger, MultiMAP, Uniport, and scGLUE) on the same four controlled settings derived from the cell lines data. Different colors in these UMAP plots correspond to the three different cell lines present in the data while the shape of the point markers corresponds to the modality of origin of each cell (scRNA, scATAC).
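For readers unfamiliar with FOSCTTM, the following is a minimal sketch of how such a score is commonly computed when ground-truth pairing between the two modalities is available. It assumes that row `i` of both embedding matrices is the same cell, uses squared Euclidean distance, and divides by the number of cells `n`; some implementations divide by `n - 1` instead, and the exact convention used in the benchmark is not stated here. The function name `foscttm` and the toy data are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def foscttm(Z1, Z2):
    """Fraction Of Samples Closer Than the True Match, averaged over both
    directions. Lower is better; 0.0 means every true match is the nearest
    cross-modality neighbor.
    """
    n = len(Z1)
    if n != len(Z2):
        raise ValueError("paired embeddings must have the same number of cells")

    def one_direction(A, B):
        total = 0.0
        for i in range(n):
            d_true = sq_dist(A[i], B[i])
            # Count cells of the other modality strictly closer than the match.
            closer = sum(1 for j in range(n) if sq_dist(A[i], B[j]) < d_true)
            total += closer / n
        return total / n

    return 0.5 * (one_direction(Z1, Z2) + one_direction(Z2, Z1))


# Toy 1D embeddings: well aligned versus aligned in reverse order.
Z_rna = [[0.0], [1.0], [2.0]]
Z_atac_good = [[0.1], [1.1], [2.1]]
Z_atac_bad = [[2.1], [1.1], [0.1]]

print(foscttm(Z_rna, Z_atac_good))
print(foscttm(Z_rna, Z_atac_bad))
```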

**Fig. 3. Cell embedding benchmark in gold-standard scRNA-surface protein and scRNA-scATAC datasets.**
a Schematic representation of the benchmarking process; b Purity, Transfer accuracy, Connectivity, and FOSCTTM scores for the six benchmarked methods (scConfluence, Seurat, Liger, MultiMAP, Uniport, and scGLUE) in two scRNA-scATAC datasets profiled from PBMC and bone marrow. Error bars in the plots specify the standard deviation across n = 5 random initialization seeds for each method and they are centered on the median result. Inside bar plots, small dark stars represent individual seed results. Source data are provided as a Source Data file; c UMAP visualizations of scConfluence’s cell embeddings in the same datasets as (b). Cells are colored based on their modality of origin, their cell type annotation, or their batch of origin (when multiple batches are present in the data), respectively; d Same scores and methods as (b), but computed on the two scRNA-surface protein datasets of the benchmark profiled from bone marrow. Error bars in the plots specify the standard deviation across n = 5 random initialization seeds for each method and they are centered on the median result. Inside bar plots, small dark stars represent individual seed results. Source data are provided as a Source Data file; e UMAP visualizations of scConfluence’s cell embeddings on the two scRNA-surface protein datasets with cells colored according to the same rules as (c).

**Fig. 4. Cell embeddings and gene imputations resulting from scRNA and smFISH integration in mouse somatosensory cortex.**
a Schematic representation of the integration and imputation process; b Purity, Transfer accuracy, and Connectivity scores of the seven benchmarked methods (scConfluence, Seurat, Liger, MultiMAP, Uniport, scGLUE, and GimVI). Error bars in the plots specify the standard deviation across n = 5 random initialization seeds for each method and they are centered on the median result. Inside bar plots, small dark stars represent individual seed results. Source data are provided as a Source Data file; c UMAP visualizations of scConfluence’s cell embeddings colored by the modalities of origin and their cell type annotations; d Boxplots of average and median Spearman correlation coefficients (aSCC and mSCC) between real and imputed smFISH genes across n = 11 imputation scenarios (no statistical method was used to predetermine sample size). In the boxplots, the center line, box limits, and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively. Black dots over the boxplots correspond to individual data points. Source data are provided as a Source Data file; e Spatial pattern of expression of scConfluence’s imputations (bottom) on three held-out smFISH genes and their ground-truth pattern of expression (top). Spearman correlations between the ground-truth and imputed counts are written at the bottom; f scConfluence’s imputed spatial pattern of expression of six scRNA genes not measured in the smFISH experiment.
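The aSCC and mSCC summaries in panel (d) can be illustrated with a short sketch. It assumes that, within one imputation scenario, a Spearman correlation is computed per held-out gene across cells and that the per-gene values are then averaged (aSCC) or their median is taken (mSCC); the exact aggregation used in the paper is not spelled out in this caption. The helper names and the toy counts are hypothetical, and genes with constant values are not handled.

```python
from statistics import mean, median


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ascc_mscc(real_by_gene, imputed_by_gene):
    """Average and median per-gene Spearman correlation for one scenario."""
    sccs = [spearman(real_by_gene[g], imputed_by_gene[g]) for g in real_by_gene]
    return mean(sccs), median(sccs)


# Toy example: measured versus imputed counts for two genes over four cells.
real = {"geneA": [1, 2, 3, 4], "geneB": [5, 1, 0, 2]}
imputed = {"geneA": [0.9, 2.5, 2.7, 4.2], "geneB": [3.0, 1.5, 0.2, 1.0]}
print(ascc_mscc(real, imputed))
```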

**Fig. 5. Tri-omics integration and subclustering of PBMC data.**
a Schematic representation of the integration; b UMAP visualization of all the integrated cell embeddings colored by their modality of origin; c–e UMAP visualization of scConfluence’s integrated cell embeddings plotted one modality at a time and colored by their cell type annotation of origin. The red circles highlight B cells which are already sub-annotated in scATAC and CyTOF. The blue circles highlight monocytes that are already sub-annotated in scRNA and CyTOF; f UMAP visualization of all the integrated cell embeddings colored based on inferred cluster annotations. Additional plots are provided for ATAC monocytes and RNA B cells which have been subclustered. The significance of the overlap between the marker genes obtained from scRNA and scATAC for each subcluster (Fisher’s exact test) is plotted. The dashed vertical line corresponds to FDR = 0.01. No alignment significance score is reported for cluster 6 as it only contains cells from the scATAC experiment. Source data are provided as a Source Data file; g–i Sankey diagrams displaying the comparison between cell annotations in their original publication and in our integrative analysis. Source data are provided as a Source Data file.
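The marker-overlap significance mentioned in panel (f) can be sketched as a one-sided Fisher's exact test, which is equivalent to a hypergeometric tail probability. The sketch assumes that both marker lists are drawn from a common gene universe of known size and that enrichment of the overlap is the alternative of interest; the background set, the sidedness, and the multiple-testing correction behind the FDR threshold are not specified in this caption. The function name and the toy gene sets are hypothetical.

```python
from math import comb


def overlap_pvalue(n_universe, markers_a, markers_b):
    """One-sided Fisher's exact test p-value for the overlap of two gene sets.

    Computes P(overlap >= observed) under the hypergeometric null in which
    len(markers_b) genes are drawn at random from a universe of n_universe
    genes containing len(markers_a) marker genes. Both sets are assumed to be
    subsets of that universe.
    """
    a, b = set(markers_a), set(markers_b)
    k_obs = len(a & b)
    size_a, size_b = len(a), len(b)
    denom = comb(n_universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(n_universe - size_a, size_b - k)
        for k in range(k_obs, min(size_a, size_b) + 1)
    )
    return tail / denom


# Toy example with a universe of 20 genes labeled 0..19.
rna_markers = {0, 1, 2, 3, 4}
atac_markers = {0, 1, 2, 3, 9}
print(overlap_pvalue(20, rna_markers, atac_markers))
```

When many subclusters are tested, the resulting p-values would still need a multiple-testing correction before being compared with an FDR threshold.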

**Fig. 6. Integration of scRNA-seq and neuronal morphologies in the mouse primary motor cortex.**
a Schematic representation of the integration; b UMAP visualizations of the integrated cell embeddings colored by their modality of origin, their cell type annotations, and their cortical layers of origin; c UMAP visualization of the integrated cell embeddings colored by their morphological labels which are only available for excitatory neurons. The terms ‘tufted’ and ‘untufted’ correspond to visual inspection of the neurons’ apical dendrites; some examples of neuronal morphologies are displayed next to the UMAP plot; d Pattern of expression of *Fezf2* in IT neurons. The boxplots on the left show the distribution of expression of *Fezf2* in untufted (n = 29) and tufted (n = 31) IT neurons from layer 5. The center line, box limits, and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively. Black dots over the boxplots correspond to individual data points. Source data are provided as a Source Data file. The UMAP plot of IT neurons shows the correlated pattern of variation of *Fezf2* expression (corresponding to the size of the points) and the height of apical dendrites (corresponding to the color gradient); e Heatmap representing the depth profiles of Sst neurons’ axons perpendicular to the pia. Cells have been sorted based on the depth of their soma.
