Bioinformatics. 2020 Dec 30;36(Suppl_2):i919-i927.

doi: 10.1093/bioinformatics/btaa843.

SCIM: universal single-cell matching with unpaired feature sets

Stefan G Stark^{1,2,3}, Joanna Ficek^{1,2,3,4}, Francesco Locatello^{1,5,6}, Ximena Bonilla^{1,2,3}, Stéphane Chevrier⁷, Franziska Singer^{2,8}; Tumor Profiler Consortium; Gunnar Rätsch^{1,2,3,6,9}, Kjong-Van Lehmann^{1,2,3}

Collaborators

Tumor Profiler Consortium:
Rudolf Aebersold, Faisal S Al-Quaddoomi, Jonas Albinus, Ilaria Alborelli, Sonali Andani, Per-Olof Attinger, Marina Bacac, Daniel Baumhoer, Beatrice Beck-Schimmer, Niko Beerenwinkel, Christian Beisel, Lara Bernasconi, Anne Bertolini, Bernd Bodenmiller, Ximena Bonilla, Ruben Casanova, Stéphane Chevrier, Natalia Chicherova, Maya D'Costa, Esther Danenberg, Natalie Davidson, Monica-Andreea Drăgan, Reinhard Dummer, Stefanie Engler, Martin Erkens, Katja Eschbach, Cinzia Esposito, André Fedier, Pedro Ferreira, Joanna Ficek, Anja L Frei, Bruno Frey, Sandra Goetze, Linda Grob, Gabriele Gut, Detlef Günther, Martina Haberecker, Pirmin Haeuptle, Viola Heinzelmann-Schwarz, Sylvia Herter, Rene Holtackers, Tamara Huesser, Anja Irmisch, Francis Jacob, Andrea Jacobs, Tim M Jaeger, Katharina Jahn, Alva R James, Philip M Jermann, André Kahles, Abdullah Kahraman, Viktor H Koelzer, Werner Kuebler, Jack Kuipers, Christian P Kunze, Christian Kurzeder, Kjong-Van Lehmann, Mitchell Levesque, Sebastian Lugert, Gerd Maass, Markus Manz, Philipp Markolin, Julien Mena, Ulrike Menzel, Julian M Metzler, Nicola Miglino, Emanuela S Milani, Holger Moch, Simone Muenst, Riccardo Murri, Charlotte Ky Ng, Stefan Nicolet, Marta Nowak, Patrick Ga Pedrioli, Lucas Pelkmans, Salvatore Piscuoglio, Michael Prummer, Mathilde Ritter, Christian Rommel, María L Rosano-González, Gunnar Rätsch, Natascha Santacroce, Jacobo Sarabia Del Castillo, Ramona Schlenker, Petra C Schwalie, Severin Schwan, Tobias Schär, Gabriela Senti, Franziska Singer, Sujana Sivapatham, Berend Snijder, Bettina Sobottka, Vipin T Sreedharan, Stefan Stark, Daniel J Stekhoven, Alexandre Pa Theocharides, Tinu M Thomas, Markus Tolnay, Vinko Tosevski, Nora C Toussaint, Mustafa A Tuncel, Marina Tusup, Audrey Van Drogen, Marcus Vetter, Tatjana Vlajnic, Sandra Weber, Walter P Weber, Rebekka Wegmann, Michael Weller, Fabian Wendt, Norbert Wey, Andreas Wicki, Bernd Wollscheid, Shuqing Yu, Johanna Ziegler, Marc Zimmermann, Martin Zoche, Gregor Zuend

Affiliations

¹ Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland.
² Swiss Institute of Bioinformatics, Quartier Sorge Bâtiment Amphipôle, 1015 Lausanne, Switzerland.
³ Life Science Zurich Graduate School, PhD Program Molecular & Translational Biomedicine, 8057 Zürich, Switzerland.
⁴ Max Planck Institute for Intelligent Systems, Empirical Inference Department, 72076 Tübingen, Germany.
⁵ Center for Learning Systems, ETH Zürich, 8092 Zürich, Switzerland.
⁶ Department of Quantitative Biomedicine, University of Zürich, 8057 Zürich, Switzerland.
⁷ University Hospital Zürich, 8091 Zürich, Switzerland.
⁸ University Hospital Zürich, 8091 Zürich, Switzerland.
⁹ Department of Biology, ETH Zürich, 8093 Zürich, Switzerland.

PMID: 33381818
PMCID: PMC7773480
DOI: 10.1093/bioinformatics/btaa843

Stefan G Stark et al. Bioinformatics. 2020.

Abstract

Motivation: Recent technological advances have led to an increase in the production and availability of single-cell data. The ability to integrate a set of multi-technology measurements would allow the identification of biologically or clinically meaningful observations through the unification of the perspectives afforded by each technology. In most cases, however, profiling technologies consume the cells they measure, so pairwise correspondences between datasets are lost. Because single-cell datasets can grow very large, scalable algorithms are needed that can universally match the measurements made on a cell in one technology to its corresponding sibling in another technology.

Results: We propose Single-Cell data Integration via Matching (SCIM), a scalable approach to recover such correspondences in two or more technologies. SCIM assumes that cells share a common (low-dimensional) underlying structure and that the underlying cell distribution is approximately constant across technologies. It constructs a technology-invariant latent space using an autoencoder framework with an adversarial objective. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations. We evaluate SCIM on a simulated cellular branching process and show that the cell-to-cell matches derived by SCIM reflect the same pseudotime on the simulated dataset. Moreover, we apply our method to two real-world scenarios, a melanoma tumor sample and a human bone marrow sample, where we pair cells from an scRNA dataset to their sibling cells in a CyTOF dataset, achieving 90% and 78% cell-matching accuracy, respectively.
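The matching step described above works on latent codes rather than on raw features. As a minimal sketch of that idea, and not the authors' implementation, the snippet below builds sparse candidate edges that a matcher could consume: for each cell of one technology it keeps the k nearest cells of the other technology in the shared latent space. The helper name `knn_candidate_edges`, the choice of Euclidean distance, and the toy 2-D latent codes are all assumptions made for illustration.

```python
import math


def knn_candidate_edges(latent_a, latent_b, k):
    """Hypothetical helper: for every cell i in latent_a, return the k
    nearest cells of latent_b as a list of (j, distance) pairs."""
    edges = {}
    for i, za in enumerate(latent_a):
        # Euclidean distance in the shared latent space (an assumption).
        dists = [(j, math.dist(za, zb)) for j, zb in enumerate(latent_b)]
        dists.sort(key=lambda pair: pair[1])
        edges[i] = dists[:k]
    return edges


# Toy latent codes for two technologies (illustrative values only).
latent_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
latent_b = [(0.1, 0.0), (0.9, 1.2), (5.2, 4.9), (9.0, 9.0)]

edges = knn_candidate_edges(latent_a, latent_b, k=2)
for i, neighbours in edges.items():
    print(i, neighbours)
```

This brute-force search is quadratic in the number of cells; it only illustrates why restricting the matcher to a few candidates per cell keeps the matching graph sparse.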

Availability and implementation: https://github.com/ratschlab/scim.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
SCIM performs a pairwise matching of cells across multiple single-cell ’omics technologies. We assume that the input of each technology comes from the same (or similar) heterogeneous cell mix, depicted on the left. Technologies generate a set of single-cell ’omics datasets (violet polygons) in parallel (e.g. *X_A*, *X_B*, *X_N*). These datasets are represented as matrices of cells-by-features, where features are specific to the profiling technology, but could be gene expression, protein levels, etc. SCIM proceeds to map cells into a technology-invariant latent space (left box) using an autoencoder framework and an adversarial term to keep technologies well integrated. Here, the latent representations capture the underlying structure in the cell mix (colored clouds) and analogous cells from different technologies (colored polygons) are placed in proximity. To integrate datasets, a fast bipartite matching scheme is applied, matching cells pairwise among datasets to cross-technology analogs, using their latent representations (right box).

**Fig. 2.**
Fast bipartite matching using a customized Minimum-Cost Maximum-Flow framework. Nodes correspond to cells with technology represented by shape, i.e. hexagons and decagons. R and S represent root and sink nodes. Edges correspond to the sparse connections between the cells, resulting from a kNN search. Edge labels indicate matching cost (first value) and edge capacity (second value). Many-to-one matches in unbalanced datasets are enabled by increasing the capacities *u_i* (for $i \in \{1, \dots, m\}$). The *null* node, colored in gray, captures matches of cells (from the bigger dataset on the left-hand side of the graph) that lack a close enough analog in the other technology. Its capacity equals the cardinality of the bigger dataset and the cost $c_{*0}$, i.e. the null match penalty, is relatively high. The thicker lines linking the nodes represent the actual matches selected by the algorithm.
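The graph in this caption can be sketched in a few dozen lines of plain Python. The block below is a toy illustration under stated assumptions, not the customized solver used by SCIM: it uses a textbook successive-shortest-paths min-cost max-flow with Bellman-Ford, and it reads the capacities *u_i* as the capacities of the edges from target cells to the sink. The names `min_cost_max_flow`, `match_cells`, `null_penalty` and `target_capacity`, and all cost values, are hypothetical.

```python
def min_cost_max_flow(n_nodes, edges, source, sink):
    """Successive shortest paths. edges: list of (u, v, capacity, cost).
    Returns (total_flow, total_cost, flow_per_edge)."""
    graph = [[] for _ in range(n_nodes)]
    arcs = []  # arc = [to, residual_capacity, cost]; arc index ^ 1 is its reverse
    for u, v, cap, cost in edges:
        graph[u].append(len(arcs)); arcs.append([v, cap, cost])
        graph[v].append(len(arcs)); arcs.append([u, 0, -cost])
    total_flow, total_cost = 0, 0.0
    while True:
        # Bellman-Ford handles the negative costs on reverse arcs.
        dist = [float("inf")] * n_nodes
        prev_arc = [-1] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):
            changed = False
            for u in range(n_nodes):
                if dist[u] == float("inf"):
                    continue
                for a in graph[u]:
                    v, cap, cost = arcs[a]
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev_arc[v], changed = dist[u] + cost, a, True
            if not changed:
                break
        if dist[sink] == float("inf"):
            break  # no augmenting path left
        push, v = float("inf"), sink
        while v != source:  # bottleneck capacity along the path
            push = min(push, arcs[prev_arc[v]][1])
            v = arcs[prev_arc[v] ^ 1][0]
        v = sink
        while v != source:  # augment along the path
            a = prev_arc[v]
            arcs[a][1] -= push
            arcs[a ^ 1][1] += push
            v = arcs[a ^ 1][0]
        total_flow += push
        total_cost += push * dist[sink]
    # Flow on each original edge equals the residual capacity of its reverse arc.
    return total_flow, total_cost, [arcs[2 * i + 1][1] for i in range(len(edges))]


def match_cells(costs, n_source, n_target, null_penalty, target_capacity=1):
    """costs: sparse {(i, j): cost}, e.g. from a kNN search.
    Returns {source cell: target cell, or None for a null match}."""
    root, src0, tgt0 = 0, 1, 1 + n_source
    null = tgt0 + n_target
    sink = null + 1
    edges = []
    for i in range(n_source):
        edges.append((root, src0 + i, 1, 0.0))
        edges.append((src0 + i, null, 1, null_penalty))  # fallback: null match
    for j in range(n_target):
        edges.append((tgt0 + j, sink, target_capacity, 0.0))
    edges.append((null, sink, n_source, 0.0))
    first_pair = len(edges)
    pairs = list(costs)
    for i, j in pairs:
        edges.append((src0 + i, tgt0 + j, 1, costs[(i, j)]))
    _, _, flows = min_cost_max_flow(sink + 1, edges, root, sink)
    matches = {i: None for i in range(n_source)}
    for (i, j), flow in zip(pairs, flows[first_pair:]):
        if flow > 0:
            matches[i] = j
    return matches


# Three source cells, two target cells, made-up costs.
costs = {(0, 0): 0.1, (0, 1): 1.5, (1, 0): 1.3, (1, 1): 0.2, (2, 0): 7.0, (2, 1): 5.6}
matches = match_cells(costs, 3, 2, null_penalty=3.0)
matches_many = match_cells(costs, 3, 2, null_penalty=10.0, target_capacity=2)
print(matches, matches_many)
```

In the first call, source cell 2 has no target cheaper than the null penalty, so it is routed to the null node. In the second call, the larger target capacity and higher penalty let it share target cell 1 with source cell 1, which is the many-to-one behavior the caption describes.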

**Fig. 3.**
Evaluation of cross-technology cell matches made by SCIM on the simulated data. The tree defining the temporal branching process underlying the simulated data is shown on the left. Cells are matched across datasets pairwise using the bipartite matching scheme and the results are depicted on the right-hand side. The results are shown as a density plot of pseudotime values across matched cells between the source technology (x-axis) and the target technology (y-axis). Cells matched to the same branch label are colored according to the branch-color scheme (accuracy: 86%), while mismatches are depicted in gray and appear mostly in the branching points. Marginal distributions of cell pseudotime for each branch are shown at the bottom (source technology) and left (target technology) of the density plot. We report a correlation of 0.83 (Spearman) and 0.86 (Pearson) for pseudotime label pairs.
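The three summary numbers in this caption (branch accuracy, Spearman and Pearson correlation of pseudotime across matched pairs) are standard quantities. The sketch below shows one way to compute them from a list of matched pairs using only the Python standard library. It is an illustration, not the authors' evaluation code; the function names and the four toy matched pairs are assumptions and do not reproduce the reported values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def branch_accuracy(source_labels, target_labels):
    """Fraction of matched pairs whose branch labels agree."""
    hits = sum(a == b for a, b in zip(source_labels, target_labels))
    return hits / len(source_labels)


# Toy matched pairs: entry k of each list belongs to the k-th match.
source_pseudotime = [0.1, 0.4, 0.5, 0.9]
target_pseudotime = [0.2, 0.3, 0.7, 0.8]
source_branch = ["A", "B", "B", "C"]
target_branch = ["A", "B", "C", "C"]

accuracy = branch_accuracy(source_branch, target_branch)
rho = spearman(source_pseudotime, target_pseudotime)
r = pearson(source_pseudotime, target_pseudotime)
print(accuracy, rho, r)
```

Spearman compares only the ordering of pseudotime values, while Pearson is sensitive to their linear relationship, which is why the two can differ on the same set of matches.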

**Fig. 4.**
Integrated latent space of three synthetic datasets. Three single-cell ’omics datasets (Source, Target A and Target B) are generated (Papadopoulos *et al.*, 2019) from a shared underlying temporal branching process (as defined in Fig. 3). The same branching process was used in all three cases, but the parameters governing their feature distributions are drawn with different seeds. Hence, their latent structure is the same, yet they share no correspondences between features. SCIM is run, fully supervised using the branch label, and all datasets are embedded into a shared latent space. tSNE embeddings (Maaten and Hinton, 2008) are computed and visualized on the combined latent representations from all three datasets. Each column shows only the cells from a single technology. In the top row, cells are colored by their branch label, as indicated on the legend. In the bottom row, the cells are colored by their pseudotime, as indicated on the color bar on the right-hand side.

**Fig. 5.**
Integrated latent space and matches of scRNA and CyTOF cells from a melanoma sample from the Tumor Profiler Consortium. Discriminators are semi-supervised using 10% of the cell-type labels. Cells are colored by their cell-type label and shaded by their technology (dark shades: CyTOF, light shades: scRNA). Matches produced by SCIM are represented by gray lines connecting cells. tSNE embeddings (Maaten and Hinton, 2008) are computed on the whole dataset and then 10 000 matched pairs are sampled at random for visualization.
