Bioinformatics. 2020 Dec 30;36(Suppl_2):i919-i927.
doi: 10.1093/bioinformatics/btaa843.

SCIM: universal single-cell matching with unpaired feature sets

Stefan G Stark et al. Bioinformatics. 2020.

Abstract

Motivation: Recent technological advances have increased the production and availability of single-cell data. Integrating a set of multi-technology measurements would allow the identification of biologically or clinically meaningful observations by unifying the perspectives afforded by each technology. In most cases, however, profiling technologies consume the cells they measure, so pairwise correspondences between datasets are lost. Because single-cell datasets can grow very large, scalable algorithms are needed that can universally match the measurements of a cell in one technology to its corresponding sibling in another technology.

Results: We propose Single-Cell data Integration via Matching (SCIM), a scalable approach to recover such correspondences across two or more technologies. SCIM assumes that cells share a common (low-dimensional) underlying structure and that the underlying cell distribution is approximately constant across technologies. It constructs a technology-invariant latent space using an autoencoder framework with an adversarial objective. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations. We evaluate SCIM on a simulated cellular branching process and show that the cell-to-cell matches derived by SCIM reflect the same pseudotime on the simulated dataset. Moreover, we apply our method to two real-world scenarios, a melanoma tumor sample and a human bone marrow sample, where we pair cells from a scRNA dataset to their sibling cells in a CyTOF dataset, achieving 90% and 78% cell-matching accuracy, respectively.

Availability and implementation: https://github.com/ratschlab/scim.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
SCIM performs a pairwise matching of cells across multiple single-cell ’omics technologies. We assume that the input of each technology comes from the same (or similar) heterogeneous cell mix, depicted on the left. Technologies generate a set of single-cell ’omics datasets (violet polygons) in parallel (e.g. X_A, X_B, X_N). These datasets are represented as cells-by-features matrices, where the features are specific to the profiling technology and could be gene expression, protein levels, etc. SCIM then maps cells into a technology-invariant latent space (left box) using an autoencoder framework and an adversarial term that keeps the technologies well integrated. Here, the latent representations capture the underlying structure in the cell mix (colored clouds), and analogous cells from different technologies (colored polygons) are placed in proximity. To integrate datasets, a fast bipartite matching scheme is applied, matching cells pairwise among datasets to cross-technology analogs using their latent representations (right box).
Fig. 2.
Fast bipartite matching using a customized Minimum-Cost Maximum-Flow framework. Nodes correspond to cells, with technology represented by shape, i.e. hexagons and decagons. R and S represent the root and sink nodes. Edges correspond to the sparse connections between the cells, resulting from a kNN search. Edge labels indicate matching cost (first value) and edge capacity (second value). Many-to-one matches in unbalanced datasets are enabled by increasing the capacities u_i (for i = 1, ..., m). The null node, colored in gray, captures matches of cells (from the bigger dataset on the left-hand side of the graph) that lack a close enough analog in the other technology. Its capacity equals the cardinality of the bigger dataset and its cost c*0, i.e. the null match penalty, is relatively high. The thicker lines linking the nodes represent the actual matches selected by the algorithm.
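The graph described in the Fig. 2 caption can be illustrated with a small, dependency-free sketch. This is not the SCIM implementation: the function names (`min_cost_flow`, `match_cells`), the parameters (`k`, `null_cost`, `target_capacity`), the use of Euclidean distance as matching cost, and the brute-force neighbor search are all assumptions made for illustration. It solves the flow problem with a plain successive-shortest-paths routine, which is only practical for toy inputs.

```python
import math
from collections import deque

def min_cost_flow(n_nodes, edges, source, sink):
    """Successive shortest paths; edges are (u, v, capacity, cost) tuples."""
    graph = [[] for _ in range(n_nodes)]
    for u, v, cap, cost in edges:
        # Each entry: [to, residual capacity, cost, index of reverse edge].
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    total_cost = 0.0
    while True:
        # Bellman-Ford with a queue: cheapest augmenting path in the residual graph.
        dist = [math.inf] * n_nodes
        prev = [None] * n_nodes
        in_queue = [False] * n_nodes
        dist[source] = 0.0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for idx, (v, cap, cost, _) in enumerate(graph[u]):
                if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                    dist[v] = dist[u] + cost
                    prev[v] = (u, idx)
                    if not in_queue[v]:
                        queue.append(v)
                        in_queue[v] = True
        if dist[sink] == math.inf:
            return graph, total_cost  # no augmenting path left: flow is maximal
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = math.inf, sink
        while v != source:
            u, idx = prev[v]
            push = min(push, graph[u][idx][1])
            v = u
        v = sink
        while v != source:
            u, idx = prev[v]
            graph[u][idx][1] -= push
            graph[v][graph[u][idx][3]][1] += push
            v = u
        total_cost += push * dist[sink]

def match_cells(source_latent, target_latent, k=2, null_cost=10.0, target_capacity=1):
    """Match each source cell to one of its k nearest target cells or to the null node."""
    n, m = len(source_latent), len(target_latent)
    root, sink, null = 0, 1, 2
    src = lambda i: 3 + i          # node id of source cell i
    tgt = lambda j: 3 + n + j      # node id of target cell j
    edges = []
    for i, x in enumerate(source_latent):
        edges.append((root, src(i), 1, 0.0))
        edges.append((src(i), null, 1, null_cost))  # fallback: null match penalty
        nearest = sorted((math.dist(x, y), j) for j, y in enumerate(target_latent))
        for d, j in nearest[:k]:                    # sparse kNN edges only
            edges.append((src(i), tgt(j), 1, d))
    for j in range(m):
        # A capacity above 1 permits many-to-one matches.
        edges.append((tgt(j), sink, target_capacity, 0.0))
    edges.append((null, sink, n, 0.0))              # null node can absorb every cell
    graph, cost = min_cost_flow(3 + n + m, edges, root, sink)
    matches = {}
    for i in range(n):
        for v, cap, _, _ in graph[src(i)]:
            if v != root and cap == 0:              # saturated outgoing edge = match
                matches[i] = None if v == null else v - 3 - n
    return matches, cost

source = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
target = [(0.1, 0.0), (1.1, 0.0)]
matches, cost = match_cells(source, target, k=2, null_cost=2.0)
print(matches, round(cost, 3))
```

In this toy input the third source cell has no nearby target, so routing it through the null node is cheaper than displacing a good match. Raising `target_capacity` would instead let several source cells share one target cell.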
Fig. 3.
Evaluation of cross-technology cell matches made by SCIM on the simulated data. The tree defining the temporal branching process underlying the simulated data is shown on the left. Cells are matched across datasets pairwise using the bipartite matching scheme, and the results are depicted on the right-hand side. The results are shown as a density plot of pseudotime values across matched cells between the source technology (x-axis) and the target technology (y-axis). Cells matched to the same branch label are colored according to the branch-color scheme (accuracy: 86%), while mismatches are depicted in gray and appear mostly at the branching points. Marginal distributions of cell pseudotime for each branch are shown at the bottom (source technology) and left (target technology) of the density plot. We report a correlation of 0.83 (Spearman) and 0.86 (Pearson) for pseudotime label pairs.
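The three quantities reported in the Fig. 3 caption (branch-label accuracy, and Pearson and Spearman correlation of pseudotime across matched pairs) can be computed as in the following minimal sketch. The helper names and the toy values are hypothetical and are not taken from the SCIM code base or its data. Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for idx in range(pos, end + 1):
            ranks[order[idx]] = shared_rank
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def branch_accuracy(source_branches, target_branches):
    """Fraction of matched pairs whose two cells carry the same branch label."""
    hits = sum(a == b for a, b in zip(source_branches, target_branches))
    return hits / len(source_branches)

# Toy matched pairs (hypothetical): one entry per source/target match.
source_branch = ["a", "b", "b", "c"]
target_branch = ["a", "b", "c", "c"]
source_pseudotime = [1.0, 2.0, 3.0, 4.0]
target_pseudotime = [1.0, 4.0, 9.0, 16.0]

print(branch_accuracy(source_branch, target_branch))
print(pearson(source_pseudotime, target_pseudotime))
print(spearman(source_pseudotime, target_pseudotime))
```

The toy pseudotimes are monotonically but not linearly related, so the rank-based Spearman value is higher than the Pearson value, which is why reporting both is informative.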
Fig. 4.
Integrated latent space of three synthetic datasets. Three single-cell ’omics datasets (Source, Target A and Target B) are generated (Papadopoulos et al., 2019) from a shared underlying temporal branching process (as defined in Fig. 3). The same branching process was used in all three cases, but the parameters governing their feature distributions are drawn with different seeds. Hence, their latent structure is the same, yet they share no correspondences between features. SCIM is run, fully supervised using the branch label, and all datasets are embedded into a shared latent space. tSNE embeddings (Maaten and Hinton, 2008) are computed and visualized on the combined latent representations from all three datasets. Each column shows only the cells from a single technology. In the top row, cells are colored by their branch label, as indicated in the legend. In the bottom row, the cells are colored by their pseudotime, as indicated on the color bar on the right-hand side.
Fig. 5.
Integrated latent space and matches of scRNA and CyTOF cells from a melanoma sample from the Tumor Profiler Consortium. Discriminators are semi-supervised using 10% of the cell-type labels. Cells are colored by their cell-type label and shaded by their technology (dark shades: CyTOF, light shades: scRNA). Matches produced by SCIM are represented by gray lines connecting cells. tSNE embeddings (Maaten and Hinton, 2008) are computed on the whole dataset and then 10 000 matched pairs are sampled at random for visualization.
