LSMMD-MA: scaling multimodal data integration for single-cell genomics data analysis

Laetitia Meng-Papaxanthos¹, Ran Zhang²,³, Gang Li²,³, Marco Cuturi⁴,⁵, William Stafford Noble²,⁶, Jean-Philippe Vert⁴,⁷

Affiliations

¹ Google Research, Brain Team, Google, Brandschenkestrasse 110, Zurich 8002, Switzerland.
² Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, United States.
³ eScience Institute, University of Washington, 3910 15th Ave NE, Seattle, WA 98195, United States.
⁴ Google Research, Brain Team, Google, 8 Rue de Londres, Paris 75009, France.
⁵ Apple ML Research, Apple, 7 Av. d'Iéna, Paris 75116, France.
⁶ Paul G. Allen School of Computer Science and Engineering, University of Washington, 185 E Stevens Way NE, Seattle, WA 98195, United States.
⁷ Owkin, Inc., 14/16 Bd Poissonnière, Paris 75009, France.

PMID: 37421399
PMCID: PMC10336029
DOI: 10.1093/bioinformatics/btad420

Bioinformatics. 2023 Jul 1;39(7):btad420.

Abstract

Motivation: Modality matching in single-cell omics data analysis, i.e. matching cells across datasets collected using different types of genomic assays, has become an important problem, because unifying perspectives across different technologies holds the promise of yielding biological and clinical discoveries. However, single-cell datasets can now contain hundreds of thousands to millions of cells, sizes that remain out of reach for most multimodal computational methods.

Results: We propose LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. In LSMMD-MA, we reformulate the MMD-MA optimization problem using linear algebra and solve it with KeOps, a CUDA framework for symbolic matrix computation in Python. We show that LSMMD-MA scales to a million cells in each modality, two orders of magnitude greater than existing implementations.
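As a rough illustration of why memory becomes the bottleneck at this scale, the sketch below computes a squared maximum mean discrepancy (the "MMD" in MMD-MA) between two sets of embedded cells without ever storing a full kernel matrix. It is not the LSMMD-MA code: it uses plain NumPy with row blocks in place of KeOps, and the Gaussian kernel, the bandwidth `sigma`, the `block` size, and the function names `gaussian_kernel_sum` and `mmd_squared` are all assumptions made for this example.

```python
import numpy as np


def gaussian_kernel_sum(a, b, sigma=1.0, block=1024):
    """Sum exp(-||a_i - b_j||^2 / (2 sigma^2)) over all pairs (i, j).

    Rows of `a` are processed in blocks, so at most a (block x len(b))
    slice of the kernel matrix is held in memory at any time.
    """
    total = 0.0
    b_sq = (b ** 2).sum(axis=1)
    for start in range(0, a.shape[0], block):
        chunk = a[start:start + block]
        # Squared Euclidean distances between the chunk and every row of b.
        sq_dist = (chunk ** 2).sum(axis=1)[:, None] + b_sq[None, :] - 2.0 * chunk @ b.T
        np.maximum(sq_dist, 0.0, out=sq_dist)  # clip tiny negative rounding errors
        total += np.exp(-sq_dist / (2.0 * sigma ** 2)).sum()
    return total


def mmd_squared(z1, z2, sigma=1.0, block=1024):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    n1, n2 = z1.shape[0], z2.shape[0]
    k11 = gaussian_kernel_sum(z1, z1, sigma, block) / (n1 * n1)
    k22 = gaussian_kernel_sum(z2, z2, sigma, block) / (n2 * n2)
    k12 = gaussian_kernel_sum(z1, z2, sigma, block) / (n1 * n2)
    return k11 + k22 - 2.0 * k12
```

With one million cells per modality, a dense kernel matrix would have 10^12 entries, which is why some form of blockwise or symbolic evaluation is needed. The loop above only mimics that idea on the CPU and says nothing about how the actual GPU implementation is organized.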

Availability and implementation: LSMMD-MA is freely available at https://github.com/google-research/large_scale_mmdma and archived at https://doi.org/10.5281/zenodo.8076311.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
(A) Fastest MMD-MA variant as a function of the number of samples and the number of features. The black region in the top right corner means that all variants ran out of memory. (B) Runtime as a function of number of cells for different implementations of MMD-MA, when the dimension p of the input data varies. The black and green dotted lines with cross markers correspond to the original implementations of MMD-MA as written by Liu *et al.* (2019) (black) and Singh *et al.* (2020) (green). The runtime for different values of p, from 100 to 10 000, is shown in Supplementary Appendix Fig. A1.

References

1. Abadi M, Agarwal A, Barham P. et al. TensorFlow: large-scale machine learning on heterogeneous systems. [Computer software]. arXiv preprint arXiv:1603.04467, 2016. https://www.tensorflow.org.
1. Cao K, Bai X, Hong Y. et al. Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 2020;36:i48–56. doi: 10.1093/bioinformatics/btaa443.
1. Cao Z-J, Gao G. Multi-omics integration and regulatory inference for unpaired single-cell data with a graph-linked unified embedding framework. Nat Biotechnol 2022;40:1458–66.
1. Charlier B, Feydy J, Glaunès JA. et al. Kernel operations on the GPU, with autodiff, without memory overflows. J Mach Learn Res 2021;22:1–6.
1. Gayoso A, Steier Z, Lopez R. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods 2021;18:272–82. doi: 10.1038/s41592-020-01050-x.
