Bioinformatics. 2023 Jul 1;39(7):btad420.
doi: 10.1093/bioinformatics/btad420.

LSMMD-MA: scaling multimodal data integration for single-cell genomics data analysis

Laetitia Meng-Papaxanthos et al. Bioinformatics. 2023.

Abstract

Motivation: Modality matching in single-cell omics data analysis (i.e. matching cells across datasets collected using different types of genomic assays) has become an important problem, because unifying perspectives across different technologies holds the promise of yielding biological and clinical discoveries. However, single-cell datasets can now reach hundreds of thousands to millions of cells, a scale that remains out of reach for most multimodal computational methods.

Results: We propose LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. In LSMMD-MA, we reformulate the MMD-MA optimization problem using linear algebra and solve it with KeOps, a CUDA framework for symbolic matrix computation in Python. We show that LSMMD-MA scales to a million cells in each modality, two orders of magnitude greater than existing implementations.
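To illustrate why memory is the bottleneck, the following is a minimal sketch of the squared maximum mean discrepancy (MMD), the quantity that gives MMD-MA its name, computed without ever storing a full kernel matrix. It is not the authors' implementation and does not use KeOps: plain NumPy row blocking stands in for symbolic matrix computation, and the function names (`gaussian_kernel_sum`, `mmd2`), the Gaussian kernel, the bandwidth `sigma`, and the block size are all assumptions made for the example.

```python
import numpy as np


def gaussian_kernel_sum(x, y, sigma=1.0, block=1024):
    """Sum of exp(-||x_i - y_j||^2 / (2 sigma^2)) over all pairs (i, j).

    Hypothetical helper: rows of x are processed in blocks, so at most
    block * len(y) kernel entries exist in memory at any time.
    """
    total = 0.0
    y_sq = np.sum(y * y, axis=1)
    for start in range(0, x.shape[0], block):
        xb = x[start:start + block]
        # Squared distances between this block of x and all of y.
        d2 = np.sum(xb * xb, axis=1)[:, None] - 2.0 * (xb @ y.T) + y_sq[None, :]
        np.maximum(d2, 0.0, out=d2)  # clip tiny negatives from rounding
        total += np.exp(-d2 / (2.0 * sigma ** 2)).sum()
    return total


def mmd2(x, y, sigma=1.0, block=1024):
    """Biased (V-statistic) estimate of the squared MMD between x and y."""
    n, m = x.shape[0], y.shape[0]
    kxx = gaussian_kernel_sum(x, x, sigma, block) / (n * n)
    kyy = gaussian_kernel_sum(y, y, sigma, block) / (m * m)
    kxy = gaussian_kernel_sum(x, y, sigma, block) / (n * m)
    return kxx + kyy - 2.0 * kxy


# Synthetic stand-ins for two embedded modalities (assumed data, not from the paper).
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 4))
b = rng.normal(size=(400, 4)) + 2.0

mmd_same = mmd2(a, a)      # identical samples
mmd_shifted = mmd2(a, b)   # samples from a shifted distribution
```

The blocked sum returns the same value as a dense computation up to floating-point rounding; only the peak memory changes. A GPU framework such as KeOps applies the same idea at a finer grain, which is presumably what allows the million-cell scale reported above.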

Availability and implementation: LSMMD-MA is freely available at https://github.com/google-research/large_scale_mmdma and archived at https://doi.org/10.5281/zenodo.8076311.

Conflict of interest statement

None declared.

Figures

Figure 1.
(A) Fastest MMD-MA variant as a function of the number of samples and the number of features. The black region in the top right corner indicates that all variants ran out of memory. (B) Runtime as a function of the number of cells for different implementations of MMD-MA, as the dimension p of the input data varies. The black and green dotted lines with cross markers correspond to the original implementations of MMD-MA as written by Liu et al. (2019) (black) and Singh et al. (2020) (green). The runtime for different values of p, from 100 to 10 000, is shown in Supplementary Appendix Fig. A1.

References

    1. Abadi M, Agarwal A, Barham P. et al. TensorFlow: large-scale machine learning on heterogeneous systems. [Computer software]. arXiv preprint arXiv:1603.04467, 2016. https://www.tensorflow.org.
    2. Cao K, Bai X, Hong Y. et al. Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics 2020;36:i48–56. doi: 10.1093/bioinformatics/btaa443.
    3. Cao Z-J, Gao G. Multi-omics integration and regulatory inference for unpaired single-cell data with a graph-linked unified embedding framework. Nat Biotechnol 2022;40:1458–66.
    4. Charlier B, Feydy J, Glaunès JA. et al. Kernel operations on the GPU, with autodiff, without memory overflows. J Mach Learn Res 2021;22:1–6.
    5. Gayoso A, Steier Z, Lopez R. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods 2021;18:272–82. doi: 10.1038/s41592-020-01050-x.