Bioinformatics. 2023 Jan 1;39(1):btac736.
doi: 10.1093/bioinformatics/btac736.

Clustering single-cell multi-omics data with MoClust

Musu Yuan et al. Bioinformatics. 2023.

Abstract

Motivation: Single-cell multi-omics sequencing techniques have developed rapidly in the past few years. Clustering analysis of single-cell multi-omics data may offer novel perspectives for dissecting cellular heterogeneity. However, multi-omics data are inherently high-dimensional and highly sparse, and they contain doublets. Moreover, representations of different omics, even from the same cell, follow diverse distributions. Without proper distribution alignment, clustering methods tend to produce less separable clusters that are easily affected by the less informative omics data.

Results: We developed MoClust, a novel joint clustering framework that can be applied to several types of single-cell multi-omics data. A selective automatic doublet detection module that identifies and filters out doublets is introduced in the pretraining stage to improve data quality. Omics-specific autoencoders are introduced to characterize the multi-omics data. A contrastive learning approach to distribution alignment is adopted to adaptively fuse the omics representations into an omics-invariant representation. This novel alignment improves the compactness and separability of clusters while accurately weighting the contribution of each omics to the clustering objective. Extensive experiments on both simulated and real multi-omics datasets demonstrated the strong alignment, doublet detection and clustering abilities of MoClust.
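The idea of weighting each omics when fusing its representation can be illustrated with a minimal sketch. It assumes each omics has already been encoded into a latent vector of the same length and that the fusion weights come from a softmax over per-omics scores. The softmax normalization, the function names `softmax` and `fuse_latents`, and the toy numbers are illustrative assumptions, not details taken from the MoClust implementation.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_latents(latents, scores):
    """Weighted linear fusion of per-omics latent vectors for one cell.

    latents: one latent vector per omics, all of the same length.
    scores:  one unnormalized fusion score per omics (a hypothetical
             stand-in for parameters a model would learn in training).
    """
    if len(latents) != len(scores):
        raise ValueError("need exactly one score per omics")
    dim = len(latents[0])
    if any(len(z) != dim for z in latents):
        raise ValueError("all latent vectors must share one dimension")
    weights = softmax(scores)
    return [sum(w * z[i] for w, z in zip(weights, latents)) for i in range(dim)]


if __name__ == "__main__":
    rna_latent = [1.0, 0.0]      # toy RNA representation (made up)
    protein_latent = [0.0, 1.0]  # toy protein representation (made up)
    # Equal scores give equal weights; a larger score pulls the fused
    # representation toward that omics.
    print(fuse_latents([rna_latent, protein_latent], [0.0, 0.0]))
    print(fuse_latents([rna_latent, protein_latent], [2.0, 0.0]))
```

In this sketch a less informative omics would simply receive a lower score, so it contributes less to the fused vector; how MoClust actually learns its weights is described in the full paper, not here.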

Availability and implementation: An implementation of MoClust is available from https://doi.org/10.5281/zenodo.7306504.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
Linear fusion may hamper clustering. (a) Without alignment, less informative omics data make the fused clusters less separable. (b) Without alignment, less informative omics data make fused clusters less compact. (c) Without alignment, accurate clusters are attainable with each omics data, but the combination may be of poor quality. (d) Existing one-to-one cluster alignment fails when one of multiple omics is indistinguishable.
Fig. 2.
Framework overview. (a) Existing multi-omics sequencing methods, grouped by the omics they can sequence. (b) Structure of the MoClust model: (i) preprocessed multi-omics data (Cao and Gao, 2022) are used as input, while the outputs are estimated posterior parameters of omics-specific statistical models (Section 2.1). (ii) A fusion layer is introduced to linearly fuse the latent features of different omics data and is guided by a contrastive learning module (Section 2.2). (iii) A Cauchy–Schwarz divergence-based clustering module (Section 2.3) and a novel automatic doublet detection module (Section 2.4) are added after the fusion layer.
Fig. 3.
MoClust integrates scRNA and protein data. (a and b) Performance of MoClust and competing methods by NMI and ARI on the real CITE-seq datasets 10X10k and 10XInhouse. (c) CITE-seq simulation experiments. All simulated data were generated by Splatter, and the performance of each method was evaluated by ARI. (d) The change in fusion weights learned by MoClust when applied to different simulated datasets. (e) Two-dimensional UMAP visualization of the latent features extracted by MoClust from the 10X10k dataset. From left to right, fused features, RNA features and protein features are shown, colored by true cell types. (f) UMAP visualization of the fused features obtained by applying MoClust to the 10X10k dataset, colored by the expression of different marker proteins.
Fig. 4.
(a and b) Performance of MoClust and competing methods by NMI and ARI on the real RNA+ATAC multi-omics datasets CellLine and 10XPBMC. (c) The change in fusion weights when clustering different subgroups of cell types. (d) The numbers of clusters estimated by SIMLR with 10 different random seeds, and the ARI of MoClust using the estimated number of clusters. (e and f) UMAP visualization and Sankey plot of the clustering results produced by MoClust on the 10XInHouse dataset. The doublet detection module is employed in the corresponding experiments.
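
The figure captions report clustering quality by ARI (adjusted Rand index). As a reminder of what that score measures, here is a minimal pure-Python sketch of the standard pair-counting definition. It is not the evaluation code used in the paper; the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    # Pair counts from the contingency table and from its two margins.
    pairs_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings trivially agree
    expected = pairs_true * pairs_pred / total_pairs  # agreement by chance
    maximum = 0.5 * (pairs_true + pairs_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (pairs_both - expected) / (maximum - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]  # toy cell-type labels (made up)
    # Cluster names are arbitrary, so a relabeled perfect clustering scores 1.
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))
    # A clustering that splits every true type evenly scores below 0.
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))
```

Because the score is corrected for chance, a random clustering scores near 0 and only a perfect match (up to relabeling) scores 1.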

References

Angermueller C. et al. (2016) Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods, 13, 229–232.
Argelaguet R. et al. (2018) Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol., 14, e8124.
Argelaguet R. et al. (2020) MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol., 21, 111.
Bian S. et al. (2018) Single-cell multiomics sequencing and analyses of human colorectal cancer. Science, 362, 1060–1063.
Cao K. et al. (2020) Unsupervised topological alignment for single-cell multi-omics integration. Bioinformatics, 36, i48–i56.
