Nat Biotechnol. 2022 May;40(5):703-710.
doi: 10.1038/s41587-021-01161-6. Epub 2022 Jan 20.

scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning

Yingxin Lin et al. Nat Biotechnol. 2022 May.

Abstract

Single-cell multiomics data continues to grow at an unprecedented pace. Although several methods have demonstrated promising results in integrating several data modalities from the same tissue, the complexity and scale of data compositions present in cell atlases still pose a challenge. Here, we present scJoint, a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semisupervised framework and uses a neural network to simultaneously train labeled and unlabeled data, allowing label transfer and joint visualization in an integrative framework. Using atlas data as well as multimodal datasets generated with ASAP-seq and CITE-seq, we demonstrate that scJoint is computationally efficient and consistently achieves substantially higher cell-type label accuracy than existing methods while providing meaningful joint visualizations. Thus, scJoint overcomes the heterogeneity of different data modalities to enable a more comprehensive understanding of cellular phenotypes.

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Fig. 1 | Overview of scJoint.
a, Overview of scJoint. The input of scJoint consists of one or more gene activity score matrices, calculated from the accessibility peak matrix of scATAC-seq, and one or more gene expression matrices, including cell-type labels, from scRNA-seq experiments. The three main steps of the method are illustrated. b, Three data collections analyzed in detail in this study: (1) mouse cell atlases; (2) multimodal data from PBMC; (3) paired data from adult mouse cerebral cortex generated by SNARE-seq. c, Computation time required by different methods to integrate scRNA-seq and scATAC-seq (top) and their label transfer accuracy (bottom, computed for methods with label transfer functionality). The benchmark datasets were subsampled from 54 cell types in the human fetal atlases, where the total number of RNA and ATAC cells ranges from 10,000 to 1,089,769. Seurat and Liger were terminated because of out-of-memory errors on datasets with 500,000 cells or more, and Conos was terminated on the 1 million cell dataset.
Fig. 2 |
Fig. 2 |. Analysis of mouse cell atlas subset data containing 19 overlapping cell types from RNA and ATAC.
a, tSNE visualization of scJoint (left column) and Seurat (right column), colored by cell types defined in Cusanovich et al. (first row) and three protocols (second row). b, Scatter plot of mean silhouette coefficients for scJoint, Liger, Seurat and Conos (left panel), where the x axis shows the mean cell-type silhouette coefficients and the y axis shows 1 – mean modality silhouette coefficients; ideal outcomes would lie in the top right corner. Boxplots of F1 scores of silhouette coefficients for scJoint, Liger, Seurat and Conos (n = 101,692) (right panel). Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. c, Accuracy rates of scJoint, Seurat and Conos using 20%, 50% and 80% of cells from scRNA-seq data as training data. Ten random subsamplings were performed for each setting to generate the variance. Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. d, Predicted cell types and their fractions of agreement with the original cell types given in Cusanovich et al. for scJoint (left panel), Seurat (middle panel) and Conos (right panel). Clearer diagonal structure indicates better agreement.
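The caption above describes an F1 score that combines the cell-type silhouette coefficient with 1 – the modality silhouette coefficient. A minimal sketch of one way to compute such a score follows. It assumes each silhouette is first rescaled from [-1, 1] to [0, 1] and that the two terms are combined as a harmonic mean; the function name `f1_silhouette` and both of these choices are illustrative assumptions, not details stated on this page.

```python
def f1_silhouette(sil_celltype, sil_modality):
    """Combine two silhouette coefficients into one F1-style score.

    Assumption: both inputs lie in [-1, 1] and are rescaled to [0, 1]
    before being combined as a harmonic mean.
    """
    ct = (sil_celltype + 1.0) / 2.0           # higher = cell types better separated
    mix = 1.0 - (sil_modality + 1.0) / 2.0    # higher = modalities better mixed
    if ct + mix == 0.0:
        return 0.0                            # both terms are zero
    return 2.0 * ct * mix / (ct + mix)


# Perfect cell-type separation and perfect modality mixing.
best = f1_silhouette(1.0, -1.0)
# Both silhouettes at zero: neither separated nor mixed.
neutral = f1_silhouette(0.0, 0.0)
```

Under these assumptions a score near 1 corresponds to the top right corner of the scatter plot described in panel b.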
Fig. 3 | Analysis of mouse cell atlas full data.
a, A 2 × 2 panel of tSNE plots generated from the top 100 dimensions of singular value decomposition of the TF-IDF transformed ATAC-seq data, colored by the original labels (top left), scJoint transferred labels (top right), Seurat transferred labels (bottom left) and Conos transferred labels (bottom right). b, Marker expressions in stromal cells and fibroblasts: Col1a1, Col1a2, Dcn and Ccdc80. The left column shows the gene activity scores of the markers in ATAC-seq data (4,352 stromal cells and 1,602 fibroblasts). The right column shows the log-transformed gene expression of the markers in stromal cells, fibroblasts and endothelial cells versus others; all cells here are taken from the FACS scRNA-seq data (n = 1,363, 2,152, 3,794 and 34,656 for stromal cells, fibroblasts, endothelial cells and others, respectively). Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. c, tSNE plot of cells originally labeled as ‘unknown’ and annotated by scJoint with probability scores greater than 0.80, colored by predicted cell types (5,931 cells). d, Heatmap of z-scores of average gene activity scores, calculated from cells aggregated by predicted cell types in ATAC. The rows indicate the top four predicted cell types by size. The columns indicate the top differentially expressed genes of the corresponding cell type in RNA.
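Panel a relies on a TF-IDF transform of the ATAC-seq peak matrix followed by singular value decomposition. The sketch below shows one common form of that pipeline on a small dense toy matrix. The exact TF-IDF variant is not given on this page, so the term-frequency and inverse-document-frequency definitions here are assumptions, as are the function name `tfidf_svd` and the toy data.

```python
import numpy as np


def tfidf_svd(peaks, n_components=100):
    """Reduce a cells x peaks accessibility matrix with TF-IDF and SVD.

    Assumed variant: TF = counts / per-cell total,
    IDF = log(1 + n_cells / per-peak total).
    """
    peaks = np.asarray(peaks, dtype=float)
    n_cells = peaks.shape[0]
    # Guard against empty cells or peaks to avoid division by zero.
    cell_totals = np.maximum(peaks.sum(axis=1, keepdims=True), 1.0)
    peak_totals = np.maximum(peaks.sum(axis=0, keepdims=True), 1.0)
    tfidf = (peaks / cell_totals) * np.log1p(n_cells / peak_totals)
    # Thin SVD; keep the leading components scaled by their singular values.
    u, s, _ = np.linalg.svd(tfidf, full_matrices=False)
    k = min(n_components, s.shape[0])
    return u[:, :k] * s[:k]


# Toy binary accessibility matrix: 30 cells, 50 peaks (synthetic data).
rng = np.random.default_rng(0)
toy = (rng.random((30, 50)) < 0.2).astype(float)
embedding = tfidf_svd(toy, n_components=10)
```

The resulting low-dimensional embedding is the kind of input that would then be passed to tSNE. For atlas-scale data a sparse matrix and a truncated solver would be needed in place of the dense `np.linalg.svd` call.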
Fig. 4 | Integration of multimodal PBMC data across biological conditions: with (stimulation) or without (control) T cell activation.
a, tSNE visualization of scJoint (first column), Seurat (second column), Conos (third column) and Liger (fourth column) of PBMC data generated from CITE-seq and ASAP-seq, colored by cell type obtained from CiteFuse and manual annotations (first row), technology (second row) and biological condition (third row). b, Barplots of cell-type silhouette coefficients for scJoint, Seurat, Conos and Liger for all cells, colored by cell type. Larger values on the x axis indicate better grouping. c, Scatter plot of mean silhouette coefficients for scJoint, Seurat, Conos and Liger (left), where the x axis denotes the mean cell-type silhouette coefficients, and the y axis denotes 1 – mean modality silhouette coefficients; ideal outcomes would lie in the top right corner. Boxplots of F1 scores of silhouette coefficients for scJoint, Liger, Seurat and Conos (n = 18,088) (right). Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range. d, Heatmaps comparing the original labels and the transferred labels of scJoint, Seurat and Conos. Clearer diagonal structure indicates better agreement. e, tSNE visualization of scJoint colored by the predicted cell types with gene expression levels of CD3D, NKG7, CD8A and CD4 in NK cells.
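The heatmaps in panel d compare original and transferred labels. A minimal sketch of how such an agreement table and an overall accuracy rate could be tabulated is given below. It assumes fractions are normalized within each original cell type; the normalization direction, the function name `agreement_fractions` and the toy labels are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def agreement_fractions(original, predicted):
    """For each original label, return the fraction of cells per predicted label."""
    counts = defaultdict(Counter)
    for orig_label, pred_label in zip(original, predicted):
        counts[orig_label][pred_label] += 1
    return {
        orig_label: {p: n / sum(row.values()) for p, n in row.items()}
        for orig_label, row in counts.items()
    }


# Toy labels (synthetic, for illustration only).
original = ["B", "B", "T", "T", "T", "NK"]
predicted = ["B", "T", "T", "T", "NK", "NK"]

fractions = agreement_fractions(original, predicted)
# Overall accuracy: share of cells whose transferred label matches the original.
accuracy = sum(o == p for o, p in zip(original, predicted)) / len(original)
```

In this layout, a clear diagonal corresponds to `fractions[label][label]` being close to 1 for every label.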
Fig. 5 | Analysis of paired gene expression and chromatin accessibility data from SNARE-seq.
a, tSNE visualization of SNARE-seq data for scJoint, Seurat (WNN), MOFA+ and scAI, colored by cell types given in Chen et al. All unpaired methods treat the RNA and ATAC parts of SNARE-seq as two separate datasets. b, Boxplots of cell-type silhouette coefficients for Seurat (WNN), scAI, MOFA+, scJoint, Seurat, Conos and Liger, colored by method (n = 9,190). Each boxplot ranges from the upper and lower quartiles with the median as the horizontal line and whiskers extend to 1.5 times the interquartile range.

