iScience. 2022 Sep 13;25(10):105123.
doi: 10.1016/j.isci.2022.105123. eCollection 2022 Oct 21.

Comprehensive multi-omics single-cell data integration reveals greater heterogeneity in the human immune system

Congmin Xu et al. iScience.

Abstract

Single-cell transcriptomics enables the definition of diverse human immune cell types across multiple tissues and disease contexts. A deeper biological understanding requires comprehensive integration of multiple single-cell omics (transcriptomic, proteomic, and cell-receptor repertoire). To improve the identification of diverse cell types and the accuracy of cell-type classification in multi-omics single-cell datasets, we developed SuPERR, a novel analysis workflow that increases the resolution and accuracy of clustering and allows the discovery of previously hidden cell subsets. In addition, SuPERR accurately removes cell doublets and prevents widespread cell-type misclassification by incorporating information from cell-surface proteins and immunoglobulin transcript counts. This approach uniquely improves the identification of heterogeneous cell types and states in the human immune system, including rare subsets of antibody-secreting cells in the bone marrow.

Keywords: Biocomputational method; Omics; Systems biology.

Conflict of interest statement

The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
SuPERR workflow. (A) Schematic overview of the experimental design. Peripheral blood and bone marrow aspirates were processed, surface-stained with barcoded antibodies, and then encapsulated with barcoded microspheres. We generated three libraries for each sample, corresponding to gene expression (GEX), cell-surface protein/antibody-derived tags (ADT), and cell-receptor repertoire (VDJ). Libraries were sequenced to a target depth, and count matrices were assembled separately for each omics data type. (B) The SuPERR workflow is composed of two main steps. In the first step, major cell lineages are manually gated by integrating information from both the ADT and V(D)J data matrices. In the second step, the manually gated cell lineages are further sub-clustered based on information from the GEX data. The V(D)J matrix can be used to further identify the diversity of heavy (VH) and light (VL) variable genes among the plasma cell clusters. PCs: plasma cells. See also Tables S1 and S2.
Figure 2
SuPERR workflow applied to peripheral blood mononuclear cells (PBMCs). (A) “Gating strategy” approach to identify major cell lineages on biaxial plots based on surface markers (ADT) and V(D)J data. Total Ig transcript: sum of Ig UMIs in the VDJ matrix. Gates for major lineages are indicated by black outlines and black text. Gates for downstream cell-identity validation are indicated by golden outlines and golden text. (B) Cross comparison between the manually gated major lineages and the final SuPERR clusters. (C) The average expression levels of surface markers (ADT) and VDJ features for the final SuPERR clusters. Only the ADT/VDJ features that were not used for sequential gating are included. All gates: all cell types defined by sequential gating. SuPERR clusters: clusters generated by clustering within each major cell type. PCs: plasma cells. See also Figures S1 and S3.
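The gating step described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the rectangular gate shape, the feature names (`CD3`, `total_ig`, `IGHG1`, `IGKC`), and every threshold are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

Cell = Dict[str, float]  # one cell: feature name -> value (hypothetical layout)


def total_ig_umis(ig_counts: Dict[str, int]) -> int:
    """Total Ig transcript for one cell: the sum of its Ig UMIs."""
    return sum(ig_counts.values())


def gate(
    cells: List[Cell],
    x: str,
    y: str,
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
) -> List[int]:
    """Return indices of cells inside a rectangular gate on a biaxial plot.

    Ranges are half-open: low <= value < high.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [
        i
        for i, cell in enumerate(cells)
        if x_lo <= cell[x] < x_hi and y_lo <= cell[y] < y_hi
    ]


# Illustrative values only; real gates are drawn by inspecting the data.
example_cells: List[Cell] = [
    {"CD3": 5.0, "total_ig": 0},
    {"CD3": 0.2, "total_ig": total_ig_umis({"IGHG1": 700, "IGKC": 500})},
    {"CD3": 0.1, "total_ig": 3},
]
ig_high_cd3_low = gate(
    example_cells, "CD3", "total_ig", (0.0, 1.0), (100.0, float("inf"))
)
```

Sequential gating would apply `gate` again to the subset selected by an earlier gate, so each lineage is defined by a chain of such selections.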
Figure 3
SuPERR workflow applied to bone marrow (BM) cells. (A) “Gating strategy” approach to identify major cell lineages on biaxial plots based on surface markers (ADT) and V(D)J data. Total Ig transcript: sum of Ig UMIs in the VDJ matrix. Gates for major lineages are indicated by black outlines and black text. Gates for downstream cell-identity validation are indicated by golden outlines and golden text. (B) Cross comparison between the manually gated major lineages and the final SuPERR clusters. (C) The average expression levels of surface markers (ADT) and VDJ features for the final SuPERR clusters. Only the ADT/VDJ features that were not used for sequential gating are included. Non-productive: 1 if a cell was labeled as non-productive in the VDJ matrix and 0 if not. Productive VDJ: 1 if a cell was labeled as productive in the VDJ matrix and 0 if not. All gates: all cell types defined by sequential gating. SuPERR clusters: clusters generated by clustering within each major cell type. PCs: plasma cells. See also Figures S2 and S3.
Figure 4
Cell-type-specific variations in gene expression. (A) “Gene count” represents the number of unique genes expressed by each cell type. Error bars in boxplots show the 95% confidence interval. (B) “UMI count” represents the total mRNA abundance expressed by each cell type. (C) “Percent of Ribosomal” represents the percentage of ribosomal gene UMI counts expressed by each cell type. The grey line shows the mean expression level for each feature in the total PBMC and BM samples. (D) Left panel: the top 30 (red points) and the top 300 (grey points) highly variable genes (HVGs) from total PBMC. The points under the red dashed line fall below the top 300 HVGs of total PBMC. Right panel: the top 30 HVGs from PBMC-derived B cells (green points) displayed with the top 300 HVGs from total PBMC (grey points). An unpaired, two-tailed Student’s t-test was used to compare the mean of each cell type with the mean of the total PBMC/BM. ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001, ∗∗∗∗p < 0.0001. Multiple-group ANOVA test for (A), (B), and (C): p < 2.2e-16. PCs: plasma cells.
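The three per-cell quantities in panels (A) to (C) can be computed from a raw count vector as in the sketch below. It is illustrative only: the function name and the dictionary layout are hypothetical, and identifying ribosomal genes by the `RPS`/`RPL` symbol prefixes is an assumed heuristic (it can also catch non-ribosomal genes that share the prefix), not the rule stated in the source.

```python
from typing import Dict, Tuple


def qc_metrics(
    counts: Dict[str, int],
    ribo_prefixes: Tuple[str, ...] = ("RPS", "RPL"),  # assumed heuristic
) -> Tuple[int, int, float]:
    """Return (gene count, UMI count, percent ribosomal) for one cell.

    counts maps gene symbol -> UMI count for that cell.
    """
    # UMI count: total UMIs observed in the cell.
    umi_count = sum(counts.values())
    # Gene count: number of genes with at least one UMI.
    gene_count = sum(1 for c in counts.values() if c > 0)
    # Percent ribosomal: share of UMIs that come from ribosomal genes.
    ribo_umis = sum(c for g, c in counts.items() if g.startswith(ribo_prefixes))
    pct_ribo = 100.0 * ribo_umis / umi_count if umi_count else 0.0
    return gene_count, umi_count, pct_ribo
```

For a cell with counts `{"RPS3": 10, "RPL5": 10, "CD3E": 5, "MS4A1": 0, "ACTB": 15}`, this gives 4 expressed genes, 40 UMIs, and 50% ribosomal UMIs.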
Figure 5
SuPERR workflow identifies four subsets of human plasma cells in the BM. (A) UMAP representation of the four bone marrow (BM) plasma cell clusters. (B) Top panel: percentage of Ig-specific transcripts (UMI) expressed in each plasma cell subset. Bottom panel: expression levels (sum) of plasma cell genes (see STAR Methods) after removing Ig-specific UMIs and re-normalizing the data matrix. Error bars in boxplots show the 95% confidence interval. (C) Expression levels of individual plasma cell genes, cell-cycle score after removing Ig-specific UMIs (see STAR Methods), and ADT. The grey line shows the mean expression level across all clusters. (D) The antibody isotypes and subclasses expressed by each plasma cell subset. (E) The connecting lines on the Circos plot describe clones shared between clusters (a clonal lineage was defined by identical V and J gene usage, identical CDR3 nucleotide length, and ≥85% homology within the CDR3 nucleotide sequence). (F) Reactome Pathway Database analysis (see STAR Methods) shows unique biological processes that define each plasma cell subset.
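The clonal-lineage rule in panel (E) can be written as a pairwise test, sketched below. The `Clonotype` record and function names are hypothetical, and reading "homology" as position-wise nucleotide identity between equal-length CDR3 sequences is an assumption; the source does not say how homology was computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clonotype:
    """Hypothetical per-cell record of the fields the rule needs."""
    v_gene: str
    j_gene: str
    cdr3_nt: str


def cdr3_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def same_clonal_lineage(
    c1: Clonotype, c2: Clonotype, min_identity: float = 0.85
) -> bool:
    """Same V gene, same J gene, same CDR3 length, CDR3 identity >= threshold."""
    if c1.v_gene != c2.v_gene or c1.j_gene != c2.j_gene:
        return False
    if len(c1.cdr3_nt) != len(c2.cdr3_nt) or not c1.cdr3_nt:
        return False
    return cdr3_identity(c1.cdr3_nt, c2.cdr3_nt) >= min_identity
```

Because the rule already requires equal CDR3 length, no alignment is needed in this sketch; a different definition of homology would change only `cdr3_identity`.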
Figure 6
Cell-doublet identification by SuPERR using both surface markers and gene expression data matrices. (A) Distribution (gene count × total UMI) of cell doublets (left) and singlets (right) detected by the SuPERR approach. The red dashed lines show the threshold used by some conventional approaches to exclude cells whose gene count and total UMI exceed the mean + 4 SD. Only the cells above the dashed lines would have been excluded from the downstream analysis in conventional approaches (i.e., plasma cells in PBMCs, highlighted in the red circle, would have been incorrectly excluded from downstream analysis). (B) The number of unique genes (left panels) and the number of total UMIs (right panels) expressed by singlets and doublets in PBMC (top panels) and BM (bottom panels). The grey line shows the mean expression level across all clusters. Error bars in boxplots show the 95% confidence interval. (C) Cell doublets identified by the SuPERR workflow and projected on a UMAP, showing that the cell doublets are spread across multiple clusters. (D) Venn diagram comparing the cell doublets identified by the SuPERR workflow and the scDblFinder pipeline. (E) Proportion of heterotypic doublets identified and classified by SuPERR in PBMC. (F) Expression level of gene signatures (see STAR Methods) of heterotypic doublets defined by SuPERR and scDblFinder to confirm their cell identities. Red points represent SuPERR-defined doublets. Green points are the cell doublets identified by both SuPERR and scDblFinder. Blue points represent scDblFinder-defined doublets that were identified as singlets by SuPERR. The immune cell types were annotated by the SuPERR workflow. See also Figures S1, S2, and S12.
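The conventional cutoff drawn as red dashed lines in panel (A) is a simple upper bound on a per-cell metric. The sketch below is illustrative, with hypothetical function names; using the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the caption does not specify which.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def upper_cutoff(values: Sequence[float], n_sd: float = 4.0) -> float:
    """Mean + n_sd * SD of a per-cell metric (needs at least two values)."""
    return mean(values) + n_sd * stdev(values)


def flag_above_cutoff(
    values: Sequence[float], n_sd: float = 4.0
) -> Tuple[float, List[int]]:
    """Return the cutoff and the indices of cells strictly above it."""
    cutoff = upper_cutoff(values, n_sd)
    return cutoff, [i for i, v in enumerate(values) if v > cutoff]
```

Applied to gene count or total UMI, such a filter removes any cell with unusually high RNA content, whether or not it is a doublet. That is the failure mode the caption points out for plasma cells in PBMCs.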
Figure 7
SuPERR identifies significant cell-type misclassifications in other commonly used approaches. (A) Red points represent the peripheral blood mononuclear cells (PBMC) that were misclassified by either the conventional approach using GEX data only (i.e., Seurat v3) or by more recent approaches using both GEX and ADT data, such as the WNN in Seurat v4 and the SNF in CiteFuse. The Cell Fidelity Statistic (CFS, see STAR Methods) reports the fraction of correctly classified cells; its complement (1 − CFS) is the fraction of misclassified cells (6.94% by Seurat v3, 5.16% by Seurat v4, 2.42% by CiteFuse). (B) Red points represent the bone marrow (BM) cells that were misclassified by Seurat v3 (5.31%), WNN/Seurat v4 (5.12%), and SNF/CiteFuse (5.15%) as determined by CFS. CFS scores show a progressive improvement in cell-type classification from Seurat v3 (GEX only) to Seurat v4 and CiteFuse, revealing higher agreement between CiteFuse and gold-standard biaxial gating of cell lineages. (C) The PBMC cluster 4 generated by the WNN method (Seurat v4) contains misclassified cells (i.e., a mixture of NK, NKT, and T cells) and was further explored using the cell-surface (ADT) markers CD56 and CD3 (left panel). The Differential Gene Expression (DGE) analysis for cluster 4 (pink circle) compared to “cleaned” NK cells (Venn diagram) shows the TRGV10 gene as a top hit. However, the TRGV10 gene is mostly expressed in CD3+ gamma-delta T cells and absent in NK cells (right panel). See also Figures S14–S17.
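As a rough illustration of how a fraction of correctly classified cells could be computed against gated lineages, consider the sketch below. It is not the CFS definition, which is given in the paper's STAR Methods and not reproduced here. The majority-vote rule (a cell counts as correct when its gated lineage matches the most common gated lineage in its cluster) and the function name are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def fidelity_fraction(
    gated_lineage: Sequence[str], cluster: Sequence[Hashable]
) -> float:
    """Fraction of cells whose gated lineage is the majority in their cluster.

    Assumed stand-in for a cell-fidelity score, not the published CFS.
    """
    if len(gated_lineage) != len(cluster) or not gated_lineage:
        raise ValueError("inputs must be non-empty and of equal length")
    per_cluster: Dict[Hashable, Counter] = defaultdict(Counter)
    for lineage, cl in zip(gated_lineage, cluster):
        per_cluster[cl][lineage] += 1
    # Cells belonging to the majority lineage of their cluster are "correct".
    correct = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return correct / len(gated_lineage)
```

The misclassified fraction is then `1 - fidelity_fraction(...)`, which matches the relationship between CFS and the percentages quoted in the caption (for example, 6.94% misclassified corresponds to a correctly classified fraction of 93.06%).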
