Cell Syst. 2024 Oct 16;15(10):982-990.e5.
doi: 10.1016/j.cels.2024.09.003. Epub 2024 Oct 3.

Automated single-cell omics end-to-end framework with data-driven batch inference


Yuan Wang et al. Cell Syst.

Abstract

To facilitate single-cell multi-omics analysis and improve reproducibility, we present single-cell pipeline for end-to-end data integration (SPEEDI), a fully automated end-to-end framework for batch inference, data integration, and cell-type labeling. SPEEDI introduces data-driven batch inference and transforms the often heterogeneous data matrices obtained from different samples into a uniformly annotated and integrated dataset. Without requiring user input, it automatically selects parameters and executes pre-processing, sample integration, and cell-type mapping. It can also perform downstream analyses of differential signals between treatment conditions and gene functional modules. SPEEDI's data-driven batch-inference method works with widely used integration and cell-typing tools. By developing data-driven batch inference, providing full end-to-end automation, and eliminating parameter selection, SPEEDI improves reproducibility and lowers the barrier to obtaining biological insight from these valuable single-cell datasets. The SPEEDI interactive web application can be accessed at https://speedi.princeton.edu/. A record of this paper's transparent peer review process is included in the supplemental information.

Keywords: batch identification; cell-type mapping; information theory; integration; scATAC-seq; scRNA-seq; single-cell genomics.

Conflict of interest statement

Declaration of interests: S.C.S. is a consultant, equity owner, and interim chief scientific officer at GNOMX Corp. Patents were filed related to this work. O.G.T. is on the advisory board of Cell Systems.

Figures

Figure 1. Schematic illustrations of potential batch reporting scenarios and SPEEDI’s solution (see also Table S1).
(A) Different batch effect scenarios. Single-cell data commonly show non-biological batch effects, in which the data from groups of samples show distinct patterns. These batch effects may result from known, experimentally recorded factors (which we refer to as experimentally recorded batches). However, the data often show non-biological batch effects that are not annotated to any known experimental factor. Whatever their cause, batch effects must be identified and corrected so that true biological effects can be detected rigorously. The limitations of experimentally recorded batch identification are illustrated schematically by showing a single-cell dataset of individual samples from 5 subjects comprising three cell types in which batch effects in the data are observed. (Data coming from the same cell type are enclosed within the dashed outlines in the SPEEDI inferred batch panel.) The observed batch effects are not adequately labeled in scenarios in which no batch information is provided (scenario 1) or the experimentally recorded batches do not correctly identify any or all of the batch effects present (scenario 2). A data-driven method that provides accurate batch identification, whether the batch effects are due to experimentally recorded factors or to unknown factors, is needed. (B) SPEEDI uses an information-theoretic approach to quantify sample distributions and identify local and global batch effects based on learned distributions. The algorithm starts with a low-resolution clustering that separates putative cell types. It then iterates through the clusters to determine whether a group of samples is significantly different from the rest of the samples and assigns batch labels locally. In cases where a subset of samples within the significant batch has already been assigned a batch label in a previous iteration, the framework further divides the significant batch by giving the previously unassigned samples a new batch label. This process is repeated until all local batches are stable, which constitutes the final global batch assignment. (C) SPEEDI provides a one-step, fully automated multiple-sample single-cell analysis pipeline that does not require any parameter selection by the user and includes a data-driven batch inference method to improve the quality of integration. The framework takes the CellRanger output and implements a workflow comprising quality control with automated parameter selection (Step 1), application of the data-driven batch inference method (Step 2), data integration (Step 3), cell type annotation (Step 4), and optional downstream differential and pathway analyses (Step 5). SPEEDI returns an annotated, integrated data matrix for single-cell RNA-seq and/or for single-cell ATAC-seq, as well as selected analyses.
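The label-splitting rule described in panel (B) can be illustrated with a minimal Python sketch. This is not SPEEDI's implementation: the function name `assign_batches`, the default batch `0` for samples that are never flagged, and the toy input are all assumptions made for illustration, and the significance test that produces the per-cluster groups is taken as given rather than implemented.

```python
def assign_batches(samples, significant_groups):
    """Assign batch labels from per-cluster groups of outlying samples.

    samples: iterable of sample IDs.
    significant_groups: one set of sample IDs per cluster, assumed to come
        from an upstream significance test that is not shown here.
    """
    labels = {}      # sample ID -> batch label
    next_label = 1
    for group in significant_groups:
        # Samples labeled in an earlier iteration keep their label.
        unassigned = sorted(s for s in group if s not in labels)
        if not unassigned:
            continue
        # The previously unassigned samples of this group form a new batch.
        for s in unassigned:
            labels[s] = next_label
        next_label += 1
    # Assumption: samples never flagged in any cluster share a default batch 0.
    for s in samples:
        labels.setdefault(s, 0)
    return labels


# Hypothetical toy input: five samples, three clusters.
samples = ["s1", "s2", "s3", "s4", "s5"]
groups = [{"s1", "s2"}, set(), {"s2", "s3"}]
batches = assign_batches(samples, groups)
print(batches)
```

In the toy input, the third cluster flags `s2` and `s3` together; because `s2` already carries a label from the first cluster, only `s3` receives a new one, which mirrors the splitting step in the caption.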
Figure 2. Benchmarking batch labeling strategies (see also Table S2).
In a public atlas-level human lung scRNA-seq dataset comprising data from three different datasets, batches were labeled using either the sample ID, the dataset ID, or the SPEEDI data-driven method. The three types of batch labels were used as input for four different batch correction and data integration packages (Harmony, Seurat CCA, Seurat RPCA, Scanorama). The results of the integration with different batch labeling approaches were compared using metrics for cell type coherence and for batch removal. (A) Evaluation of SPEEDI batch inference in preserving biological variation among distinct cell types after integration. Each dot represents a sample score for a total of 16 samples. Each score quantifies the effectiveness in preserving the integrity of different cell types within that sample. A higher score indicates that the cells of a particular cell type are more distinctly separated from cells of other types after integration. Pairwise nonparametric Wilcoxon rank sum tests were performed and p-values were Bonferroni-adjusted. (B) Evaluation of SPEEDI batch inference in eliminating batch variation among samples after integration. Each dot represents a cell type for a total of 17 types. Each score represents how effectively batches are mitigated within the associated cell type. A higher score indicates that cells of the same type, regardless of their sample origin, are better inter-mixed after integration. Pairwise nonparametric Wilcoxon rank sum tests were performed and p-values were Bonferroni-adjusted.
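The statistical comparison named in both panels (pairwise Wilcoxon rank sum tests with Bonferroni adjustment) can be sketched in plain Python as follows. The implementation uses the normal approximation without a tie correction to the variance, which is a simplification; the strategy names and score values are made-up placeholders and are not results from the paper.

```python
import math
from itertools import combinations


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank for tied values
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])                # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


# Placeholder scores for three hypothetical labeling strategies.
scores = {
    "strategy_a": [0.61, 0.58, 0.64, 0.60, 0.59],
    "strategy_b": [0.66, 0.63, 0.69, 0.65, 0.67],
    "strategy_c": [0.72, 0.75, 0.71, 0.74, 0.73],
}
pairs = list(combinations(scores, 2))
raw = [rank_sum_p(scores[a], scores[b]) for a, b in pairs]
adjusted = bonferroni(raw)
for (a, b), p in zip(pairs, adjusted):
    print(f"{a} vs {b}: adjusted p = {p:.3f}")
```

With three strategies there are three pairwise tests, so each raw p-value is multiplied by 3. For small samples, an exact test from a statistics library would be preferable to the approximation used here.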
Figure 3. SPEEDI facilitates integrative cell type identification from scRNA-seq data (see also Figures S1–5).
(A) We applied SPEEDI to a human peripheral blood mononuclear cell (PBMC) scRNA-seq study with 20 subjects. Batches were identified using either the automated data-driven batch inference method implemented in SPEEDI (n = 12) or sample IDs (n = 20). The UMAPs show the T/NK cell population before integration (see Figure S1 for UMAPs with all major PBMC cell types). (B) The UMAPs of the T/NK cell population after integration with both batch labeling strategies are shown (see Figure S2 for UMAPs with all major PBMC cell types). (C) Correspondence between the data-defined batch labels and sample IDs. (D) Score for biology preservation. For cell type coherence measures, each dot represents a sample score, and each score quantifies the effectiveness in preserving the integrity of different cell types within that sample. Pairwise nonparametric Wilcoxon rank sum tests were performed. Scores for batch removal. For batch effect removal metrics, each dot represents a cell type score, and each score represents how effectively batches are mitigated within the associated cell type. Pairwise nonparametric Wilcoxon rank sum tests were performed. *p<0.05, ***p<0.001, n.s., not significant (p>0.05). Bonferroni corrected t-test.
Figure 4. SPEEDI batch inference and framework are highly robust on multiome mouse data (see also Table S3).
(A) Same-cell scRNA-seq and scATAC-seq multiome datasets were generated from 14 wild-type female murine pituitaries. Data from each assay were integrated and annotated by the SPEEDI framework. (B) Heatmap representation of the contingency table that compares the cell type annotation of individual cells between the scRNA-seq and the scATAC-seq data after using the SPEEDI pipeline. The rows represent the cell type annotations for scATAC-seq, and the columns represent those for scRNA-seq. Each heatmap entry takes a value between 0 and 1 that indicates the fraction of overlapping barcodes (median cell subtype identification overlap = 0.96).
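A contingency table of the kind shown in panel (B) can be computed from per-barcode annotations, as in the minimal Python sketch below. The caption does not state how the overlap values are normalized, so dividing each row by its scATAC-seq label total is an assumption; the function name `overlap_table`, the barcodes, and the cell type names are hypothetical.

```python
from collections import Counter, defaultdict


def overlap_table(atac_labels, rna_labels):
    """Row-normalized contingency table of cell type annotations.

    Rows are scATAC-seq labels and columns are scRNA-seq labels. Each entry
    is the fraction of a row's shared barcodes that carry the column label.
    """
    shared = atac_labels.keys() & rna_labels.keys()
    counts = defaultdict(Counter)
    for barcode in shared:
        counts[atac_labels[barcode]][rna_labels[barcode]] += 1
    table = {}
    for row, counter in counts.items():
        total = sum(counter.values())
        table[row] = {col: n / total for col, n in counter.items()}
    return table


# Hypothetical toy annotations keyed by cell barcode.
atac = {"bc1": "typeA", "bc2": "typeA", "bc3": "typeA", "bc4": "typeB", "bc5": "typeB"}
rna = {"bc1": "typeA", "bc2": "typeA", "bc3": "typeB", "bc4": "typeB", "bc5": "typeB"}
table = overlap_table(atac, rna)
for row in sorted(table):
    print(row, {col: round(v, 2) for col, v in sorted(table[row].items())})
```

The diagonal entries of such a table give the per-cell-type agreement between the two assays, which is the quantity a median overlap summarizes.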
