Bioinformatics. 2024 Sep 2;40(9):btae537.
doi: 10.1093/bioinformatics/btae537.

SCIntRuler: guiding the integration of multiple single-cell RNA-seq datasets with a novel statistical metric

Yue Lyu et al. Bioinformatics.

Abstract

Motivation: The growing number of single-cell RNA-seq (scRNA-seq) studies highlights the potential benefits of integrating multiple datasets, such as larger sample sizes and more robust analyses. However, inherent diversity and batch discrepancies within samples or across studies continue to pose significant challenges for computational analyses. Two practical questions still lack definitive answers: Should we use a specific integration method, or simply merge the datasets for joint analysis? Among the existing data integration methods, which is most suitable for a given scenario?

Results: To fill this gap, we introduce SCIntRuler, a novel statistical metric for guiding the integration of multiple scRNA-seq datasets. SCIntRuler helps researchers make informed decisions about whether data integration is necessary and which integration method is appropriate. Our simulations and real data applications demonstrate that SCIntRuler streamlines decision-making and facilitates the analysis of diverse scRNA-seq datasets under varying contexts, thereby reducing the complexity of integrating heterogeneous scRNA-seq datasets.

Availability and implementation: Our method is implemented as an open-source R package, available on CRAN together with a user-friendly manual: https://cloud.r-project.org/web/packages/SCIntRuler/index.html.

Conflict of interest statement

None declared.

Figures

Figure 1.
A schematic overview of SCIntRuler workflow. (A) The input includes scRNA-seq data and the associated study information. (B) Cluster each individual scRNA-seq dataset into broad clusters and fine clusters using the Louvain algorithm. (C) Compute within and between (broad) cluster relative distances after randomly selecting fine clusters. (D) Perform a one-sided permutation test to examine the null hypothesis that the broad clusters between studies have significant differences if they were measured in one study. (E) Visualize the results by drawing a scatter plot with within–between cluster relative distances and permutation P-values
Figure 2.
SCIntRuler plot illustration and interpretation. Each point represents a fine cell cluster. The P-values are plotted against the within–between cluster relative distances for all clusters, categorized by data source.
Figure 3.
Simulation settings and results. (A) Simulation settings with cell type number information. (B) Bar plot with error bars showing the mean and three times the standard deviation of SCIntRuler based on 50 Monte Carlo datasets under each simulation setting. (C) Visualization of SCIntRuler results for simulation settings 1, 2, 3, and 4.
Figure 4.
Results for real data application. Results of applying SCIntRuler to various real datasets: human brain (A), breast cancer (B), mixed cancer types (C), and primary myelofibrosis pre- and post-treatment (D). The left panel shows SCIntRuler scores and different colors represent different clusters or cell types as identified in each dataset. The right panel marks cells in fine clusters with significant between-group differences (P-values > .9, negative relative distances), indicating the cells share cell type identity across conditions. Selected cells within fine clusters are marked by solid markers, and non-selected cells are represented by empty markers
Figure 5.
SCIntRuler chart for scRNA-seq data integration method selection. The SCIntRuler value inversely correlates with the amount of shared information between datasets. The less shared information there is between multiple datasets, the larger the SCIntRuler score
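
The one-sided permutation test mentioned in the Figure 1 caption can be illustrated with a minimal, generic sketch. This is not the SCIntRuler implementation: the function name, the difference-of-means statistic, and the one-dimensional inputs are assumptions chosen for illustration, and the actual package works with within and between cluster relative distances that are not specified in this abstract.

```python
import random
from statistics import mean


def one_sided_permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Generic one-sided permutation test (illustrative, hypothetical helper).

    Tests whether values in group_b tend to be larger than those in group_a,
    using the difference of means as the test statistic. This statistic is an
    assumption for the sketch, not the statistic used by SCIntRuler.
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    # Count permutations whose statistic is at least as large as the observed one.
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if stat >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated P-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy values standing in for per-cluster distances from two studies.
    study_1 = [0.1 * i for i in range(20)]
    study_2 = [5 + 0.1 * i for i in range(20)]
    stat, p = one_sided_permutation_test(study_1, study_2)
    print(f"observed difference = {stat:.3f}, one-sided P = {p:.4f}")
```

Because the test is one-sided, swapping the two groups gives a P-value near 1 rather than near 0, so a plot of P-values against a signed distance can separate both directions of difference.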
