Enabling scalable single-cell transcriptomic analysis through distributed computing with Apache spark
- PMID: 40731055
- PMCID: PMC12307815
- DOI: 10.1038/s41598-025-12897-5
Abstract
As the field of single-cell genomics continues to develop, large-scale scRNA-seq datasets have become increasingly common. Although these datasets offer tremendous potential for shedding light on the complex biology of individual cells, the sheer volume of data presents significant challenges for management and analysis. Recently, a new discipline known as "big single-cell data science" has emerged to address these challenges. Within this field, a variety of computational tools have been developed to facilitate the processing and interpretation of scRNA-seq data. However, many of these tools focus primarily on the analytical aspect and tend to overlook the growing volume of data generated by scRNA-seq experiments. In this study, we address this challenge and present a novel parallel analytical framework, scSPARKL, that leverages Apache Spark to enable the efficient analysis of single-cell transcriptomic data. scSPARKL is built on a set of staged algorithms developed to make effective use of the Apache Spark execution environment. The tool incorporates six key operations for handling single-cell big data: data reshaping, data preprocessing, cell/gene filtering, data normalization, dimensionality reduction, and clustering. By utilizing Spark's scalability, fault tolerance, and parallelism, the tool enables researchers to rapidly and accurately analyze large scRNA-seq datasets. We demonstrate the utility of our framework and algorithms through a series of experiments on real-world scRNA-seq data. Overall, our results suggest that scSPARKL is a powerful and flexible tool for the analysis of single-cell transcriptomic data, with broad applications across biology and medicine.
Keywords: Apache Spark; Big data; Normalization; Quality control; scRNA-seq.
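As a rough illustration of three of the operations named in the abstract (data reshaping, cell/gene filtering, and normalization), the sketch below works on a toy count matrix in plain Python. It is not scSPARKL's code and does not use Spark; the function names, the toy data, the filtering thresholds, and the scale factor of 10,000 are all assumptions made for illustration. The long (cell, gene, count) record layout is chosen because per-cell and per-gene aggregations of this kind are the sort of grouped operation a distributed engine can parallelize.

```python
import math
from collections import defaultdict

# Hypothetical toy count matrix: rows are genes, columns are cells (wide format).
genes = ["GeneA", "GeneB", "GeneC"]
cells = ["cell1", "cell2", "cell3"]
counts = [
    [10, 0, 0],  # GeneA
    [5, 0, 1],   # GeneB
    [0, 0, 0],   # GeneC
]


def reshape_to_long(genes, cells, counts):
    """Reshaping: wide matrix -> (cell, gene, count) records, dropping zero counts."""
    return [
        (cells[j], genes[i], counts[i][j])
        for i in range(len(genes))
        for j in range(len(cells))
        if counts[i][j] > 0
    ]


def filter_records(records, min_genes_per_cell=2, min_cells_per_gene=1):
    """Filtering: keep cells expressing enough genes and genes seen in enough cells.

    Both thresholds are illustrative assumptions, evaluated on the unfiltered records.
    """
    genes_per_cell = defaultdict(set)
    cells_per_gene = defaultdict(set)
    for cell, gene, _ in records:
        genes_per_cell[cell].add(gene)
        cells_per_gene[gene].add(cell)
    return [
        (cell, gene, n)
        for cell, gene, n in records
        if len(genes_per_cell[cell]) >= min_genes_per_cell
        and len(cells_per_gene[gene]) >= min_cells_per_gene
    ]


def normalize(records, scale=10_000):
    """Normalization: scale each cell to a common library size, then apply log1p."""
    totals = defaultdict(int)
    for cell, _, n in records:
        totals[cell] += n
    return [
        (cell, gene, math.log1p(n / totals[cell] * scale))
        for cell, gene, n in records
    ]


records = reshape_to_long(genes, cells, counts)
kept = filter_records(records)
normalized = normalize(kept)
```

With this toy input, only `cell1` passes the per-cell threshold, so `normalized` holds two records for that cell. Dimensionality reduction and clustering would follow on the normalized values and are left out of the sketch.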
© 2025. The Author(s).
Conflict of interest statement
Declarations. Competing interests: The authors declare no competing interests.