Enabling scalable single-cell transcriptomic analysis through distributed computing with Apache Spark

Asif Adil¹ ² ³, Namrata Bhattacharya⁴ ⁵, Aadam⁶, Naveed Jeelani Khan⁷, Mohammed Asger⁸

Affiliations

¹ Department of Computer Sciences, Baba Ghulam Shah Badshah University, Rajouri, India. asifadil@bgsbu.ac.in.
² Department of Pathology and Laboratory Medicine, School of Medicine, Indiana University Indianapolis, Indianapolis, IN, USA. asifadil@bgsbu.ac.in.
³ Department of Pathology and Laboratory Medicine, Indiana University Indianapolis, Indianapolis, IN, USA. asifadil@bgsbu.ac.in.
⁴ Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, New Delhi, India.
⁵ Australian Prostate Cancer Research Center, Queensland University of Technology, Brisbane, Australia.
⁶ Department of Computer Science, Luddy School of Informatics, Indiana University Indianapolis, Indianapolis, IN, USA.
⁷ Department of Computer Science and Engineering, Model Institute of Engineering and Technology, Jammu, Jammu and Kashmir, India. naveed.cse@mietjammu.in.
⁸ Department of Computer Science and Engineering, Model Institute of Engineering and Technology, Jammu, Jammu and Kashmir, India. asger.cse@mietjammu.in.

PMID: 40731055
PMCID: PMC12307815
DOI: 10.1038/s41598-025-12897-5

Asif Adil et al. Sci Rep. 2025 Jul 29;15(1):27713. doi: 10.1038/s41598-025-12897-5.

Abstract

As single-cell genomics matures, large-scale scRNA-seq datasets have become increasingly common. Although these datasets offer tremendous potential for shedding light on the complex biology of individual cells, their sheer volume presents significant challenges for management and analysis. To address these challenges, a new discipline known as "big single-cell data science" has recently emerged. Within this field, a variety of computational tools have been developed to facilitate the processing and interpretation of scRNA-seq data. However, several of these tools focus primarily on the analytical aspect and tend to overlook the growing volume of data generated by scRNA-seq experiments. In this study, we address this challenge by presenting a novel parallel analytical framework, scSPARKL, that leverages Apache Spark to enable efficient analysis of single-cell transcriptomic data. scSPARKL is built on a set of staged algorithms designed to make efficient use of the Apache Spark execution environment. The tool incorporates six key operations for dealing with single-cell Big Data: data reshaping, data preprocessing, cell/gene filtering, data normalization, dimensionality reduction, and clustering. By utilizing Spark's horizontal scalability, fault tolerance, and parallelism, the tool enables researchers to analyze large scRNA-seq datasets rapidly and accurately. We demonstrate the utility of our framework and algorithms through a series of experiments on real-world scRNA-seq data. Overall, our results suggest that scSPARKL is a powerful and flexible tool for the analysis of single-cell transcriptomic data, with broad applications across biology and medicine.
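To make the reshaping, filtering, and normalization steps named in the abstract concrete, the following is a minimal sketch in plain Python rather than Spark, so that it runs without a cluster. It is not scSPARKL's implementation: the function names (`melt`, `filter_records`, `normalize`), the thresholds, the 10,000 scale factor, and the toy matrix are all assumptions chosen for illustration. The long `(gene, cell, count)` layout is used here because record-oriented data of this kind partitions naturally across workers.

```python
import math
from collections import defaultdict


def melt(matrix, genes, cells):
    """Reshape a dense genes x cells matrix into long (gene, cell, count)
    records, dropping zero counts."""
    return [
        (gene, cell, matrix[i][j])
        for i, gene in enumerate(genes)
        for j, cell in enumerate(cells)
        if matrix[i][j] > 0
    ]


def filter_records(records, min_genes_per_cell, min_cells_per_gene):
    """Keep records whose cell expresses enough genes and whose gene is
    detected in enough cells. Both counts are taken before filtering."""
    genes_per_cell = defaultdict(int)
    cells_per_gene = defaultdict(int)
    for gene, cell, _ in records:
        genes_per_cell[cell] += 1
        cells_per_gene[gene] += 1
    return [
        (gene, cell, count)
        for gene, cell, count in records
        if genes_per_cell[cell] >= min_genes_per_cell
        and cells_per_gene[gene] >= min_cells_per_gene
    ]


def normalize(records, scale=10_000):
    """Library-size normalization followed by log1p (an assumed choice)."""
    totals = defaultdict(float)
    for _, cell, count in records:
        totals[cell] += count
    return [
        (gene, cell, math.log1p(count / totals[cell] * scale))
        for gene, cell, count in records
    ]


# Hypothetical toy data: rows are genes, columns are cells.
genes = ["g1", "g2", "g3"]
cells = ["c1", "c2", "c3"]
matrix = [
    [5, 0, 0],
    [5, 2, 0],
    [0, 2, 1],
]

records = melt(matrix, genes, cells)
kept = filter_records(records, min_genes_per_cell=2, min_cells_per_gene=2)
normalized = normalize(kept)
```

In a distributed setting, the per-cell and per-gene tallies would correspond to grouped aggregations, and the filters to joins or broadcast lookups against those aggregates.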

Keywords: Apache Spark; Big data; Normalization; Quality control; scRNA-seq.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
Apache Spark workflow. The Spark driver program acts as the master and as the entry point for all Spark jobs. The master submits jobs to the worker nodes. The cluster manager keeps track of the nodes and the jobs distributed to them; available cluster managers include Yet Another Resource Negotiator (YARN), Kubernetes, Mesos, and standalone (used in our case). The worker (slave) nodes are the machines where the tasks are actually executed, and they report back to the cluster manager.

**Fig. 2**
General workflow of the scSPARKL pipeline. A complete, step-by-step description of each module is provided in the Materials and Methods section.

**Fig. 3**
Histograms of gene and cell quality. **(a)** Mouse brain cells and **(b)** Jurkat-293T cells. These are used to decide the filtering thresholds, depending on the nature of the data as well as the biological question at hand.

**Fig. 4**
Comparison of scSPARKL with original studies. **(a)** tSNE projection of Brain Non-Myeloid data processed with SCANPY. **(b)** Corresponding tSNE visualization of Brain Non-Myeloid cells generated with scSPARKL. **(c)** Reference tSNE visualization of Jurkat-293T cell data from the original study. **(d)** tSNE visualization of Jurkat-293T cells generated with scSPARKL. Legends are shared for panels (a) and (b), and separately for panels (c) and (d).

**Fig. 5**
Performance and cluster evaluation. **(a)** Bar plot of the total time taken (in seconds) from job submission to the final output, i.e., clustering, for datasets with different cell loads. Cells were randomly sampled and chunked from the 68K PBMC data. **(b)** ARI scores of clustering results from a sequential standalone scRNA-seq pipeline and scSPARKL. **(c)** Bar plot of the time (in milliseconds) taken by each process on different data sizes. All the processes shown in the plot are exclusively Spark based, running in parallel on each core.
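For readers unfamiliar with the ARI reported in Fig. 5(b), the adjusted Rand index measures agreement between two clusterings, corrected for chance:

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\big/\binom{n}{2}}{\tfrac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\big/\binom{n}{2}}$$

where $n_{ij}$ is the number of cells assigned to cluster $i$ in one result and cluster $j$ in the other, and $a_i$, $b_j$ are the row and column totals. The sketch below computes this standard definition in plain Python. The source does not state how the authors computed their scores, so the function name and the example labels are illustrative assumptions only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two label assignments of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treated as trivial agreement

    # Pair counts from the contingency table and from its margins.
    sum_pairs = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings have one cluster
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical labels: the same partition under different cluster names,
# and a partition that cuts across the first one.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which cells are grouped together, renaming clusters does not change the score, which makes it suitable for comparing outputs of two independent pipelines.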
