Comput Struct Biotechnol J. 2025 Jul 23:27:3264-3274.

doi: 10.1016/j.csbj.2025.07.019. eCollection 2025.

PCLDA: An interpretable cell annotation tool for single-cell RNA-sequencing data based on simple statistical methods

Kailun Bai¹, Belaid Moa², Xiaojian Shao¹,³, Xuekui Zhang¹

Affiliations

¹ Department of Mathematics and Statistics, University of Victoria, Victoria BC, Canada.
² Digital Research Alliance of Canada, Victoria BC, Canada.
³ Digital Technologies Research Centre, National Research Council Canada, Ottawa ON, Canada.

PMID: 40778314
PMCID: PMC12329077
DOI: 10.1016/j.csbj.2025.07.019

Kailun Bai et al. Comput Struct Biotechnol J. 2025.

Abstract

Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, yet accurate and consistent cell-type annotation remains a crucial challenge. Numerous automated tools exist, but their complex modeling assumptions can hinder reliability across varied datasets and protocols. We propose PCLDA, a pipeline composed of three modules: t-test-based gene screening, principal component analysis (PCA), and linear discriminant analysis (LDA), all built on simple statistical methods. An ablation study shows that each module in PCLDA contributes significantly to performance and robustness, with two novel enhancements in the second module (computing PCs from the combined reference and query data, and selecting PCs by their ability to discriminate cell types rather than by explained variance) yielding substantial gains. Despite these additions, the model retains its original assumptions, computational efficiency, and interpretability. Benchmarked against nine state-of-the-art methods across 22 public scRNA-seq datasets and 35 distinct evaluation scenarios, PCLDA consistently achieves top-tier accuracy under both intra-dataset (cross-validation) and inter-dataset (cross-platform) conditions. Notably, when reference and query data are generated via different protocols, PCLDA remains stable and often outperforms more complex machine-learning approaches. Furthermore, PCLDA offers strong interpretability owing to the linear nature of its PCA and LDA modules: the final decision boundaries are linear combinations of the original gene expression values, directly reflecting the contribution of each gene to the classification. Top-weighted genes identified by PCLDA better capture biologically meaningful signals in enrichment analyses than those selected via marginal screening alone, offering deeper functional insights into cell-type specificity. In conclusion, our work underscores the utility of carefully enhanced simple statistical methods for single-cell annotation. PCLDA's simplicity, interpretability, and consistently high performance make it a practical, reliable alternative to more complex annotation pipelines. Code is available on GitHub: https://github.com/kellen8hao/PCLDA.
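
The linearity argument in the abstract can be written out explicitly. The derivation below is a sketch in notation that does not come from the paper: x is a cell's screened, log-normalized expression vector, μ is the mean used to center the data before PCA, V is the matrix whose columns are the loadings of the selected PCs, and w_k and b_k are the LDA weight vector and intercept for cell type k in PC space.

```latex
\begin{aligned}
z &= V^{\top}(x - \mu) && \text{(PC scores of the cell)} \\
\delta_k(z) &= w_k^{\top} z + b_k && \text{(LDA discriminant for cell type } k\text{)} \\
\delta_k(x) &= (V w_k)^{\top} x + \bigl(b_k - w_k^{\top} V^{\top} \mu\bigr) && \text{(the same score in gene space)}
\end{aligned}
```

Under these assumptions, entry g of the vector V w_k is the weight of gene g in the score for cell type k, and the boundary between types k and l is the hyperplane where δ_k(x) = δ_l(x). This is one plausible way to read off "top-weighted genes"; the abstract does not state the exact ranking rule PCLDA uses.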

Keywords: Cell type annotation; Interpretable machine learning; Linear discriminant analysis; Simple statistics; Single-cell genomics.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1**
**Flowchart of the PCLDA pipeline.** The pipeline consists of three modules: (1) the *Data Preprocessing Module*, where raw gene expression data undergo normalization, log-transformation, and gene screening based on t-statistics; (2) the *Integrated Supervised PCA Module*, where principal components (PCs) are computed from concatenated reference (training) and query (test) datasets using the genes selected in the preprocessing step, and top PCs are chosen based on their supervised discriminatory ability among cell types rather than solely on explained variance; and (3) the *Classification Module (LDA Module)*, where a Linear Discriminant Analysis (LDA) model is trained using the selected PCs from the training set and then applied to classify cells in the test set. Each cell is ultimately assigned to the cell type with the highest probability according to the LDA decision function.
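
The three modules in this caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the one-vs-rest Welch t-statistic, the 10^4 scaling factor, the ANOVA F-statistic used to rank PCs, and the simulated counts are all choices made here for the example.

```python
import numpy as np

def t_screen(X, y, n_genes):
    """Union of the top `n_genes` genes per cell type by |one-vs-rest Welch t| (assumed rule)."""
    keep = set()
    for c in np.unique(y):
        a, b = X[y == c], X[y != c]
        se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
        t = (a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
        keep.update(np.argsort(-np.abs(t))[:n_genes].tolist())
    return np.array(sorted(keep))

def f_statistic(Z, y):
    """One-way ANOVA F-statistic of each column of Z across the classes in y."""
    classes = np.unique(y)
    grand = Z.mean(axis=0)
    between = sum((y == c).sum() * (Z[y == c].mean(axis=0) - grand) ** 2 for c in classes)
    within = sum(((Z[y == c] - Z[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in classes)
    return (between / (len(classes) - 1)) / (within / (len(y) - len(classes)) + 1e-12)

def fit_lda(Z, y):
    """LDA with a pooled covariance; returns classes, weights W and intercepts b."""
    classes = np.unique(y)
    means = np.stack([Z[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    centered = Z - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(y) - len(classes))
    cov += 1e-6 * np.eye(cov.shape[0])            # small ridge for numerical stability
    W = np.linalg.solve(cov, means.T)             # shape (n_pcs, n_classes)
    b = -0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return classes, W, b

def lognorm(X):
    """Scale each cell to 10^4 total counts (assumed factor), then log1p."""
    return np.log1p(X / np.maximum(X.sum(axis=1, keepdims=True), 1.0) * 1e4)

def pclda_sketch(X_ref, y_ref, X_query, n_genes=10, n_pcs=10):
    # Module 1: normalize, log-transform, screen genes on the reference labels.
    R, Q = lognorm(X_ref), lognorm(X_query)
    genes = t_screen(R, y_ref, n_genes)
    R, Q = R[:, genes], Q[:, genes]
    # Module 2: PCA on concatenated reference + query cells, then keep the PCs
    # that best separate the reference cell types (not the highest-variance ones).
    both = np.vstack([R, Q])
    mu = both.mean(axis=0)
    _, _, Vt = np.linalg.svd(both - mu, full_matrices=False)
    Z_ref, Z_query = (R - mu) @ Vt.T, (Q - mu) @ Vt.T
    top = np.argsort(-f_statistic(Z_ref, y_ref))[:n_pcs]
    # Module 3: train LDA on reference PCs, assign each query cell the top-scoring type.
    classes, W, b = fit_lda(Z_ref[:, top], y_ref)
    return classes[np.argmax(Z_query[:, top] @ W + b, axis=1)]

# Simulated counts (not real data): 3 cell types, 200 genes, 10 marker genes per type.
rng = np.random.default_rng(0)

def simulate(n_per_type, depth):
    rates = np.full((3, 200), 2.0)
    for k in range(3):
        rates[k, 10 * k:10 * (k + 1)] = 20.0
    y = np.repeat(np.arange(3), n_per_type)
    return rng.poisson(rates[y] * depth).astype(float), y

X_ref, y_ref = simulate(50, depth=1.0)       # "reference protocol"
X_query, y_query = simulate(30, depth=0.5)   # shallower "query protocol"
pred = pclda_sketch(X_ref, y_ref, X_query)
accuracy = float((pred == y_query).mean())
```

Because the query labels are never used, including the query cells in the PCA step does not leak label information; it only lets the PC axes reflect variation present in both datasets.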

**Fig. 2**
**Sensitivity analysis for PCLDA in cross-platform experiments.** (A) Effect of varying the number of genes per cell type during the gene screening step on cell-type classification accuracy across multiple datasets. Tested gene counts range from 10 to 1000. (B) Effect of varying the number of principal components (PCs) on cell-type classification accuracy across multiple datasets. Tested PC numbers range from 10 to 500. In both panels, the x-axis represents the number of genes or PCs, and the y-axis shows classification accuracy.

**Fig. 3**
**Performance comparison across pipeline configurations.** Models were evaluated using all cross-platform datasets listed in Table 2. Six configurations were compared to assess the contribution of each component in the proposed PCLDA pipeline: (i) LDA only, (ii) LDA applied to gene-screened data, (iii) LDA applied to PCA dimension-reduced data, (iv) PCLDA_topPC (modified PCLDA using the conventional approach of selecting PCs by highest variance), (v) PCLDA_Ref (modified PCLDA using the conventional approach of applying PCA only to reference data), and (vi) PCLDA (our proposed pipeline). Paired Wilcoxon tests comparing PCLDA with each modified version confirm that the superior performance of PCLDA (as visually evident in the figure) is statistically significant (all p-values < 0.002).
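
For readers unfamiliar with the paired Wilcoxon test mentioned in this caption, the sketch below computes an exact two-sided signed-rank p-value using only the standard library. The function name and the accuracy values are placeholders invented for illustration; they are not results from the paper, and in practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
from itertools import product

def signed_rank_exact(a, b):
    """Exact two-sided paired Wilcoxon signed-rank test (zero differences dropped)."""
    d = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1     # average rank within the tie
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    stat = min(w_plus, total - w_plus)
    # Null distribution: every rank carries a + or - sign with probability 1/2.
    extreme = 0
    for signs in product((0, 1), repeat=len(d)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= stat:
            extreme += 1
    return stat, extreme / 2 ** len(d)

# Placeholder accuracies (percent) for ten hypothetical reference-query pairs.
full_pipeline = [95, 93, 97, 90, 92, 96, 94, 91, 98, 89]
ablated = [94, 91, 94, 86, 87, 90, 87, 83, 89, 79]
stat, p_value = signed_rank_exact(full_pipeline, ablated)
```

The test is paired because both configurations are evaluated on the same reference-query pairs, so only the per-pair differences enter the statistic. Enumerating all sign assignments is feasible only for a small number of pairs (the loop visits 2^n cases).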

**Fig. 4**
**Performance evaluation using cross-validation experiments.** (A) Heatmap comparing the accuracy of PCLDA and nine other cell-type annotation methods across 14 cross-validation experiments (in the same order as Table 1). Rows represent individual experiments, with the bottom row indicating the average accuracy of each method. Columns represent the ten methods, ordered left-to-right by their average accuracy. A boxplot above the heatmap summarizes the distribution of accuracies for each method across the 14 experiments. (B) Detailed annotation results from PCLDA on the mouse brain dataset profiled by Drop-seq, demonstrating near-perfect performance.

**Fig. 5**
**Performance evaluation using cross-platform (external validation) annotation experiments.** (A) Heatmap comparing the annotation accuracy of PCLDA with nine competing methods under cross-platform scenarios. Methods are sorted from left to right based on their average accuracy across datasets. Rows represent matched reference-query dataset pairs that differ only by sequencing protocols (listed in the same order as Table 2). The heatmap follows the format of Fig. 4, but row labels here indicate the protocol pairs (reference–query), abbreviated as follows: iD = inDrops; CL = CEL-Seq2; SM = SMARTer; SM2 = Smart-seq2; DR = Drop-seq; 10X(v2) = 10x Chromium (v2); 10X(v3) = 10x Chromium (v3); SW = Seq-Well; FC1 = Fluidigm C1. A boxplot above the heatmap summarizes the distribution of accuracies for each method across all experiments. (B) Detailed annotation performance of PCLDA on human pancreas datasets, using Smart-seq2 as the reference and CEL-Seq2 as the query.

**Fig. 6**
**Enriched GO terms for the top 100 genes of each cell type.** (A) Enriched GO terms for the top 100 genes associated with Acinar cells. (B) Enriched GO terms for the top 100 genes associated with Macrophages. (C) Enriched GO terms for the top 100 genes associated with Endothelial cells.
