Leveraging multiple labeled datasets for the automated annotation of single-cell RNA and ATAC data

Joseba Sancho-Zamora¹, Akash Kanhirodan², Xabier Garrote¹, Juan Manuel Silva Rojas¹, Olivier Gevaert³, Mikel Hernaez⁴,⁵, Guillermo Serrano¹,⁶, Idoia Ochoa¹,⁵

Affiliations

¹ Tecnun School of Engineering, Universidad de Navarra, Donostia, Spain.
² National Institute of Technology Calicut, India.
³ Stanford Center for Biomedical Informatics Research, Stanford University, California, USA.
⁴ Centro de Investigación Médica Aplicada (CIMA), Universidad de Navarra, Spain.
⁵ Instituto de Ciencia de los Datos e Inteligencia Artificial (DATAI), Universidad de Navarra, Spain.
⁶ Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Saudi Arabia.

PMID: 40687986
PMCID: PMC12270792
DOI: 10.1016/j.csbj.2025.06.043

Comput Struct Biotechnol J. 2025 Jul 1:27:2863-2870. doi: 10.1016/j.csbj.2025.06.043. eCollection 2025.

Abstract

The creation of single-cell atlases is essential for understanding cellular diversity and heterogeneity. However, assembling these atlases is challenging due to batch effects and the need for accurate and consistent cell annotation. Current methods for single-cell RNA and ATAC sequencing (scRNA-Seq and scATAC-Seq), while effective for integration, are not optimized for cell annotation. Additionally, many annotation tools rely on external databases or reference scRNA-Seq datasets, which may limit their adaptability to specific study needs, especially for rare cell-types or scATAC-Seq data. Here, we introduce JIND-Multi, a new framework designed to transfer cell-type labels across multiple annotated datasets. Notably, JIND-Multi can be applied to both scRNA-Seq and scATAC-Seq data, requiring in each case annotated data of the same type, in contrast to most methods for scATAC-Seq data, which require (paired) annotated scRNA-Seq data. In both cases, JIND-Multi significantly reduces the proportion of unclassified cells while maintaining the accuracy and performance of the original JIND model, and compares favorably with state-of-the-art methods. These results demonstrate its versatility and effectiveness across different single-cell sequencing technologies. JIND-Multi represents an improvement in cell annotation, reducing unassigned cells and offering a reliable solution for both scRNA-Seq and scATAC-Seq data. Its ability to handle multiple labeled datasets enhances the precision of annotations, making it a valuable tool for the single-cell research community. JIND-Multi is publicly available at: https://github.com/ML4BM-Lab/JIND-Multi.git.

Keywords: Cell-type annotation; Data integration; Deep learning; Neural networks; Single-cell atlases; scATAC-Seq; scRNA-Seq.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1**
Simplified JIND-Multi workflow (see also Figure S1). **Step 1.** *Initialization of the Prediction Model*: A neural network (NN)-based encoder and cell-type classifier are trained on the source dataset (colored blue). This step creates a latent space (green) suitable for cell-type classification and fixes the encoder and classifier for the remaining steps (colored grey). **Step 2.** *Model Adaptation through Data Integration*: For each additional annotated dataset, the encoder generates a latent space with batch effect (red). To remove this noise, an NN-based generator is trained for each dataset to generate a latent space (green) that can be correctly classified by the already trained classifier, and hence is aligned to that of the source. This step increases the number of samples that can be used to train the Generative Adversarial Network (GAN) of the next step. **Step 3.** *Prediction on New Unlabeled Data*. **Step 3A.** *Integration of Unlabeled Batch into Latent Space*: a GAN is trained such that the generator (G_t) produces a latent space for the target batch indistinguishable from that of the sources. **Step 3B.** *Cell-type inference*: the corrected latent space for the target batch is used as input to the trained classifier for cell-type inference.
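
The idea behind Step 2 can be illustrated with a toy example. The sketch below is not the JIND-Multi implementation: it assumes a 2-D latent space, a hand-set linear classifier standing in for the frozen classifier of Step 1, a constant shift standing in for the batch effect, and a learnable additive offset standing in for the NN-based generator. All names (`W`, `make_batch`, `fit_offset`) and all numeric values are hypothetical. Only the offset is trained, by gradient descent on the cross-entropy of the frozen classifier.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Frozen "classifier" (assumption): a fixed linear layer over a 2-D latent space.
# Class 0 sits at negative x, class 1 at positive x.
W = [[-2.0, 0.0], [2.0, 0.0]]

def classify(z):
    return softmax([w[0] * z[0] + w[1] * z[1] for w in W])

def make_batch(n, shift, rng):
    """Simulated latent codes for n labeled cells; `shift` mimics a batch effect."""
    cells = []
    for _ in range(n):
        label = rng.randrange(2)
        center = -1.0 if label == 0 else 1.0
        z = [center + rng.gauss(0, 0.3) + shift[0], rng.gauss(0, 0.3) + shift[1]]
        cells.append((z, label))
    return cells

def accuracy(cells, offset):
    hits = 0
    for z, label in cells:
        probs = classify([z[0] + offset[0], z[1] + offset[1]])
        hits += int(probs.index(max(probs)) == label)
    return hits / len(cells)

def fit_offset(cells, lr=0.1, epochs=200):
    """Train only the toy generator (an additive offset); the classifier stays frozen."""
    offset = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for z, label in cells:
            probs = classify([z[0] + offset[0], z[1] + offset[1]])
            # Gradient of cross-entropy w.r.t. the latent code: W^T (probs - onehot)
            for k in range(2):
                err = probs[k] - (1.0 if k == label else 0.0)
                grad[0] += err * W[k][0]
                grad[1] += err * W[k][1]
        offset[0] -= lr * grad[0] / len(cells)
        offset[1] -= lr * grad[1] / len(cells)
    return offset

rng = random.Random(0)
batch = make_batch(200, shift=(1.5, 0.0), rng=rng)
before = accuracy(batch, [0.0, 0.0])
offset = fit_offset(batch)
after = accuracy(batch, offset)
print(f"accuracy before correction: {before:.2f}, after: {after:.2f}")
```

Because the classifier is never updated, any gain in accuracy comes solely from moving the batch's latent codes toward the region the classifier already understands, which is the alignment effect the caption describes.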

**Fig. 2**
Comparative analysis of average accuracy and rejection rate across 10 iterations and different batch sizes on scRNA-Seq (top) and scATAC-Seq (bottom). The figure evaluates the performance of JIND (orange) and JIND-Multi (blue) across all datasets. For scRNA-Seq, we also include results for MARS (purple), and for scATAC-Seq, the comparison includes Cellcano (green), AtacAnnoR (red), MultiKano (orchid) and SANGO (golden). For all methods we indicate the percentage of cells correctly predicted, and for JIND-Multi and JIND we also include the percentage of filtered (rejected) cells. Datasets are marked with distinct shapes, and standard deviation is indicated by error bars.
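
To make the two quantities in this figure concrete, the following sketch computes a rejection rate and accuracy from per-cell class probabilities. It is an illustration only: it assumes a single global confidence threshold (the value 0.9 is arbitrary), and the function names, the `"Unassigned"` label and the example probabilities are hypothetical rather than taken from JIND or JIND-Multi.

```python
def assign_with_rejection(probabilities, labels, threshold=0.9):
    """Pick the most probable label per cell; reject cells below the threshold."""
    preds = []
    for probs in probabilities:
        best = max(range(len(probs)), key=probs.__getitem__)
        preds.append(labels[best] if probs[best] >= threshold else "Unassigned")
    return preds

def summarize(preds, truth):
    """Rejection rate plus accuracy over all cells and over assigned cells only."""
    n = len(truth)
    rejected = sum(p == "Unassigned" for p in preds)
    correct = sum(p == t for p, t in zip(preds, truth))
    assigned = n - rejected
    return {
        "rejection_rate": rejected / n,
        "accuracy_all_cells": correct / n,
        "accuracy_assigned_cells": correct / assigned if assigned else float("nan"),
    }

# Made-up classifier outputs for four cells and two cell-types.
labels = ["BEC", "SMC"]
probabilities = [[0.95, 0.05], [0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]
truth = ["BEC", "BEC", "SMC", "BEC"]

preds = assign_with_rejection(probabilities, labels)
stats = summarize(preds, truth)
print(preds)
print(stats)
```

Reporting accuracy both over all cells and over assigned cells separates two effects: a method can look accurate simply by rejecting every uncertain cell, so the rejection rate has to be read alongside the accuracy.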

**Fig. 3**
Results of the best trial with the *Brain Neurips scRNA-Seq* dataset for JIND-Multi and MARS trained on batches *C4-AD2*, and JIND trained on batch C4. A. Confusion matrices showing the prediction accuracies for the different cell-types on the target batch. JIND-Multi outperforms its predecessor JIND by significantly reducing the number of rejected cells and improving accuracy across all cell-types. MARS struggles to differentiate between various types of *Blood vascular Endothelial Cells (BEC)*, as well as between *Pericytes* and *Smooth Muscle Cells (SMC)*, which significantly impacts the overall accuracy. B. UMAP of the cells' gene expression profiles on the target batch colored by true labels and by predictions with JIND-Multi, JIND and MARS. Cells are colored by their true cell-type for correct predictions, and in black otherwise. Unassigned cells in JIND-Multi and JIND are denoted by a triangle. The findings underscore MARS's challenges in discerning between BEC subclasses and in correctly classifying SMC cells. JIND-Multi correctly classifies most SMC cells; it also encounters some difficulties between BEC subtypes, although these are less pronounced than for MARS and it improves upon JIND.
