2025 Jul 1:27:2863-2870.
doi: 10.1016/j.csbj.2025.06.043. eCollection 2025.

Leveraging multiple labeled datasets for the automated annotation of single-cell RNA and ATAC data

Joseba Sancho-Zamora et al. Comput Struct Biotechnol J.

Abstract

The creation of single-cell atlases is essential for understanding cellular diversity and heterogeneity. However, assembling these atlases is challenging due to batch effects and the need for accurate and consistent cell annotation. Current methods for single-cell RNA and ATAC sequencing (scRNA-Seq and scATAC-Seq), while effective for integration, are not optimized for cell annotation. Additionally, many annotation tools rely on external databases or reference scRNA-Seq datasets, which may limit their adaptability to specific study needs, especially for rare cell-types or scATAC-Seq data. Here, we introduce JIND-Multi, a new framework designed to transfer cell-type labels across multiple annotated datasets. Notably, JIND-Multi can be applied to both scRNA-Seq and scATAC-Seq data, requiring in each case annotated data of the same type, unlike most methods for scATAC-Seq data, which require (paired) annotated scRNA-Seq data. In both cases, JIND-Multi significantly reduces the proportion of unclassified cells while maintaining the accuracy and performance of the original JIND model, and compares favorably with state-of-the-art methods. These results demonstrate its versatility and effectiveness across different single-cell sequencing technologies. JIND-Multi represents an improvement in cell annotation, reducing unassigned cells and offering a reliable solution for both scRNA-Seq and scATAC-Seq data. Its ability to handle multiple labeled datasets enhances the precision of annotations, making it a valuable tool for the single-cell research community. JIND-Multi is publicly available at: https://github.com/ML4BM-Lab/JIND-Multi.git.

Keywords: Cell-type annotation; Data integration; Deep learning; Neural networks; Single-cell atlases; scATAC-Seq; scRNA-Seq.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1
Simplified JIND-Multi workflow (see also Figure S1). Step 1. Initialization of the Prediction Model: A neural network (NN)-based encoder and cell-type classifier are trained on the source dataset (colored blue). This step creates a latent space (green) suitable for cell-type classification and fixes the encoder and classifier for the remaining steps (colored grey). Step 2. Model Adaptation through Data Integration: For each additional annotated dataset, the encoder generates a latent space with batch effect (red). To remove this noise, an NN-based generator is trained for each dataset to generate a latent space (green) that can be correctly classified by the already trained classifier, and hence is aligned to that of the source. This step increases the number of samples that can be used to train the Generative Adversarial Network (GAN) of the next step. Step 3. Prediction on New Unlabeled Data. Step 3A. Integration of Unlabeled Batch into Latent Space: A GAN is trained such that the generator (Gt) produces a latent space for the target batch indistinguishable from that of the sources. Step 3B. Cell-type Inference: The corrected latent space for the target batch is used as input to the trained classifier for cell-type inference.
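The inference step (Step 3B) can be illustrated with a minimal sketch, assuming that the classifier emits one raw score (logit) per cell-type and that a cell is left unassigned when its top softmax probability falls below a cutoff. The function names, the "Unassigned" label, and the single global threshold of 0.8 are hypothetical choices for illustration; the caption does not specify how JIND-Multi decides which cells to reject.

```python
import math

UNASSIGNED = "Unassigned"  # hypothetical label for rejected cells


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer_cell_type(logits, cell_types, threshold=0.8):
    """Return (label, confidence) for one cell.

    The label is UNASSIGNED when the highest class probability is below
    `threshold` (an assumed rejection rule, not taken from the source).
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return UNASSIGNED, probs[best]
    return cell_types[best], probs[best]


# Toy logits for two cells over three cell-types named in the Fig. 3 caption.
cell_types = ["BEC", "Pericyte", "SMC"]
confident = infer_cell_type([4.0, 0.5, 0.2], cell_types)
ambiguous = infer_cell_type([1.0, 0.9, 0.8], cell_types)
```

In this toy case the first cell has one dominant score and receives a label, while the second has nearly equal scores and is left unassigned. Raising the threshold trades more rejected cells for fewer confident mistakes.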
Fig. 2
Comparative analysis of average accuracy and rejection rate across 10 iterations and different batch sizes on scRNA-Seq (top) and scATAC-Seq (bottom). The figure evaluates the performance of JIND (orange) and JIND-Multi (blue) across all datasets. For scRNA-Seq, we also include results for MARS (purple), and for scATAC-Seq, the comparison includes Cellcano (green), AtacAnnoR (red), MultiKano (orchid) and SANGO (golden). For all methods we indicate the percentage of cells correctly predicted, and for JIND-Multi and JIND we also include the percentage of filtered (rejected) cells. Datasets are marked with distinct shapes, and standard deviation is indicated by error bars.
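The two quantities plotted above can be computed as in the following sketch. It assumes that rejected cells carry an "Unassigned" label and that accuracy is measured over the retained (non-rejected) cells only; the caption does not state the denominator, so that choice, along with the function names and toy values, is an assumption made for illustration.

```python
from statistics import mean, stdev

UNASSIGNED = "Unassigned"  # hypothetical label for rejected cells


def accuracy_and_rejection(true_labels, predicted_labels):
    """Return (accuracy on retained cells, fraction of rejected cells)."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    kept = [(t, p) for t, p in zip(true_labels, predicted_labels)
            if p != UNASSIGNED]
    rejection_rate = 1 - len(kept) / len(true_labels)
    # Assumed convention: accuracy is 0.0 when every cell was rejected.
    accuracy = sum(t == p for t, p in kept) / len(kept) if kept else 0.0
    return accuracy, rejection_rate


def summarize(values):
    """Mean and sample standard deviation across repeated runs."""
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


# Toy example: four cells, one rejected, one misclassified.
true = ["BEC", "BEC", "SMC", "Pericyte"]
pred = ["BEC", UNASSIGNED, "SMC", "SMC"]
acc, rej = accuracy_and_rejection(true, pred)

# Aggregating made-up per-iteration accuracies, as done for the error bars.
avg_acc, sd_acc = summarize([0.9, 0.8, 1.0])
```

Reporting both numbers together matters because a method can raise its accuracy on retained cells simply by rejecting more of them.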
Fig. 3
Results of the best trial with Brain Neurips scRNA-Seq dataset for JIND-Multi and MARS trained on batches C4-AD2, and JIND trained on batch C4. A. Confusion matrices showing the prediction accuracies for the different cell-types on the target batch. JIND-Multi outperforms its predecessor JIND by significantly reducing the number of rejected cells and improving accuracy across all cell-types. MARS struggles to differentiate between various types of Blood vascular Endothelial Cells (BEC), as well as between Pericytes and Smooth Muscle Cells (SMC), which significantly impacts the overall accuracy. B. UMAP of the cells' gene expression profiles on the target batch colored by true labels and by predictions with JIND-Multi, JIND and MARS. Cells are colored by their true cell-type for correct predictions, and in black otherwise. Unassigned cells in JIND-Multi and JIND are denoted by a triangle. The findings underscore MARS's challenges in discerning between BEC subclasses and in correctly classifying SMC cells. JIND-Multi correctly classifies most SMC cells. It also has some difficulty distinguishing BEC subtypes, although less than MARS, and it improves upon JIND.
