Adv Sci (Weinh). 2025 Jul;12(25):e2500870.
doi: 10.1002/advs.202500870. Epub 2025 May 2.

scCompass: An Integrated Multi-Species scRNA-seq Database for AI-Ready

Pengfei Wang et al. Adv Sci (Weinh). 2025 Jul.

Abstract

Emerging single-cell sequencing technology has generated large amounts of data, allowing analysis of cellular dynamics and gene regulation at single-cell resolution. Advances in artificial intelligence enhance life sciences research by delivering critical insights and optimizing data analysis processes. However, inconsistent data processing quality and standards remain a major challenge. Here, scCompass is proposed as a comprehensive resource for building a large-scale, multi-species, and model-friendly single-cell data collection. By applying standardized data pre-processing, scCompass integrates and curates transcriptomic data from nearly 105 million single cells across 13 species. This extensive dataset is used to identify stable expression genes (SEGs) and organ-specific expression genes (OSGs) in humans and mice. Datasets at different scales are provided that can be easily adapted for AI model training, together with pretrained checkpoints of state-of-the-art single-cell foundation models. In summary, scCompass is a highly efficient and scalable AI-ready database that, combined with user-friendly data sharing, visualization, and online analysis, greatly simplifies data access and exploitation for researchers in single-cell biology (http://www.bdbe.cn/kun).

Keywords: AI‐ready; multi‐species; scRNA‐seq database; single‐cell.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Data curation pipeline of scCompass. a) Illustration of scCompass data collection, QC, cell type annotation, and downstream analysis. b) Cell counts for 13 species before and after QC (left panel), along with the percentage of filtered‐out cells (right panel). c) Bubble chart depicting the cell proportions and developmental origins of organs across 13 species. d–g) Sex correction in humans and mice. d) Human sex distribution before (left panel) and after correction (right panel). e) The Sankey diagram illustrates the results of sex correction for each human sample: left, original label, right, corrected label. f) Mouse sex distribution before (left panel) and after correction (right panel). g) The Sankey diagram illustrates the results of sex correction for each mouse sample: left, original label, right, corrected label.
Figure 2
Single‐cell atlas construction of human, mouse, and monkey. a) Cell QC and cell type annotation (SCimilarity and scMayoMap). b–d) Single‐cell atlas of human, mouse, and monkey with t‐SNE; colored dots represent different tissues. e–g) Cell types in the brain of human e), mouse f), and monkey g); colored dots represent different cell types. h–j) Cell types in the lung of human h), mouse i), and monkey j); colored dots represent different cell types. k) Statistics of the proportion and quantity of cell types in the brain and lung for human, mouse, and monkey. Astro, Astrocyte. Epc, Ependymal Cell. GC, Glial Cell. MG, Microglial Cell. Oligo, Oligodendrocyte. OPC, Oligodendrocyte Precursor Cell. PC, Pericyte. AT1, Type I Pneumocyte. AT2, Type II Pneumocyte. BC, Basal Cell. cDCs, Conventional Dendritic Cell. EOS, Eosinophil. EC, Endothelial Cell. FB, Fibroblast. MC, Mast Cell. Neut, Neutrophil. pDCs, Plasmacytoid Dendritic Cell. SMCs, Smooth Muscle Cell.
Figure 3
SEG analysis of human and mouse. a) The zeros per gene across human single‐cell data. b) Venn diagram of the number of genes common to SEGs and HKGs for human. c) The zeros per gene across mouse single‐cell data. d) Number of genes common to SEGs and HKGs for mouse. e) Heatmap showing the distribution of SEGs in the top 10 organs for human (left panel) and mouse (right panel). f) t‐SNE visualization showing the expression pattern of EEF1A1 of unique hSEGs and CKAP4 of unique hSEGs, respectively, in the indicated top 30 organs of human. g) t‐SNE visualization showing the expression pattern of Malat1 of unique mSEGs and Rin2 of unique mSEGs, respectively, in the indicated top 30 organs of mouse. h,i) K‐means clustering of 20 randomly sampled single‐cell data evaluated with the Calinski‐Harabasz index and silhouette score, using all expressed genes, HKGs, and SEGs identified in this study for human (hSEGs) and mouse (mSEGs). j,k) UMAP plots generated from human j) and mouse k) single‐cell data randomly selected from 20 samples using all expressed genes, HKGs, or SEGs. l) Common SEGs and HKGs between human and mouse. m) Comparison of conservation for common SEGs and HKGs in human and mouse genomes, with p values calculated using a two‐sided Wilcoxon rank‐sum test. n) Overrepresentation analysis of SEGs shared between hSEG and mSEG (common SEGs) and HKGs shared between hHKG and mHKG (common HKGs), using Gene Ontology (GO). (BM, Bone Marrow).
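The Calinski‐Harabasz index named in panels h,i) is the ratio of between‐cluster to within‐cluster dispersion, each divided by its degrees of freedom. The sketch below is a minimal pure‐Python illustration of that standard definition, not the authors' pipeline; the function name `calinski_harabasz` and the toy coordinates are hypothetical, and it assumes at least one cluster has non‐zero spread.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k)).

    points: list of equal-length numeric sequences (one per cell).
    labels: cluster label for each point (e.g. from K-means).
    Illustrative helper only; assumes within-cluster dispersion W > 0.
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")

    def centroid(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # B: size-weighted squared distance of cluster centroids to the overall centroid
    within = 0.0   # W: squared distance of every point to its own cluster centroid
    for rows in clusters.values():
        center = centroid(rows)
        between += len(rows) * sqdist(center, overall)
        within += sum(sqdist(row, center) for row in rows)

    return (between / (k - 1)) / (within / (n - k))


# Toy example (made-up coordinates): two tight, well-separated clusters.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_labels = [0, 0, 1, 1]
toy_score = calinski_harabasz(toy_points, toy_labels)
```

A higher value indicates tighter, better‐separated clusters, which is how such an index can be used to compare clusterings built from different gene sets.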
Figure 4
Organ differential expression gene analysis of human and mouse. a,b) Log values of differentially expressed genes in various organs of human and mouse. c,d) GRN constructed with OSGs of brain (left panel), heart (middle panel), and liver (right panel). Orange nodes represent the TFs common to human and mouse, yellow nodes represent the TFs distinct to human or mouse, and blue nodes represent the OSGs. e,f) GO enrichment related to cellular functions of predicted target genes in brain, heart, and lung. Dot color represents the significance level of the enrichment analysis and dot size represents the count of target genes classified under each GO term. p values were calculated using a hypergeometric test. Multiple comparisons adjustment was performed using the Benjamini and Hochberg method.
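The caption names two standard statistical steps: a hypergeometric test for enrichment and Benjamini and Hochberg adjustment for multiple comparisons. The following is a minimal standard‐library sketch of those two textbook procedures under assumed inputs; the function names and the toy counts are hypothetical and are not taken from the scCompass code.

```python
from math import comb


def hypergeom_enrichment_p(k, M, n, N):
    """One-sided enrichment p value P(X >= k).

    M: total genes in the background, n: background genes annotated with the
    GO term, N: genes in the target list, k: target genes with the term.
    """
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(pvalues)
    # Walk from the largest p value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example (made-up counts): three GO terms tested against one target list.
background_size, target_size = 1000, 50
term_sizes = [40, 100, 10]   # annotated genes in the background per term
term_hits = [8, 6, 1]        # annotated genes found in the target list per term
raw_p = [
    hypergeom_enrichment_p(k, background_size, n, target_size)
    for k, n in zip(term_hits, term_sizes)
]
adj_p = benjamini_hochberg(raw_p)
```

Adjusted values are never smaller than the raw p values and preserve their ordering, so a term that ranks as most significant before adjustment still does afterwards.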
Figure 5
Evaluation of scCompass AI‐Ready adaptability. a) A Pareto chart illustrates the gene expression profiles of cells sampled from scCompass and CELLxGENE. b) Accuracy of GeneCompass models trained on various scales of single‐cell samples, evaluated on hMS, hLung, hLiver, mBrain, mPancreas, and mLung for cell type annotation. In the line plots, the green line with dots represents pretraining with scCompass data, while the blue line with dots represents pretraining with CELLxGENE data. c) Precision and recall evaluation of cell type annotation tasks for different models (scGPT, Geneformer, GeneCompass) on 5 million sampled human cells.
Figure 6
Illustration of the different modules in the scCompass system. a) The Search module accepts diverse search entries, including sample ID, tissue, and organism. b) Interactive visualization in the Browse section for individual single‐cell sample datasets and integrated datasets of the same tissue. c) Statistical figures show the tissues with the most samples or cells across all datasets. d) The Download feature provides source code and an h5ad file for each sample. e) The Analysis section includes three online tools, scNorm, cellAnno, and sexCorrect, for data normalization, cell annotation, and sex determination, respectively. f) The Model module contains models and embeddings that have been trained or fine‐tuned with scCompass.
