Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 May 2;16(1):4114.
doi: 10.1038/s41467-025-59362-5.

Identification of leukemia-enriched signature through the development of a comprehensive pediatric single-cell atlas

Affiliations

Identification of leukemia-enriched signature through the development of a comprehensive pediatric single-cell atlas

Hope L Mumme et al. Nat Commun. .

Abstract

Single-cell transcriptome profiling enables unparalleled characterization of the heterogeneous microenvironment of pediatric leukemias. To facilitate comparative analyses and generate pediatric leukemia signatures, we collect, process, and annotate single-cell data comprising over 540,000 cells from 159 different pediatric acute leukemia (myeloid, lymphoid, mixed phenotype lineages) and healthy bone marrow (BM) samples, profiled in our lab and curated from publicly available studies. The analysis identifies a leukemia-enriched signature of nine genes with over-expression in leukemic blast compared to healthy BM cells. This signature is also consistently over-expressed in leukemia samples compared to normal BM in bulk RNA-seq datasets (over 2000 samples). Outcome-based analysis on diagnosis samples using measurable residual disease (MRD) status depicts a significant association of oncogene-induced senescence and g-protein activation pathways with MRD positivity. MRD positivity across pediatric leukemias is also correlated with significant depletion of CD8+ and CD4+ naïve T-cells and M1-macrophages at diagnosis. To enable easy access to this comprehensive pediatric leukemia single-cell atlas, we develop the Pediatric Single-cell Cancer Atlas (PedSCAtlas, https://bhasinlab.bmi.emory.edu/PediatricSCAtlas/ ). The atlas allows for quick exploration of single-cell data based on genes, cell type composition, and clinical outcomes to understand the cellular landscape of pediatric leukemias.

PubMed Disclaimer

Conflict of interest statement

Competing interests: MBh serves on the board of Canomiks Inc. as chief scientific advisor and has equity in it. DKG and DD hold equity in Meryx Inc. SSB serves as CEO of Anxomics LLC and has equity in it. The remaining authors declare no other competing interests.

Figures

Fig. 1
Fig. 1. Pediatric leukemia atlas description and signature discovery workflow.
a Dot plot showing immune cell canonical marker expression in the leukemia dataset (n = 231,883 cells). For each annotated cell type, a corresponding dot color shows the level of expression (blue: low, red: high) and dot size indicating the percent of cells with expression of the target gene. b Proportion of cell types across different leukemias (AML, B-ALL, B/My MPAL, Healthy BM, T-ALL, and T/My MPAL). The different cell types are color coded as indicated in the legend below. c UMAP plots showing the split view of cell clusters from Healthy, AML, B-ALL, B/My MPAL, T-ALL, and T/My MPAL Bone marrow samples. The cell types are colored using the same colors as shown in panel b, with nonblast, and non-progenitor cell types lassoed in the Healthy BM group. The blast and healthy cells are represented with different shades of red and blue colors respectively. d Leukemia type (e.g. AML), cytogenetic category (e.g. inv(16)), mutational status (e.g. CEBPA), immunophenotype category (e.g. ETP-ALL), age, and number of cells for samples and patients in the pediatric leukemia cohort. All single-cell data were generated using 10x genomics technologies. Sorting methods include immune (sorted to enrich immune cells based on CD45), non-leukemia (CD45 + CD19-CD10- for B-ALL), and lymphoblast (CD19 + CD10+ for B-ALL). e Pediatric leukemia-enriched signature discovery workflow with each analytical and filtering step along with threshold and outputs. For each filtering step, the number of genes passing the threshold are shown. AML acute myeloid leukemia; B-ALL b-cell acute lymphoblastic leukemia; B/My MPAL b-cell/myeloid mixed phenotype acute leukemia; Healthy BM healthy bone marrow, T-ALL t-cell acute lymphoblastic leukemia, T/My MPAL t-cell/myeloid mixed phenotype acute leukemia, HSC hematopoietic stem cell, GMP granulocyte-monocyte progenitor, ETP-ALL early T-cell precursor acute lymphoblastic leukemia, UMAP uniform manifold approximation and projection, FC fold change, B-H Benjamini-Hochberg. Source data are provided as a Source Data file.
Fig. 2
Fig. 2. Identification of pediatric leukemia-enriched signature and gene sets.
a Combined heatmap showing normalized, scaled expression of top significant (Bonferroni adj. p < 0.05) differentially expressed genes (DEGs) derived from comparing blast cells and healthy BM cells for each leukemia type (two-sided Wilcoxon rank sum test). Top 10 genes were identified based on log2FC values. b Heatmap showing similarity in DEGs for each leukemia’s blast cells as the Jaccard Similarity/Index, calculated as the number of intersecting genes divided by the size of the union of genes between two leukemia types. c Dot plot with gene sets which are common or specific to leukemia types that were identified by comparing enrichment scores (from gene set enrichment analysis) in blast and healthy BM cells (two-sided t-tests). The top significantly enriched (Benjamini-Hochberg adj. p < 0.05) gene sets were selected based on T-statistic. d Volcano plot showing differential expression analysis results from comparing different leukemias versus normal BM samples in bulk RNA-seq datasets (moderated two-sided t-test, limma). Genes significantly over-expressed (log2FC > 0.25, Benjamini-Hochberg adj. p < 0.05) in leukemia types are shown in red, genes significantly under-expressed (log2FC <−0.25, Benjamini-Hochberg adj. p < 0.05) are shown in blue, and non-significant genes are shown in grey. The shapes of the points represent the results of the differential expression analysis for different leukemia types. e A dot plot illustrating a summary of bulk RNA-seq DEG analysis (moderated two-sided t-test, limma) across different leukemias, with the color of the dots representing the log2FC, the shape representing whether the gene is over- (triangle) or under-expressed (circle) in leukemia compared to normal, and the Benjamini-Hochberg adjusted p-value represented as asterisks on the dots (*, < 0.05; **, <0.01; ***, <0.001; ****, <0.0001). AML acute myeloid leukemia, B-ALL b-cell acute lymphoblastic leukemia, B/My MPAL b-cell/myeloid mixed phenotype acute leukemia, T-ALL t-cell acute lymphoblastic leukemia, T/My MPAL t-cell/myeloid mixed phenotype acute leukemia, FC fold change. Source data are provided as a Source Data file.
Fig. 3
Fig. 3. Pediatric leukemia-enriched signature coverage and validation in external datasets.
a Heatmap of proportion of cells expressing the leukemia-enriched signature at diagnosis, per patient in the pediatric leukemia atlas, calculated as the proportion of blast cells in a sample with counts above zero. Additional patient information is annotated on top of the heatmap. Hierarchical clustering analysis was performed on samples (columns) and genes (rows) using Euclidean distance. b Average log2 fold change of the signature genes’ expression compared to the average expression of CD34, CD45/PTPRC, and CD74 in the Roy et al. healthy hematopoiesis dataset, per tissue type. Adult bone marrow (ABM), early fetal liver (eFL), fetal bone marrow (FBM), fetal liver (FL), and pediatric bone marrow (PBM). c Boxplots depict normalized expression in AML cells (Lambo et al.) labeled as malignant (red) or normal microenvironment (blue). Two-sided Wilcoxon rank sum test was used to compare the expression levels between malignant (n = 189,240) and normal (n = 119,292) cells. The p < 0.0001 represents where the p-value is estimated to be less than 2.2e-16. Middle bar represents the median, lower/upper hinges correspond to first/ third quartiles, the upper whiskers extend to the largest value no further than 1.5 times IQR. d Average expression of the signature genes across malignant AML cell types (Van Galen et al.) in heatmap and overall in bar-plot. e Distributions of module scores of the signature across MPAL and healthy reference cells (Granja et al.). HACD1 was not found in the count matrix for this dataset and therefore is not included in the score calculation. Categories include reference cells from healthy BM samples, healthy-like MPAL cells, and subtypes of malignant MPAL cells (Erythroid-like, Lymphoid-like, Myeloid-like, Progenitor-like, TNK-like). Bars in the violin plots represent the median. f Distribution of module score in malignant MPAL cells overall compared to healthy-like MPAL and healthy reference cells. AML acute myeloid leukemia, B-ALL b-cell acute lymphoblastic leukemia, B/My MPAL b-cell/myeloid mixed phenotype acute leukemia, T-ALL t-cell acute lymphoblastic leukemia, T/My MPAL t-cell/myeloid mixed phenotype acute leukemia, FC fold change. Source data are provided as a Source Data file.
Fig. 4
Fig. 4. Prognostic associations of leukemia signatures and functional analysis.
Multivariate survival analysis was performed to explore the independent associations of pediatric leukemia-enriched signature genes with outcomes. TARGET datasets from the Survival Genie tool were utilized for this analysis. In the TARGET-AML analysis, cytogenetic groups and age were included as covariates; for the TARGET-ALL-P2 and TARGET-ALL-P3 analyses, age was included as a covariate. When examining survival associations based on each gene’s expression, continuous log2(FPKM + 1) was used as a covariate. The Cox proportional hazards model was used to estimate hazard ratios (two-sided Wald test). a Scatter plot showing the hazard ratio (x-axis) and -log10 p-value (y-axis) of each gene’s association with survival across TARGET datasets. Each color corresponds to a different gene of the signature. b The survival area curve illustrating the estimated survival probability based on the discretized range of MYB expression values in the TARGET-AML-Legacy dataset (Cox HR = 1.50, p = 0.007). Expression ranges from low to high are shown with shades of blue and red colors, respectively. c Survival area curves showing the estimated survival probabilities based on MYB expression levels stratified based on cytogenetics categories in the AML cohort. d Survival area curve depicting the estimated survival probability based on the discretized range of PAN3 expression levels in TARGET-ALL-P2-B-cell dataset (Cox HR = 0.56, p = 1.08e-08). e Associated biological processes and interactions among genes from the signature from the literature or using Harmonizome 3.0 and KEGG databases. AML acute myeloid leukemia, B-ALL b-cell acute lymphoblastic leukemia, T-ALL t-cell acute lymphoblastic leukemia, HR hazard ratio, FPKM fragments per kilobase of transcript per million mapped reads. Source data are provided as a Source Data file.
Fig. 5
Fig. 5. Heterogeneity in leukemia patient single cell landscape at diagnosis based on MRD outcomes.
a Volcano plot showing differentially expressed genes (DEGs) in T/NK cells at diagnosis from minimal residual disease (MRD) negative versus positive patients. Two-sided Wilcoxon rank sum test used to identify genes significantly (Bonferroni adj. p < 0.05) up-regulated in MRD negative (log2FC < −0.5, blue) and MRD positive (log2FC > 0.5, red) groups. b Box plots with module scores of T-cell phenotype gene sets across cells at diagnosis from MRD negative (n = 11,285 cells) and positive (n = 2,301 cells) groups. Two-sided Wilcoxon rank sum tests were used to assess the difference in means. c Volcano plot showing DEGs in monocytes/macrophages from MRD negative versus positive patients at diagnosis. Two-sided Wilcoxon rank sum test used to identify genes significantly (Bonferroni adj. p < 0.05) up-regulated in MRD negative (log2FC < −0.5, blue) and MRD positive (log2FC > 0.5, red) groups. d Bar plots of M1-Macrophage and M2-Macrophage module scores for myeloid cells at diagnosis from MRD-positive (n = 904 cells, red) and -negative (n = 10,029 cells, blue) patients. Two-tailed Welch’s t-test with unequal variance was used for significance testing. Data are presented as mean values plus upper quartile. e Gene-set enrichment analysis (GSEA) on blast cells from MRD negative (purple) versus positive (gold) patients at diagnosis. Significantly dysregulated gene sets identified based on two-sided Wilcoxon rank sum tests (Benjamini-Hochberg adj. p < 0.01). The top 10 gene sets by T-statistic are shown for each group. f Violin plots showing enrichment scores (ES) for four of the top gene sets between blast cells at diagnosis from MRD negative (n = 49,542 cells, purple) and positive (n = 15,852 cells, gold) patients. Two-sided Wilcoxon rank sum tests are used to compare ES values between the groups. For boxplots in b and f, the middle bar represents the median, lower/upper hinges correspond to first/ third quartiles, the upper/lower whiskers extend to the largest/smaller value no further than 1.5 times IQR. In b, d, and f, p < 0.0001 represents where the p-value is estimated to be less than 2.2e-16. Dx, diagnosis; MRD, minimal residual disease; FC, fold change; GS, gene set. Source data are provided as a Source Data file.
Fig. 6
Fig. 6. An overview of the PedSCAtlas resource showing datasets and analytical tools.
Overview of single-cell datasets in the Pediatric Single Cell Cancer Atlas (PedSCAtlas) and analysis modules. The single-cell RNA seq (scRNA-seq) pediatric leukemia atlas examined in this manuscript contains 82 BM samples (n = 76 leukemia, n = 6 healthy) retrieved from the Emory repository and public repositories. These 82 samples were integrated and uniformly processed to form an annotated and normalized pediatric leukemia scRNA-seq atlas. Users can access and analyze the pediatric leukemia dataset along with other leukemia, and healthy pediatric datasets through a common query page. The resource contains 540,000 cells from 159 different pediatric leukemia (AML, ALL, MPAL) collected at the time of disease diagnosis, EOI, and relapse, as well as healthy bone marrow (BM) samples. The “Analysis: Analysis of Pediatric Leukemia Datasets” module allows users to visualize cell metadata and assess gene expression of the scRNA-seq datasets available on the platform. The “DE: Exploration of DEGs” module allows users to access our DE results in both the SC and Bulk RNA-seq datasets. Finally, the “Testing: Leukemia Marker Testing” module allows users to investigate a gene of their choice and test if the gene would be a robust marker of a specific disease group. PedSCAtlas pediatric single-cell cancer atlas, DEGs differentially expressed genes, AML acute myeloid leukemia, UMAP uniform manifold approximation and projection, FC fold change. Created in BioRender. Bhasin, M. (2025) https://BioRender.com/k81i482. Source data are provided as a Source Data file.
Fig. 7
Fig. 7. Analytical outputs using PedSCAtlas resource to explore the single-cell landscape of pediatric leukemias and healthy BM.
a Common query page showing available datasets (i.e., Pediatric Leukemia Atlas, AML Lambo, et al., Healthy Pediatric BM) along with description and overview by generating UMAP plots. Analysis page with all datasets currently available on the tool. Visualization of cell metadata in the three datasets. The cellular clusters from the datasets are colored based on cell types or other user-selected categories, including sample ID, Cluster ID, MRD status, etc. b Violin plot visualization of normalized PAN3 expression across different cell types in the pediatric leukemia atlas dataset. In the boxplots overlayed, the middle bar represents the median, lower/upper hinges correspond to first/third quartiles, the upper/lower whiskers extend to the largest/smaller value no further than 1.5 times IQR. The expression is shown for different leukemic blasts (AML: n = 106,070, B-ALL: n = 7502, B/My MPAL: n = 9085, T-ALL: n = 20,072, T/My MPAL: n = 15,671), and normal progenitor (n = 2465) cells. Users can visualize gene expression through various graphical outputs, including box plots and feature plots. These visualizations allow for the grouping of cells by multiple categories, such as cell type, leukemia type, time point, and patient. This flexibility enables detailed exploration of expression patterns across diverse subsets, facilitating insights into disease-specific trends and inter-patient variability. c Violin plot visualization of normalized PAN3 expression across different cell types within the healthy adult bone marrow dataset. This visualization highlights the distribution and variability of PAN3 expression among distinct cell normal populations. Additionally, users can explore expression using violin, box, and feature plots, with the flexibility to group cells by broad or fine-granular cell types. AML acute myeloid leukemia, B-ALL b-cell acute lymphoblastic leukemia, B/My MPAL b-cell/myeloid mixed phenotype acute leukemia, T-ALL t-cell acute lymphoblastic leukemia, T/My MPAL t-cell/myeloid mixed phenotype acute leukemia, UMAP uniform manifold approximation and projection, BM bone marrow. Source data are provided as a Source Data file.

References

    1. Shalek, A. K. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature510, 363–369 (2014). - DOI - PMC - PubMed
    1. Villani, A. C. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science.10.1126/science.aah4573 (2017). - PMC - PubMed
    1. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science347, 1138–1142 (2015). - DOI - PubMed
    1. Regev, A. et al. The Human Cell Atlas. Elife6, e27041 (2017). - DOI - PMC - PubMed
    1. Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature550, 451–453 (2017). - DOI - PubMed