Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Nov;22(11):2275-2285.
doi: 10.1038/s41592-025-02854-5. Epub 2025 Oct 22.

scooby: modeling multimodal genomic profiles from DNA sequence at single-cell resolution

Affiliations

scooby: modeling multimodal genomic profiles from DNA sequence at single-cell resolution

Johannes C Hingerl et al. Nat Methods. 2025 Nov.

Abstract

Understanding how regulatory sequences shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA sequencing and epigenomic profiling provides opportunities to build models capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multimodal technologies. Here, we introduce scooby, a framework to model genomic profiles of single-cell RNA-sequencing coverage and single-cell assay for transposase-accessible chromatin using sequencing insertions from sequence at single-cell resolution. For this, we leverage the pretrained multiomics profile predictor Borzoi and equip it with a cell-specific decoder. Scooby recapitulates cell-specific expression levels of held-out genes and identifies regulators and their putative target genes. Moreover, scooby allows resolving single-cell effects of bulk expression quantitative trait loci and delineating their impact on chromatin accessibility and gene expression. We anticipate scooby to aid unraveling the complexities of gene regulation at the resolution of individual cells.

PubMed Disclaimer

Conflict of interest statement

Competing interests: J.D.B. holds patents related to ATAC-seq and is an SAB member of Camp4 and seqWell. F.J.T. consults for Immunai, Singularity Bio, CytoReason and Omniscope, and has ownership interest in Dermagnostix and Cellarity. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. scooby accurately predicts cell-state-specific expression and accessibility profiles from single-cell data.
a, scooby integrates a pretrained sequence-to-profile model with a cell-state-specific decoder to model genomic profiles at single-cell resolution. The pretrained model is fine-tuned on the target dataset using a parameter-efficient strategy, generating an adapted sequence embedding at 32-bp resolution. The cell-state-specific decoder takes this sequence embedding together with embedding vectors of single cells as input to predict scATAC-seq insertion and scRNA-seq coverage profiles at single-cell level. b, Uniform manifold approximation and projection (UMAP) visualization of the 10x multiome NeurIPS bone marrow dataset integrated with Poisson-MultiVI, colored by cell type. c, Representative example of predicted and observed gene expression (top) and accessibility (bottom) profiles of an erythroblast and a megakaryocyte–erythroid progenitor cell and its 100 nearest neighbors at the SLC25A37 locus (part of the test set). d, Distribution of the correlation between predicted and observed profiles on test sequences (n = 210 representative cells; Methods). Box plots showing the distribution of Pearson R values for different comparisons of single-cell ATAC-seq and RNA-seq profiles. Comparisons include single cells versus corresponding pseudobulk, single cells versus scooby, single cells versus 100 nearest neighbors and 100 nearest neighbors versus scooby. All pairwise comparisons per assay are statistically significant (two-sided Wilcoxon rank-sum test, ATAC: P = 3 × 10−36, 1 × 10−35, 7 × 10−34. RNA: P = 3 × 10−36, 4 × 10−36, 5 × 10−20). In all box plots, the central line denotes the median, boxes represent the interquartile range and whiskers show the distribution except for outliers. Outliers are all points outside 1.5 times the interquartile range. B, B cell; T, T cell; Mono, monocyte; prog, progenitor; HSC, hematopoietic stem cell; ILC, innate lymphoid cell; Lymph, lymphoid; MK/E, megakaryocyte and erythrocyte; G/M, granulocyte and myeloid; NK, natural killer cell; cDC2, classical dendritic cell type 2; pDCs, plasmacytoid dendritic cells.
Fig. 2
Fig. 2. scooby accurately models cell-type-specific gene expression counts and generalizes to unseen cell states.
a, Predicted and observed profiles are aggregated into a gene expression count matrix by summing coverage over exons. We obtain pseudobulk counts by summing over all predictions of every cell for each cell type. b, Normalized gene expression matrix (Methods) for cell-type-specific genes, observed (top) and predicted (bottom). Each row is a marker gene from test (black) or validation (gray), and each column is a randomly selected cell. Cells are grouped by cell type (bottom track) c, We evaluate scooby’s performance using two metrics: the correlation between predicted and observed gene expression counts within each cell type (left) and the model’s ability to capture cell-type-specific deviations of gene expression to gene mean (right). d, Distribution of gene-level Pearson correlation between log-transformed predicted and observed counts of scRNA-seq reads overlapping exons across cell types. e, Predicted against measured between-cell-type deviations of gene expression. Exemplarily highlighted combinations of marker gene and cell type show strong deviations from the mean expression level. f, Across-gene Pearson correlation between log-transformed predicted and observed normoblast gene expression counts using an ablated model that was not trained on normoblast cells. Each bar corresponds to predictions done using the single-cell embeddings of cells of a different cell type. g, Mean-normalized observed and predicted gene expression of HEMGN along the diffusion pseudotime axis representing erythropoietic differentiation. Both the full and the no-normoblast model accurately recapitulate the expression dynamics. Dots are colored by cell type, and lines are smoothed with a rolling mean (window size, 200 cells).
Fig. 3
Fig. 3. In silico motif mutation enables TF motif effect scoring and reveals lineage and cell-state-specific regulators.
a, Schematic of TF motif effect scoring via in silico motif mutation. b, Pearson correlation of TF motif effect score with TF expression for scooby against chromVAR. The gray area marks the zone of improvement. We used a one-sided Wilcoxon test to compute the P value. c, Same as b for a scooby model trained on scRNA-seq only. d, Heat map of average TF motif effect score per TF family (columns) across cell types (rows). e, Median-normalized effect of GATA1 in silico motif mutation on accessibility and expression (top) and median-normalized expression of GATA1 along the diffusion pseudotime axis representing erythropoietic differentiation (bottom). Dots are colored by cell type, and lines are smoothed with a rolling mean (window size, 200 cells). f, UMAP visualization of multiomic metacells obtained from paired scRNA-seq and scATAC-seq data of epicardioid cells across multiple days, colored by cell type. The JCF cells (circle) and their transitions (arrows) to their two descendant cell types—cardiomyocytes and epicardial cells—are highlighted. g, CellRank transition probabilities toward epicardial and cardiomyocyte states within the JCF population. h, Correlation of TF motif effect scores with transition probability toward the cardiomyocyte (blue) and epicardial fate (yellow). i, Min–max scaled TF motif effect scores of GATA4 (left) and FOS (right) in the JCF cluster.
Fig. 4
Fig. 4. scooby-predicted variant effects are concordant with reported effects for bulk and single-cell eQTL studies and exhibit cell-type specificity.
a, Spearman correlation of predicted effects (log-fold change) with observed normalized eQTL effects for scooby against Borzoi. Each point indicates a cell type (OneK1K) or a tissue (GTEx). Dashed line marks y = x. Scooby significantly outperforms track-matched Borzoi across the OneK1K cell types (Wilcoxon rank-sum, two-sided, P = 0.001). b, Same as a, but for scooby against seq2cells. Scooby significantly outperforms seq2cells (Wilcoxon rank-sum, two-sided, P = 5 × 10−4). c, Predicted aggregated effects (log-fold change) versus observed whole-blood eQTL effect sizes. Red dotted lines mark thresholds below which predicted fold changes are deemed negligible (absolute fold change, 3.5%). Percentages quantify variants within each quadrant: blue indicates all variants; red denotes variants passing the 3.5% predicted effect threshold. d, Proportion of concordant eQTL predictions (same direction as observed), as a function of distance to the TSS when filtering for non-negligible predicted effect (red) or without filtering (blue). Dashed blue line indicates the mean proportion of concordant eQTL predictions across all distances (0.23). Stars indicate significance over random performance (Binomial test). e, Schematic of the cell-type-specific evaluation. Cell-type-specific effects are obtained from model predictions, from the cell-specific accessibility in the peak closest to the variant position or via the pseudobulk expression levels of the eGene. For each approach, cell types are ranked by the absolute magnitude of the effect to distinguish cell types with and without fine-mapped eQTL associations. f, Precision in recovering cell types with fine-mapped eQTL associations when considering the top k most highly ranked cell types using methods from e. Asterisks indicate significance when comparing Scooby to the target gene expression baseline (two-sided Fisher exact test, P = 0.04, 0.02, 0.01, 0.03). eSNP, eQTL single-nucleotide polymorphism.
Fig. 5
Fig. 5. scooby allows cell-type-specific delineation of bulk eQTLs.
a, Clustermap of eQTL effect size predictions across cell types. Left color bar indicates lineage membership. Genes were clustered according to their predicted effect size per cell type. Highlighted genes (and their fine-mapped eQTLs) have predicted variable effects (black) and were considered for an overlap with the GWAS Catalog. eQTL-GWAS term matches are colored in red. b, Heat map of gene–variant pairs with strong cell-type-specific effects and matching GWAS terms. c, Predicted fold change in gene expression (top) and accessibility (bottom) between the alternative and reference alleles of variant rs143664050 in CD14+ monocytes and erythroblasts. Sequence attributions revealed the destruction of an SPI1 motif to only affect model outputs in CD14+ monocytes (Methods). d, UMAP of the NeurIPS dataset colored by observed normalized SPI1 expression levels. e, UMAP showing the effect of variant rs143664050 on TES expression levels.
Extended Data Fig. 1
Extended Data Fig. 1. Scooby architecture overview.
Neural network diagram. Green boxes denote trainable parameters, blue boxes depict frozen (non-trained) model parts. 524 kb of DNA sequence is processed by a LoRA augmented Borzoi stem with an additional MLP on the Borzoi embedding (left), whereas the single cell embedding is passed through a MLP (right) to predict the filter weights (red boxes) used to decode the sequence embedding.
Extended Data Fig. 2
Extended Data Fig. 2. Memory-efficient storage of single-cell RNA-seq data.
Memory-efficient storage of single-cell RNA-seq data using a modified snapATAC2.0 format. Each row in the resulting sparse matrix represents a cell, and each column represents a genomic position. Reads are stored at their start positions with values indicating read length. Split reads are stored as multiple entries. Negative values indicate reads mapped to the negative strand.
Extended Data Fig. 3
Extended Data Fig. 3. Scooby predicted against observed counts for two representative cell types.
Log-transformed predicted versus observed counts of scRNA-seq reads overlapping exons for the worst (a) and best (b) predicted cell type.
Extended Data Fig. 4
Extended Data Fig. 4. Scooby predicts gene expression more accurately than seq2cells.
a, Across-gene Pearson correlation for all cell types comparing scooby and seq2cells. b, Between-cell-type Pearson correlation after subtracting gene and cell mean gene expression comparing scooby and seq2cells.
Extended Data Fig. 5
Extended Data Fig. 5. Scooby model ablation studies.
Performance comparison of alternative modeling approaches at predicting gene expression counts and binned profiles.
Extended Data Fig. 6
Extended Data Fig. 6. TF motif effect score comparison with scBasset.
a, Pearson correlation of TF motif effect score with TF expression for scooby against scBasset. The gray area marks the zone of improvement. b, Same as a for a scooby trained on scRNA-seq only.
Extended Data Fig. 7
Extended Data Fig. 7. TF motif effect scoring allows the investigation of TFs and target gene regulation.
a, Heatmap of TF motif effect scores for differentially expressed TFs for all cell types. Genes and cell types are clustered according to their TF motif effect scores. b, Heatmap of genes with high motif mutation effects of GATA1, KLF1 and TAL1 in the erythrocyte lineage. Genes are clustered according to their motif effect score. The three most significantly enriched GO terms for each cluster are shown. Gene names are only shown for genes ascribed to these terms.
Extended Data Fig. 8
Extended Data Fig. 8. Scooby test set performance for the Epicardioids and OneK1K datasets.
Distribution of gene-level Pearson correlation between log-transformed predicted and observed counts of scRNA-seq reads overlapping exons across cell types for the Epicardiods dataset (a) and the OneK1K dataset (c). Predicted against measured between-cell-type deviations of gene expression for the Epicardiods dataset (b) and the OneK1K dataset (d).
Extended Data Fig. 9
Extended Data Fig. 9. Variant effect prediction performance on GTEx whole blood eQTLs for seq2cells and Borzoi.
a, Seq2cells predicted aggregated effects (log-fold change) vs. observed whole-blood eQTL effect sizes. Red dotted lines mark thresholds below which predicted fold-changes are deemed negligible (absolute fold change 3.5%; matching the threshold by Schwessinger et al. for comparability). Percentages quantify variants within each quadrant: blue - all variants; red - variants passing the 3.5% predicted effect threshold. b, Same as a, but for Borzoi. c, Proportion of concordant seq2cells eQTL predictions (same direction as observed), as a function of distance to the transcription start site when filtering for non-negligible predicted effect (red) or without filtering (blue). Dashed blue line indicates the mean proportion of concordant eQTL predictions across all distances (0.23). Stars indicate significance over random performance (Binomial test). d, Same as in c, but for Borzoi.
Extended Data Fig. 10
Extended Data Fig. 10. Example of a cell-type unspecific variant.
a, Predicted fold change in gene expression (top) and accessibility (bottom) between the alternative and reference alleles of variant rs62032983 in CD14+ Monocytes and Erythroblasts. Sequence attributions revealed the destruction of an ELF1 motif to affect model outputs across cell types (Methods). b, UMAP of observed normalized ELF1 expression levels.

Update of

References

    1. Sasse, A., Chikina, M. & Mostafavi, S. Unlocking gene regulation with sequence-to-function models. Nat. Methods21, 1374–1377 (2024). - PubMed
    1. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res.26, 990–999 (2016). - PMC - PubMed
    1. Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep.31, 107663 (2020). - PubMed
    1. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods12, 931–934 (2015). - PMC - PubMed
    1. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol.33, 831–838 (2015). - PubMed

LinkOut - more resources