[Preprint]. 2025 Mar 13:2024.09.19.613754.
doi: 10.1101/2024.09.19.613754.

scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution

Johannes C Hingerl et al. bioRxiv.

Abstract

Understanding how regulatory DNA elements shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA-seq and epigenomic profiling provides opportunities to build unifying models of gene regulation that capture sequence determinants across the steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multi-modal technologies. Here, we introduce scooby, the first framework to model scRNA-seq coverage and scATAC-seq insertion profiles along the genome from sequence at single-cell resolution. To this end, we use the pre-trained multi-omics profile predictor Borzoi as a foundation model, equip it with a cell-specific decoder, and fine-tune its sequence embeddings. Specifically, we condition the decoder on the position of each cell in a precomputed single-cell embedding, which results in strong generalization capability. Applied to a hematopoiesis dataset, scooby recapitulates cell-specific expression levels of held-out genes and identifies regulators and their putative target genes through in silico motif deletion. Moreover, accurate variant effect prediction with scooby allows bulk eQTL effects to be broken down into single-cell effects and their impact on chromatin accessibility and gene expression to be delineated. We anticipate that scooby will help unravel the complexities of gene regulation at the resolution of individual cells.
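
The conditioning of the decoder on a cell's position in the embedding can be illustrated with a minimal sketch. It assumes, purely for illustration, that the cell embedding is mapped linearly to one weight per sequence-embedding channel and that a softplus keeps the predicted profile non-negative. The function name, the dimensions and the random toy values are hypothetical and are not taken from the scooby code base.

```python
import math
import random


def cell_conditioned_decoder(seq_embedding, cell_embedding, hyper_weights, hyper_bias):
    """Predict one non-negative value per genomic bin for a single cell.

    seq_embedding:  list of bins, each a list of C channel values
    cell_embedding: list of D values locating the cell in the embedding
    hyper_weights:  C x D matrix mapping the cell embedding to channel weights
    hyper_bias:     list of C values
    """
    # Cell-specific channel weights: a linear function of the cell embedding.
    channel_weights = [
        sum(w * z for w, z in zip(row, cell_embedding)) + b
        for row, b in zip(hyper_weights, hyper_bias)
    ]
    # Pointwise readout per bin, followed by softplus (assumed output activation).
    profile = []
    for bin_vector in seq_embedding:
        activation = sum(w * x for w, x in zip(channel_weights, bin_vector))
        profile.append(math.log1p(math.exp(activation)))
    return profile


# Toy inputs: 6 bins, 4 channels, 3 latent dimensions (all hypothetical).
random.seed(0)
n_bins, n_channels, n_latent = 6, 4, 3
seq_embedding = [[random.gauss(0, 1) for _ in range(n_channels)] for _ in range(n_bins)]
hyper_weights = [[random.gauss(0, 1) for _ in range(n_latent)] for _ in range(n_channels)]
hyper_bias = [0.0] * n_channels

# Two cells at different positions in the embedding share one sequence embedding.
profile_a = cell_conditioned_decoder(seq_embedding, [1.0, 0.0, 0.0], hyper_weights, hyper_bias)
profile_b = cell_conditioned_decoder(seq_embedding, [0.0, 1.0, 0.0], hyper_weights, hyper_bias)
```

In this sketch the sequence embedding is computed once and reused for every cell, so only the inexpensive readout depends on the cell.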

Figures

Figure 1: scooby accurately predicts cell-state-specific expression and accessibility profiles from single-cell data.
a, scooby integrates a pre-trained sequence-to-profile model with a cell-state-specific decoder to model genomic profiles at single-cell resolution. The pre-trained model is fine-tuned on the target dataset using a parameter-efficient strategy, generating an adapted sequence embedding at 32 bp resolution. The cell-state-specific decoder takes this sequence embedding together with embedding vectors of single cells as input to predict scATAC-seq insertion and scRNA-seq coverage profiles at the single-cell level. b, Uniform Manifold Approximation and Projection (UMAP) visualization of the 10x multiome NeurIPS bone marrow dataset integrated with Poisson-MultiVI, colored by cell type. c, Representative example of predicted and observed gene expression (top) and accessibility (bottom) profiles of an erythroblast and a megakaryocyte-erythroid progenitor cell and its 100 nearest neighbors at the SLC25A37 locus (part of the test set). d, Distribution of the correlation between predicted and observed profiles on test sequences (n = 210 representative cells, Methods). Box plots show the distribution of Pearson R values for different comparisons of single-cell ATAC-seq and RNA-seq profiles. Comparisons include single cells versus corresponding pseudobulk, single cells versus scooby, single cells versus 100 nearest neighbors, and 100 nearest neighbors versus scooby. All pairwise comparisons per assay are statistically significant (two-sided Wilcoxon rank-sum test, P < 5 × 10^−20). In all box plots, the central line denotes the median, boxes represent the interquartile range (IQR) and whiskers show the distribution except for outliers. Outliers are all points outside 1.5 × IQR. B, B cell; T, T cell; Mono, Monocyte; prog, progenitor; HSC, Hematopoietic stem cell; ILC, Innate lymphoid cell; Lymph, Lymphoid; MK/E, Megakaryocyte and Erythrocyte; G/M, Granulocyte and Myeloid; NK, Natural Killer cell; cDC2, Classical dendritic cell type 2; pDCs, Plasmacytoid dendritic cells.
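
The comparisons in panel d rest on two operations: pooling the observed profile of a cell with those of its nearest neighbors, and computing a Pearson R between two profiles. The following is a minimal sketch of both, assuming Euclidean distance in the embedding and summation as the pooling rule. The function names and toy numbers are hypothetical and say nothing about the actual results.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pool_nearest_neighbors(cell_index, embeddings, profiles, k):
    """Sum the profiles of a cell and its k nearest neighbors (Euclidean distance)."""
    anchor = embeddings[cell_index]
    order = sorted(range(len(embeddings)), key=lambda i: math.dist(anchor, embeddings[i]))
    chosen = order[: k + 1]  # includes the cell itself, which has distance 0
    n_bins = len(profiles[0])
    return [sum(profiles[i][b] for i in chosen) for b in range(n_bins)]


# Toy data: 4 cells with 2-dimensional embeddings and 4-bin observed profiles.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0]]
profiles = [[0, 1, 0, 2], [1, 0, 0, 3], [0, 2, 1, 2], [9, 9, 9, 0]]
predicted = [0.5, 1.5, 0.5, 3.5]  # hypothetical model output for cell 0

pooled = pool_nearest_neighbors(0, embeddings, profiles, k=2)
r_single = pearson_r(predicted, profiles[0])
r_pooled = pearson_r(predicted, pooled)
```

In the figure the pooling uses 100 neighbors; `k=2` is used here only to keep the toy data small.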
Figure 2: scooby accurately models cell-type-specific gene expression counts and generalizes to unseen cell states.
a, Predicted and observed profiles are aggregated into a gene expression count matrix by summing coverage over exons. We obtain pseudobulk counts by summing over all predictions of every cell for each cell type. b, Normalized gene expression matrix (Methods) for cell-type-specific genes, observed (top) and predicted (bottom). Each row is a marker gene from test (black) or validation (gray), each column is a randomly selected cell. Cells are grouped by cell type (bottom track). c, We evaluate scooby's performance using two metrics: the correlation between predicted and observed gene expression counts within each cell type (left) and the model's ability to capture cell-type-specific deviations of gene expression from the gene mean (right). d, Distribution of gene-level Pearson correlation between log-transformed predicted and observed counts of scRNA-seq reads overlapping exons across cell types. e, Predicted against measured between-cell-type deviations of gene expression. Highlighted example combinations of marker gene and cell show strong deviations from the mean expression level. f, Across-gene Pearson correlation between log-transformed predicted and observed normoblast gene expression counts using an ablated model which was not trained on normoblast cells. Each bar corresponds to predictions made using the single-cell embeddings of cells of a different cell type. g, Mean-normalized observed and predicted gene expression of HEMGN along the diffusion pseudotime axis representing erythropoietic differentiation. Both the full and the no-normoblast model accurately recapitulate the expression dynamics. Dots are colored by cell type, lines are smoothed with a rolling mean (window size: 200 cells).
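
The aggregation in panel a and the deviation metric in panel c can be sketched as follows. The sketch assumes that exons are given as half-open bin intervals and that log(1 + count) is the log transform; both are assumptions, as are the function names and toy values.

```python
import math
from collections import defaultdict


def gene_count(profile, exon_bins):
    """Sum coverage over the half-open bin intervals [start, end) of a gene's exons."""
    return sum(sum(profile[start:end]) for start, end in exon_bins)


def pseudobulk_counts(cell_profiles, cell_types, exon_bins):
    """Sum the per-cell gene counts over all cells of each cell type."""
    totals = defaultdict(float)
    for profile, cell_type in zip(cell_profiles, cell_types):
        totals[cell_type] += gene_count(profile, exon_bins)
    return dict(totals)


def deviations_from_gene_mean(counts_by_type):
    """Log-transformed count of each cell type minus the mean over cell types."""
    logs = {t: math.log1p(c) for t, c in counts_by_type.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {t: value - mean_log for t, value in logs.items()}


# Toy data: three cells, five bins, one gene with two exons (bins 0 and 2 to 4).
cell_profiles = [[1, 0, 2, 0, 3], [1, 0, 1, 0, 1], [4, 0, 0, 0, 4]]
cell_types = ["Erythroblast", "Erythroblast", "HSC"]
exon_bins = [(0, 1), (2, 5)]

counts = pseudobulk_counts(cell_profiles, cell_types, exon_bins)
deviations = deviations_from_gene_mean(counts)
```

Applying the same functions to predicted and observed profiles yields the two count matrices that are then compared per cell type.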
Figure 3: In silico motif mutation enables TF motif effect scoring and reveals lineage and cell-state-specific regulators.
a, Schematic of TF motif effect scoring via in silico motif mutation. b, Pearson correlation of TF motif effect score with TF expression for scooby against chromVAR. The gray area marks the zone of improvement. c, Same as b for a scooby model trained on scRNA-seq only. d, Heatmap of average TF motif effect score per TF family (columns) across cell types (rows). e, Median-normalized effect of GATA1 in silico motif mutation on accessibility and expression (top) and median-normalized expression of GATA1 along the diffusion pseudotime axis representing erythropoietic differentiation (bottom). Dots are colored by cell type, lines are smoothed with a rolling mean (window size: 200 cells). f, UMAP visualization of multiomic metacells obtained from paired scRNA-seq and scATAC-seq data of epicardioid cells across multiple days, colored by cell type. The juxta-cardiac field progenitor cells (JCF cells, circle) and their transitions (arrows) to their two descendant cell types, cardiomyocytes and epicardial cells, are highlighted. g, CellRank transition probabilities towards epicardial and cardiomyocyte states within the JCF population. h, Correlation of TF motif effect scores with transition probability towards the cardiomyocyte (blue) and epicardial fate (yellow). i, Min-max scaled TF motif effect scores of GATA4 (left) and FOS (right) in the JCF cluster.
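
The idea behind panel a can be sketched with a toy example: mutate every occurrence of a motif in the input sequence, predict again, and score the change per cell. The sketch assumes masking with `N` as the mutation and a log2 fold change as the effect score. The motif string, the stand-in predictor `toy_predict` and its `tf_activity` parameter are hypothetical and do not represent the scooby model or its scoring procedure.

```python
import math

MOTIF = "AGATAA"  # GATA-like motif string, chosen for illustration only


def mask_motif(sequence, motif):
    """Replace every non-overlapping exact occurrence of the motif with N bases."""
    return sequence.replace(motif, "N" * len(motif))


def toy_predict(sequence, tf_activity):
    """Stand-in for a sequence-to-expression model: output grows with motif count.

    tf_activity is a hypothetical per-cell scalar, not a quantity from the paper.
    """
    return 1.0 + tf_activity * sequence.count(MOTIF)


def motif_effect(sequence, motif, tf_activity):
    """Log2 fold change of the prediction after masking the motif."""
    reference = toy_predict(sequence, tf_activity)
    mutated = toy_predict(mask_motif(sequence, motif), tf_activity)
    return math.log2(mutated / reference)


sequence = "CCAGATAAGGTTAGATAACC"  # contains the motif twice
masked = mask_motif(sequence, MOTIF)

effect_active_cell = motif_effect(sequence, MOTIF, tf_activity=2.0)
effect_inactive_cell = motif_effect(sequence, MOTIF, tf_activity=0.0)
```

Because the score is computed per cell, it can be correlated with TF expression across cells, as in panels b and c.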
Figure 4: scooby predicted variant effects are concordant with reported effects for bulk and single-cell eQTL studies and exhibit cell-type specificity.
a, Spearman correlation of predicted effects (log-fold change) with observed normalized eQTL effects for scooby against Borzoi. Each point indicates a cell type (OneK1K) or a tissue (GTEx). Dashed line marks the y = x line. scooby significantly outperforms track-matched Borzoi across the OneK1K cell types (Wilcoxon rank-sum test, P = 0.001). b, Same as a, but for scooby against seq2cells. scooby significantly outperforms seq2cells (Wilcoxon rank-sum test, P = 5 × 10^−4). c, Predicted aggregated effects (log-fold change) vs. observed whole-blood eQTL effect sizes. Red dotted lines mark thresholds below which predicted fold changes are deemed negligible (absolute fold change 3.5%; matching the threshold by Schwessinger et al. for comparability). Percentages quantify variants within each quadrant: blue, all variants; red, variants passing the 3.5% predicted effect threshold. d, Proportion of concordant eQTL predictions (same direction as observed), as a function of distance to the transcription start site when filtering for non-negligible predicted effect (red) or without filtering (blue). Dashed blue line indicates the mean proportion of concordant eQTL predictions across all distances (0.23). Stars indicate significance over random performance (binomial test). e, Schematic of the cell-type specificity evaluation. Cell-type-specific effects are either obtained from model predictions, the cell-specific accessibility in the peak closest to the variant position, or via the pseudobulk expression levels of the eGene. For each approach, cell types are ranked by the absolute magnitude of the effect to distinguish cell types with and without fine-mapped eQTL associations. f, Precision in recovering cell types with fine-mapped eQTL associations when considering the top k most highly ranked cell types using different ranking methods from e. Stars indicate significance when comparing scooby to the target gene expression baseline (two-sided Fisher exact test, P < 0.05).
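
Two of the quantities in this figure are simple to state in code: the proportion of concordant predictions with an optional effect-size filter (panels c and d) and the precision among the top k ranked cell types (panel f). The sketch below assumes that predicted effects are log2 fold changes, so that a 3.5% fold change corresponds to log2(1.035). The function names, cell-type labels and toy numbers are hypothetical.

```python
import math


def concordance(predicted, observed, min_abs_effect=0.0):
    """Fraction of variants with matching effect sign, among variants whose
    absolute predicted effect is at least min_abs_effect."""
    kept = [(p, o) for p, o in zip(predicted, observed) if abs(p) >= min_abs_effect]
    if not kept:
        return float("nan")
    return sum(1 for p, o in kept if p * o > 0) / len(kept)


def precision_at_k(scores_by_cell_type, positive_cell_types, k):
    """Share of the k cell types with the largest absolute score that are positives."""
    ranked = sorted(
        scores_by_cell_type,
        key=lambda cell_type: abs(scores_by_cell_type[cell_type]),
        reverse=True,
    )
    return sum(1 for cell_type in ranked[:k] if cell_type in positive_cell_types) / k


# Toy variant effects (assumed log2 fold changes) and observed eQTL effect sizes.
predicted = [0.30, -0.01, -0.20, 0.02, 0.10]
observed = [0.5, 0.4, -0.3, -0.2, -0.1]
threshold = math.log2(1.035)  # 3.5% fold change on the assumed log2 scale

unfiltered = concordance(predicted, observed)
filtered = concordance(predicted, observed, min_abs_effect=threshold)

# Toy per-cell-type predictions for one variant and its fine-mapped cell types.
scores = {"CD14+ Mono": -0.8, "Erythroblast": 0.05, "NK": 0.3, "B": -0.1}
fine_mapped = {"CD14+ Mono", "B"}
top2_precision = precision_at_k(scores, fine_mapped, k=2)
```

Ranking by absolute magnitude follows the description in panel e; the same `precision_at_k` function can score the accessibility and expression baselines by swapping in their per-cell-type values.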
Figure 5: scooby allows cell-type-specific delineation of bulk eQTLs.
a, Clustermap of eQTL effect size predictions across cell types. Left color bar indicates lineage membership. Genes were clustered according to their predicted effect size per cell type. Highlighted genes (and their fine-mapped eQTLs) have predicted variable effects (black) and were considered for an overlap with the GWAS Catalog. eQTL-GWAS term matches are colored in red. b, Heatmap of gene-variant pairs with strong cell-type-specific effects and matching GWAS terms. c, Predicted fold change in gene expression (top) and accessibility (bottom) between the alternative and reference alleles of variant rs143664050 in CD14+ Monocytes and Erythroblasts. Sequence attributions revealed that the destruction of a SPI1 motif affects model outputs only in CD14+ Monocytes (Methods). d, UMAP of the NeurIPS dataset colored by observed normalized SPI1 expression levels. e, UMAP of the effect of variant rs143664050 on TES expression levels.
