[Preprint]. 2024 Jul 10:2024.07.05.602265.
doi: 10.1101/2024.07.05.602265.

Current genomic deep learning models display decreased performance in cell type specific accessible regions

Pooja Kathail et al. bioRxiv.

Abstract

Background: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type specific CREs contain a large proportion of complex disease heritability.

Results: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models - Enformer and Sei - varies across the genome and is reduced in cell type specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type specific regulatory syntax - through single-task learning or high capacity multi-task models - can improve performance in cell type specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants.

Conclusions: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type specific accessible regions. We also identify strategies to maximize performance in cell type specific accessible regions.

Keywords: Chromatin Accessibility; Deep Learning; Variant Effect Prediction.

Conflict of interest statement

Competing interests: The authors declare that they have no competing interests.

Figures

Fig. 1. Overview of data processing and model evaluation.
A) Schematic overview of the data preprocessing and evaluation pipeline used in this study. Cell type specific and ubiquitous peak sequences were annotated, and models were evaluated independently in these genomic regions. Models were evaluated on both “reference accuracy” (the models’ ability to predict experimentally measured accessibility from the reference genome) and “variant effect accuracy” (the models’ ability to predict allele-specific differences in accessibility). B) Four previously published datasets are used in subsequent analyses. The experimental assays and number of chromatin accessibility profiles are shown. Only chromatin accessibility profiles from ATAC-seq or DNase-seq are analyzed in this work. C) For each of the four datasets, the majority of test set sequences are cell type specific. Distributions shown are over test set sequences that had a peak in at least one chromatin accessibility profile in the dataset.
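The annotation step described in panel A can be illustrated with a minimal sketch. It assumes that specificity is judged by counting how many accessibility profiles have a peak at a sequence; the function name `annotate_specificity`, the cutoffs `specific_max` and `ubiquitous_min`, and the toy peak matrix are hypothetical and are not taken from the paper, whose actual criteria are given in its Methods.

```python
from collections import Counter


def annotate_specificity(peak_matrix, specific_max, ubiquitous_min):
    """Label each sequence by how many accessibility profiles have a peak there.

    peak_matrix: list of rows, one per sequence; each row holds 0/1 flags,
                 one per chromatin accessibility profile.
    specific_max, ubiquitous_min: hypothetical cutoffs (assumptions, not the
                 paper's values).
    """
    labels = []
    for row in peak_matrix:
        n_profiles = sum(row)  # number of profiles with a peak at this sequence
        if n_profiles == 0:
            labels.append("no_peak")
        elif n_profiles <= specific_max:
            labels.append("cell_type_specific")
        elif n_profiles >= ubiquitous_min:
            labels.append("ubiquitous")
        else:
            labels.append("intermediate")
    return labels


# Toy data: 4 sequences x 4 profiles.
peaks = [
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
labels = annotate_specificity(peaks, specific_max=1, ubiquitous_min=4)

# As in panel C, summarize only sequences with a peak in at least one profile.
counts = Counter(label for label in labels if label != "no_peak")
```

Models would then be scored separately within each labeled subset rather than over all sequences at once.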
Fig. 2. Evaluating state-of-the-art models in cell type specific peaks.
A) Cell type specific peaks from trait-associated tissues represented in the Enformer training data are enriched for trait heritability. We categorize the 684 Enformer accessibility tracks into 9 tissue categories, mirroring the categorization in [21], and divide the accessible regions (peaks) present in each tissue category into high and low cell type specificity subsets based on their overlap with peaks in the other accessibility tracks (Methods). We compute heritability enrichments using the following trait-tissue associations – Height: Musculoskeletal-connective, BMI: Central nervous system, Asthma: Blood/immune, Diabetes: Pancreas, Eczema: Blood/immune, Smoking status: Central nervous system, Heel T-score: Cardiovascular. B) Enformer’s chromatin accessibility prediction performance (reference accuracy) is poor in high cell type specificity peaks and highly accurate in low cell type specificity peaks (regions that contain a peak in greater than 300 chromatin accessibility profiles). Distributions shown are over all 684 Enformer accessibility output tracks. For the Sei model, which predicts the probability of the presence of a peak, we report the prediction AUC and AUPRC stratified by cell type specificity in Fig. S2. C) Enformer and Sei classify high posterior inclusion probability eQTLs (PIP > 0.9) versus a matched negative set of low PIP eQTLs (PIP < 0.01) (using positive and negative variant sets obtained from [7]). Both models have reduced performance when classifying eQTLs in cell type specific accessibility peaks. D) Limited discrimination of trait-associated variants by Enformer variant effect predictions. Variants in chromatin accessible regions were subset to those with high Enformer SNP Accessibility Difference (SAD) scores (top 50% of Enformer SAD scores). Enrichment of these variants for trait heritability was assessed using partitioned LD score regression. We additionally report heritability enrichment for the top 10% of variants based on Enformer SAD scores in Fig. S6.
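The variant subsetting in panel D can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes a SAD score is the alternate-allele prediction minus the reference-allele prediction, summed over output positions, and that "high" scores are ranked by absolute value. The function names `sad_score` and `high_sad_subset`, the variant IDs, and all numbers are hypothetical.

```python
def sad_score(ref_prediction, alt_prediction):
    """Alt minus ref predicted accessibility, summed over output positions.

    Assumption: this mirrors a common definition of a SAD score; the paper's
    exact computation is described in its Methods.
    """
    return sum(alt - ref for ref, alt in zip(ref_prediction, alt_prediction))


def high_sad_subset(variant_ids, sad_scores, top_fraction=0.5):
    """Keep the top fraction of variants, ranked by absolute SAD score.

    Ranking by absolute value is an assumption made for this sketch.
    """
    ranked = sorted(
        zip(variant_ids, sad_scores), key=lambda pair: abs(pair[1]), reverse=True
    )
    n_keep = int(len(ranked) * top_fraction)
    return [variant_id for variant_id, _ in ranked[:n_keep]]


# Toy model outputs for four hypothetical variants: (reference, alternate).
predictions = {
    "v1": ([1.0, 2.0], [1.5, 2.5]),
    "v2": ([2.0, 2.0], [2.0, 1.75]),
    "v3": ([0.5, 0.5], [0.5, 0.5]),
    "v4": ([1.0, 1.0], [0.0, 0.0]),
}
ids = list(predictions)
scores = [sad_score(ref, alt) for ref, alt in predictions.values()]

top_half = high_sad_subset(ids, scores, top_fraction=0.5)    # top 50%, as in panel D
top_quarter = high_sad_subset(ids, scores, top_fraction=0.25)
```

The retained variants would then be passed to a separate heritability analysis (partitioned LD score regression in the caption), which this sketch does not attempt to reproduce.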
Fig. 3. Multi-task accessibility prediction models of related cell types exhibit poor cell type specific peak prediction.
A) Kidney tubule cell type specific accessibility peaks are significantly enriched for heritability of the kidney function biomarker creatinine (Loeb et al. [27] data) and immune cell type specific accessibility peaks are significantly enriched for autoimmune trait heritability (Calderon et al. [28] data). B) Scatter plots of experimentally measured versus predicted accessibility in cell type specific and ubiquitous peaks for one cell type – Loop of Henle – in the Loeb et al. [27] data. Plotted points are sequences from the held out test chromosomes. C) Multi-task model reference accuracy is poor in cell type specific peaks for multi-task models trained on either the Loeb et al. [27] data or the Calderon et al. [28] data. Reference accuracy is measured as the Pearson correlation between experimentally measured versus predicted accessibility. Error bars represent the standard deviation over three replicate models. D) Multi-task model predictions across replicate models are significantly more variable for sequences in cell type specific peaks versus sequences in ubiquitous peaks. Variability is quantified as the coefficient of variation for each sequence across three model replicates (one-sided Mann-Whitney U-test with Benjamini-Hochberg multiple testing correction). E) Experimentally measured and predicted accessibility profiles from the Loeb et al. [27] data for a region around NR2F1. The ubiquitous peak near the center of the coverage track is well-predicted in all cell types by the multi-task model, while the cell type specific peak on the 5’ end of the coverage track is not predicted to be a peak in any cell type. F) Experimentally measured and predicted accessibility profiles from the Calderon et al. [28] data for a region around ERAP2. The ubiquitous peak on the 3’ end of the coverage track is well-predicted in all cell types by the multi-task model. The two cell type specific peaks towards the 5’ end of the coverage track are predicted to be peaks in all three cell types by the same model, although there is no measured accessibility in these regions in DCmye cells.
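The two summary statistics named in panels C and D can be written out in a few lines. The sketch below is illustrative only: the function names and toy values are hypothetical, the use of the sample standard deviation in the coefficient of variation is an assumption, and the Mann-Whitney U-test with Benjamini-Hochberg correction is not implemented here.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers.

    Assumes both inputs have nonzero variance.
    """
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coefficient_of_variation(replicate_predictions):
    """Standard deviation divided by mean, across replicate model predictions.

    Assumption: sample standard deviation is used, and the mean is nonzero.
    """
    mean = statistics.fmean(replicate_predictions)
    return statistics.stdev(replicate_predictions) / mean


# Reference accuracy for one cell type: measured vs. predicted accessibility
# over test sequences (toy values).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
reference_accuracy = pearson(measured, predicted)

# Variability of one sequence's prediction across three replicate models.
stable_sequence_cv = coefficient_of_variation([2.0, 2.0, 2.0])
variable_sequence_cv = coefficient_of_variation([1.0, 2.0, 3.0])
```

In the caption's analysis, the per-sequence coefficients of variation from cell type specific peaks and from ubiquitous peaks are the two samples compared by the one-sided test.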
Fig. 4. Increased model capacity to learn cell type specific regulatory syntax improves reference sequence prediction in cell type specific peaks.
A) Reference accuracy of multi-task versus single-task models evaluated in cell type specific peak regions. Single-task models and high capacity multi-task models tend to outperform baseline multi-task models in cell type specific peaks. Reference accuracy of the same multi-task and single-task models in ubiquitous peaks is reported in Fig. S9A. B) In cell type specific peaks, pairwise correlations of peak height between cell types are computed for experimental (dark gray) and model predicted accessibility (dark and light blue). Model predicted accessibility is more correlated between cell types than experimentally measured accessibility, and this overcorrelation is more pronounced in predictions from multi-task models than predictions from single-task models. The correlation in experimental and model predicted accessibility between cell types in ubiquitous peaks is reported in Fig. S9B. C) High SAD score variants from all three tested model types (multi-task, high capacity multi-task, and single-task) are similarly enriched for trait heritability of tissue-matched traits. Using predictions from each model, we subset the variants in high and low cell type specificity peak regions based on the model’s SNP Accessibility Difference (SAD) scores. We use the median SAD score for all variants in a particular peak set (e.g. “Kidney high cell type specificity peaks”) as a threshold to subset to high SAD score variants. Enformer’s performance on this task for the same traits is shown in Fig. S19.
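The overcorrelation comparison in panel B can be sketched as a mean of pairwise Pearson correlations between cell types, computed once on measured peak heights and once on predicted ones. Everything named here (`mean_pairwise_correlation`, the cell type labels, and the toy peak heights) is hypothetical and only illustrates the comparison; it is not the authors' code or data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; assumes both inputs have nonzero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_correlation(peak_heights):
    """Average Pearson correlation over all pairs of cell types.

    peak_heights: dict mapping a cell type name to its peak heights, listed
                  over the same cell type specific peak sequences.
    """
    pairs = list(combinations(sorted(peak_heights), 2))
    total = sum(pearson(peak_heights[a], peak_heights[b]) for a, b in pairs)
    return total / len(pairs)


# Toy peak heights over four cell type specific peaks in two cell types.
experimental = {
    "cell_type_A": [3.0, 0.0, 0.0, 1.0],
    "cell_type_B": [0.0, 3.0, 1.0, 0.0],
}
model_predicted = {
    "cell_type_A": [2.0, 1.0, 0.5, 1.0],
    "cell_type_B": [1.0, 2.0, 1.0, 0.5],
}

experimental_corr = mean_pairwise_correlation(experimental)
predicted_corr = mean_pairwise_correlation(model_predicted)

# A positive gap means predictions are more alike across cell types than the
# measurements are, which is the overcorrelation the caption describes.
overcorrelation = predicted_corr - experimental_corr
```

In this toy case the predictions blur the differences between the two cell types, so the gap is positive.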

References

    1. Zhou J., Troyanskaya O.G.: Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12(10), 931–934 (2015)
    2. Kelley D.R., Snoek J., Rinn J.L.: Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26(7), 990–999 (2016)
    3. Kelley D.R., Reshef Y.A., Bileschi M., Belanger D., McLean C.Y., Snoek J.: Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28(5), 739–750 (2018)
    4. Zhou J., Theesfeld C.L., Yao K., Chen K.M., Wong A.K., Troyanskaya O.G.: Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50(8), 1171–1179 (2018)
    5. Agarwal V., Shendure J.: Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31(7), 107663 (2020)