This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Jul 10:2024.07.05.602265.

doi: 10.1101/2024.07.05.602265.

Current genomic deep learning models display decreased performance in cell type specific accessible regions

Pooja Kathail¹, Richard W Shuai², Ryan Chung¹, Chun Jimmie Ye^{3,4,5,6,7,8}, Gabriel B Loeb^{9,10}, Nilah M Ioannidis^{1,2,8}

Affiliations

¹ Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
² Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA.
³ Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA.
⁴ Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA.
⁵ Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA.
⁶ Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
⁷ Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA.
⁸ Chan Zuckerberg Biohub, San Francisco, CA, USA.
⁹ Division of Nephrology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA.
¹⁰ Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA, USA.

PMID: 39026761
PMCID: PMC11257480
DOI: 10.1101/2024.07.05.602265

Pooja Kathail et al. bioRxiv. 2024.

Update in

Current genomic deep learning models display decreased performance in cell type-specific accessible regions.
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Genome Biol. 2024 Aug 1;25(1):202. doi: 10.1186/s13059-024-03335-2. PMID: 39090688.

Abstract

Background: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type specific CREs contain a large proportion of complex disease heritability.

Results: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models - Enformer and Sei - varies across the genome and is reduced in cell type specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type specific regulatory syntax - through single-task learning or high capacity multi-task models - can improve performance in cell type specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants.

Conclusions: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type specific accessible regions. We also identify strategies to maximize performance in cell type specific accessible regions.
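The stratified evaluation described in the Results can be illustrated with a minimal sketch. It assumes each test sequence carries a count of how many accessibility profiles have a peak there, and that sequences are assigned to a "cell type specific" or "ubiquitous" stratum by two cutoffs. The function names and the cutoff parameters `max_specific` and `min_ubiquitous` are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def stratified_reference_accuracy(measured, predicted, n_profiles_with_peak,
                                  max_specific, min_ubiquitous):
    """Pearson correlation of measured vs predicted accessibility per stratum.

    A sequence is treated as cell type specific if at most `max_specific`
    profiles have a peak there, and as ubiquitous if at least
    `min_ubiquitous` do (both cutoffs are assumptions of this sketch).
    Sequences between the two cutoffs are left out of both strata.
    """
    strata = {"cell_type_specific": ([], []), "ubiquitous": ([], [])}
    for m, p, k in zip(measured, predicted, n_profiles_with_peak):
        if k <= max_specific:
            key = "cell_type_specific"
        elif k >= min_ubiquitous:
            key = "ubiquitous"
        else:
            continue
        strata[key][0].append(m)
        strata[key][1].append(p)
    return {name: pearson(xs, ys) for name, (xs, ys) in strata.items()}
```

Reporting one correlation per stratum, rather than a single genome-wide value, is what makes a gap between cell type specific and ubiquitous regions visible.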

Keywords: Chromatin Accessibility; Deep Learning; Variant Effect Prediction.

Conflict of interest statement

Competing interests: The authors declare that they have no competing interests.

Figures

**Fig. 1. Overview of data processing and model evaluation.**
A) Schematic overview of the data preprocessing and evaluation pipeline used in this study. Cell type specific and ubiquitous peak sequences were annotated, and models were evaluated independently in these genomic regions. Models were evaluated on both “reference accuracy” (the models’ ability to predict experimentally measured accessibility from the reference genome) and “variant effect accuracy” (the models’ ability to predict allele-specific differences in accessibility). B) Four previously published datasets are used in subsequent analyses. The experimental assays and number of chromatin accessibility profiles are shown. Only chromatin accessibility profiles from ATAC-seq or DNase-seq are analyzed in this work. C) For each of the four datasets, the majority of test set sequences are cell type specific. Distributions shown are over test set sequences that had a peak in at least one chromatin accessibility profile in the dataset.

**Fig. 2. Evaluating state-of-the-art models in cell type specific peaks.**
A) Cell type specific peaks from trait-associated tissues represented in the Enformer training data are enriched for trait heritability. We categorize the 684 Enformer accessibility tracks into 9 tissue categories, mirroring the categorization in [21], and divide the accessible regions (peaks) present in each tissue category into high and low cell type specificity subsets based on their overlap with peaks in the other accessibility tracks (Methods). We compute heritability enrichments using the following trait-tissue associations – Height: Musculoskeletal-connective, BMI: Central nervous system, Asthma: Blood/immune, Diabetes: Pancreas, Eczema: Blood/immune, Smoking status: Central nervous system, Heel T-score: Cardiovascular. B) Enformer’s chromatin accessibility prediction performance (reference accuracy) is poor in high cell type specificity peaks and highly accurate in low cell type specificity peaks (regions that contain a peak in greater than 300 chromatin accessibility profiles). Distributions shown are over all 684 Enformer accessibility output tracks. For the Sei model, which predicts the probability of the presence of a peak, we report the prediction AUC and AUPRC stratified by cell type specificity in Fig. S2. C) Enformer and Sei classify high posterior inclusion probability eQTLs (PIP > 0.9) versus a matched negative set of low PIP eQTLs (PIP < 0.01) (using positive and negative variant sets obtained from [7]). Both models have reduced performance when classifying eQTLs in cell type specific accessibility peaks. D) Limited discrimination of trait-associated variants by Enformer variant effect predictions. Variants in chromatin accessible regions were subset to those with high Enformer SNP Accessibility Difference (SAD) scores (top 50% of Enformer SAD scores). Enrichment of these variants for trait heritability was assessed using partitioned LD score regression. We additionally report heritability enrichment for the top 10% of variants based on Enformer SAD scores in Fig. S6.
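The variant subsetting in panel D can be sketched as follows. The caption does not define the SAD score, so this sketch assumes it is the difference between a model's predicted accessibility for the alternate and reference alleles, and assumes variants are ranked by the magnitude of that difference. The function names and variant identifiers are hypothetical.

```python
def sad_scores(ref_predictions, alt_predictions):
    """Per-variant score: alternate-allele minus reference-allele prediction.

    This definition is an assumption of the sketch, not the paper's stated formula.
    """
    return [alt - ref for ref, alt in zip(ref_predictions, alt_predictions)]


def top_fraction_by_abs_sad(variant_ids, scores, fraction):
    """Return the ids of the top `fraction` of variants ranked by |score|."""
    ranked = sorted(zip(variant_ids, scores),
                    key=lambda pair: abs(pair[1]), reverse=True)
    n_keep = int(len(ranked) * fraction)  # rounds down
    return [variant_id for variant_id, _ in ranked[:n_keep]]
```

With `fraction=0.5` this corresponds to the "top 50%" subset named in the caption, and with `fraction=0.1` to the top 10% subset. The heritability enrichment step itself (partitioned LD score regression) is outside the scope of this sketch.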

**Fig. 3. Multi-task accessibility prediction models of related cell types exhibit poor cell type specific peak prediction.**
A) Kidney tubule cell type specific accessibility peaks are significantly enriched for heritability of the kidney function biomarker creatinine (Loeb et al. [27] data) and immune cell type specific accessibility peaks are significantly enriched for autoimmune trait heritability (Calderon et al. [28] data). B) Scatter plots of experimentally measured versus predicted accessibility in cell type specific and ubiquitous peaks for one cell type – Loop of Henle – in the Loeb et al. [27] data. Plotted points are sequences from the held out test chromosomes. C) Multi-task model reference accuracy is poor in cell type specific peaks for multi-task models trained on either the Loeb et al. [27] data or the Calderon et al. [28] data. Reference accuracy is measured as the Pearson correlation between experimentally measured versus predicted accessibility. Error bars represent the standard deviation over three replicate models. D) Multi-task model predictions across replicate models are significantly more variable for sequences in cell type specific peaks versus sequences in ubiquitous peaks. Variability is quantified as the coefficient of variation for each sequence across three model replicates (one-sided Mann-Whitney U-test with Benjamini-Hochberg multiple testing correction). E) Experimentally measured and predicted accessibility profiles from the Loeb et al. [27] data for a region around *NR2F1*. The ubiquitous peak near the center of the coverage track is well-predicted in all cell types by the multi-task model, while the cell type specific peak on the 5’ end of the coverage track is not predicted to be a peak in any cell type. F) Experimentally measured and predicted accessibility profiles from the Calderon et al. [28] data for a region around *ERAP2*. The ubiquitous peak on the 3’ end of the coverage track is well-predicted in all cell types by the multi-task model. The two cell type specific peaks towards the 5’ end of the coverage track are predicted to be peaks in all three cell types by the same model, although there is no measured accessibility in these regions in DCmye cells.
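The variability measure in panel D, a coefficient of variation per sequence across replicate models, can be sketched as below. The sketch assumes the sample standard deviation is used and that each sequence has one prediction per replicate; the caption does not specify either detail, and the function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicate_predictions):
    """Sample standard deviation divided by |mean| across replicates.

    Returns NaN when the mean is zero, where the ratio is undefined.
    Using the sample (n - 1) standard deviation is an assumption.
    """
    m = mean(replicate_predictions)
    if m == 0:
        return float("nan")
    return stdev(replicate_predictions) / abs(m)


def per_sequence_cv(predictions_by_sequence):
    """One coefficient of variation per sequence.

    `predictions_by_sequence` is a list in which each item holds the
    predictions for one sequence from each replicate model.
    """
    return [coefficient_of_variation(reps) for reps in predictions_by_sequence]
```

The two resulting lists (cell type specific versus ubiquitous sequences) could then be compared with a one-sided test such as `scipy.stats.mannwhitneyu(..., alternative="greater")`, followed by a Benjamini-Hochberg adjustment across the comparisons made.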

**Fig. 4. Increased model capacity to learn cell type specific regulatory syntax improves reference sequence prediction in cell type specific peaks.**
A) Reference accuracy of multi-task versus single-task models evaluated in cell type specific peak regions. Single-task models and high capacity multi-task models tend to outperform baseline multi-task models in cell type specific peaks. Reference accuracy of the same multi-task and single-task models in ubiquitous peaks is reported in Fig. S9A. B) In cell type specific peaks, pairwise correlations of peak height between cell types are computed for experimental (dark gray) and model predicted accessibility (dark and light blue). Model predicted accessibility is more correlated between cell types than experimentally measured accessibility, and this overcorrelation is more pronounced in predictions from multi-task models than predictions from single-task models. The correlation in experimental and model predicted accessibility between cell types in ubiquitous peaks is reported in Fig. S9B. C) High SAD score variants from all three tested model types (multi-task, high capacity multi-task, and single-task) are similarly enriched for trait heritability of tissue-matched traits. Using predictions from each model, we subset the variants in high and low cell type specificity peak regions based on the model’s SNP Accessibility Difference (SAD) scores. We use the median SAD score for all variants in a particular peak set (e.g. “Kidney high cell type specificity peaks”) as a threshold to subset to high SAD score variants. Enformer’s performance on this task for the same traits is shown in Fig. S19.
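The overcorrelation comparison in panel B can be sketched as a mean pairwise correlation of peak heights between cell types, computed once on measured and once on predicted accessibility. This is an illustrative simplification: averaging over cell type pairs is a choice made here, and the function names and cell type labels are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(peak_heights_by_cell_type):
    """Average Pearson correlation over all pairs of cell types.

    `peak_heights_by_cell_type` maps a cell type label to its peak heights,
    listed over the same set of peaks in the same order for every cell type.
    """
    cell_types = sorted(peak_heights_by_cell_type)
    if len(cell_types) < 2:
        raise ValueError("need at least two cell types")
    values = [pearson(peak_heights_by_cell_type[a], peak_heights_by_cell_type[b])
              for a, b in combinations(cell_types, 2)]
    return sum(values) / len(values)


def overcorrelation(measured, predicted):
    """Predicted minus measured mean pairwise correlation (positive = overcorrelated)."""
    return mean_pairwise_correlation(predicted) - mean_pairwise_correlation(measured)
```

A positive value indicates that predictions are more similar between cell types than the experimental data are, which is the pattern the caption reports as more pronounced for multi-task than for single-task models.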

References

1. Zhou J., Troyanskaya O.G.: Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12(10), 931–934 (2015)
2. Kelley D.R., Snoek J., Rinn J.L.: Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26(7), 990–999 (2016)
3. Kelley D.R., Reshef Y.A., Bileschi M., Belanger D., McLean C.Y., Snoek J.: Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28(5), 739–750 (2018)
4. Zhou J., Theesfeld C.L., Yao K., Chen K.M., Wong A.K., Troyanskaya O.G.: Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50(8), 1171–1179 (2018)
5. Agarwal V., Shendure J.: Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31(7), 107663 (2020)
