[Preprint]. 2023 Dec 23:2023.12.21.572730.
doi: 10.1101/2023.12.21.572730.

Characterizing uncertainty in predictions of genomic sequence-to-activity models

Ayesha Bajwa et al. bioRxiv.

Abstract

Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.

Figures

Figure 1. Reference genome predictions are largely consistent across replicates, even when incorrect.
(a) For each of the 5,313 Basenji2 tracks, we report the Pearson correlation between predictions from replicate models 1 and 2 on reference genome sequences held out during training. Reference genome predictions are highly correlated across the two replicates, and predictions for CAGE tracks are less consistent than for other assays. (b) For genes held out during training, we display the distribution of Pearson correlations between gradient saliency maps from all pairs of replicates. The red dashed line indicates the mean Pearson correlation. (c) We report the proportion of test sequences that are classified consistently correctly, inconsistently, or consistently incorrectly in a binary peak prediction task. We distinguish performance on all bins (left), peaks (center), and peaks at transcription start sites (right).
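The replicate comparison in panel (a) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `per_track_pearson`, the `(n_bins, n_tracks)` array layout, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def per_track_pearson(pred_a, pred_b):
    """Pearson correlation between two replicates, computed per track.

    pred_a, pred_b: arrays of shape (n_bins, n_tracks) holding the
    predictions of two model replicates on the same held-out sequences.
    Returns an array of shape (n_tracks,).
    """
    a = pred_a - pred_a.mean(axis=0)
    b = pred_b - pred_b.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return (a * b).sum(axis=0) / denom

# Synthetic stand-in for two replicates: shared signal plus independent noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 4))
rep1 = signal + 0.1 * rng.normal(size=signal.shape)
rep2 = signal + 0.1 * rng.normal(size=signal.shape)

r = per_track_pearson(rep1, rep2)  # one correlation per track
```

A histogram of `r` over all tracks, grouped by assay type, would give a view comparable to the one the caption describes.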
Figure 2. Replicates display greater inconsistency at predicting effects of mutations in TF motifs compared to canonical TF motifs.
(a) Graphical depiction of the TF activity score and TF mutation activity score calculations. TF activity scores were calculated by comparing model predictions for endogenous background sequences versus background sequences with a canonical TF motif inserted at a fixed location upstream of a gene’s TSS. TF mutation activity scores were calculated by comparing model predictions for canonical motif-inserted sequences versus sequences where a single base-pair mutation was made to the canonical motif. Scores were averaged over 100 different background sequences. (b) We compute TF activity scores and TF mutation activity scores by inserting motifs at four different fixed positions (10 bp, 100 bp, 1,000 bp, and 10,000 bp) upstream of each selected gene’s TSS. For all Basenji2 prediction tracks, we report the fraction of TFs with inconsistent predicted directional effects across model replicates for both TF activity scores (canonical motifs versus background sequences) and TF mutation activity scores (mutated motifs versus canonical motifs). In all but two cases, we observe greater inconsistency in TF mutation activity scores than in TF activity scores (one-sided Mann-Whitney U test; **** indicates a Benjamini-Hochberg corrected p-value < 1e-4).
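The score calculation and the inconsistency fraction described in this caption might look roughly like the sketch below. The caption says only that predictions are "compared", so taking a simple difference is an assumption here, as are the function names `activity_score` and `fraction_inconsistent`, the array shapes, and the toy numbers.

```python
import numpy as np

def activity_score(pred_with_motif, pred_background):
    """Mean prediction difference over background sequences (last axis).

    ASSUMPTION: the comparison is a plain difference; the source does not
    specify the exact form. Inputs have shape
    (n_replicates, n_tfs, n_backgrounds).
    """
    return (pred_with_motif - pred_background).mean(axis=-1)

def fraction_inconsistent(scores):
    """Fraction of TFs whose score sign is not shared by all replicates.

    scores: array of shape (n_replicates, n_tfs).
    """
    signs = np.sign(scores)
    consistent = (signs == signs[0]).all(axis=0) & (signs[0] != 0)
    return 1.0 - consistent.mean()

# Toy data: 3 replicates, 2 TFs, 100 background sequences.
offsets = np.array([[0.5, -0.2],
                    [0.4,  0.3],
                    [0.6, -0.1]])
background = np.zeros((3, 2, 100))
with_motif = background + offsets[:, :, None]

scores = activity_score(with_motif, background)
frac = fraction_inconsistent(scores)  # TF 0 agrees in sign, TF 1 does not
```

The same two functions would apply to the mutation score by passing mutated-motif predictions and canonical-motif predictions in place of the motif and background arrays.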
Figure 3. eQTL sign predictions from tissue-matched CAGE tracks have high inconsistency across replicates.
(a) We show the proportion of fine-mapped eQTLs with consistently correct, inconsistent, and consistently incorrect eQTL effect sign predictions across replicates. About 55% of eQTLs have inconsistent predictions, while 29% are consistently correct and 16% are consistently incorrect. (b) eQTLs with inconsistent sign predictions tend to have smaller predicted effect sizes (mean of SAD score magnitude across replicates) in the tissue-matched CAGE track. (c) A comparison of accuracy for eQTL sign prediction shows that the ensemble majority vote does not substantially or consistently outperform a single replicate. Each point represents the fine-mapped eQTL set of a different tissue.
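The three-way consistency labels in panel (a) and the majority vote in panel (c) can be illustrated with a short sketch. The function names, the +1/-1 sign encoding, and the toy predictions are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def classify_consistency(pred_signs, true_sign):
    """Label each variant by agreement of replicate sign predictions.

    pred_signs: (n_replicates, n_variants) array of +1 / -1.
    true_sign:  (n_variants,) array of +1 / -1.
    """
    all_correct = (pred_signs == true_sign).all(axis=0)
    all_wrong = (pred_signs == -true_sign).all(axis=0)
    labels = np.full(true_sign.shape, "inconsistent", dtype=object)
    labels[all_correct] = "consistently correct"
    labels[all_wrong] = "consistently incorrect"
    return labels

def majority_vote(pred_signs):
    """Ensemble sign by majority; assumes an odd number of replicates."""
    return np.sign(pred_signs.sum(axis=0))

# Toy example: 5 replicates, 4 variants.
pred = np.array([[1,  1, -1,  1],
                 [1, -1, -1,  1],
                 [1,  1, -1, -1],
                 [1,  1, -1,  1],
                 [1, -1, -1,  1]])
truth = np.array([1, 1, 1, -1])

labels = classify_consistency(pred, truth)
ensemble_accuracy = (majority_vote(pred) == truth).mean()
single_accuracy = (pred[0] == truth).mean()
```

In this toy case the majority vote and the first replicate reach the same accuracy, which mirrors the kind of comparison shown in panel (c) without reproducing its data.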
Figure 4. Uncertainty in predictions on personal genome sequences.
(a) We segment the 3259 genes analyzed in Huang et al. [10] by the number of replicates with cross-individual correlations greater than 0. For each gene, the cross-individual correlation is calculated as the Spearman correlation between model predictions and measured RNA-seq across individuals. (b) We compare the cross-individual Spearman correlations of replicates 1 and 2. (c) We compare the cross-individual Spearman correlations of replicate 1 and an ensemble model. The ensemble model averages an individual’s predicted expression rank across all five replicates. (d) We show the number of drivers that are identified reproducibly in multiple replicates, stratified by whether these drivers are found in genes with high or low uncertainty. (e) We plot the number of replicates (3-5) that agree on the SAD score sign of a driver, stratified both by the number of replicates that identify the driver and by whether the driver was found in a high or low uncertainty gene.
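For a single gene, the cross-individual correlation in panels (a) through (c) and the rank-averaging ensemble could be computed along these lines. This is a sketch under stated assumptions: the helper names `average_ranks`, `spearman`, and `ensemble_rank` are hypothetical, ties are handled with average ranks, and the numbers are invented toy values.

```python
import numpy as np

def average_ranks(x):
    """1-based ranks of a 1-D array, with tied values given their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

def ensemble_rank(preds):
    """Average each individual's predicted rank across replicates.

    preds: (n_replicates, n_individuals) predictions for one gene.
    """
    return np.stack([average_ranks(p) for p in preds]).mean(axis=0)

# Toy data for one gene: 3 replicates, 4 individuals.
measured = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.4, 0.3, 0.2, 0.1],
                  [0.1, 0.3, 0.2, 0.4]])

rhos = np.array([spearman(p, measured) for p in preds])
n_positive = int((rhos > 0).sum())  # replicates with correlation > 0
rho_ensemble = spearman(ensemble_rank(preds), measured)
```

Repeating this per gene and tabulating `n_positive` would give a breakdown of the kind described for panel (a).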

References

    1. Zhou Jian and Troyanskaya Olga G. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10):931–934, 2015.
    2. Zhou Jian, Theesfeld Chandra L, Yao Kevin, Chen Kathleen M, Wong Aaron K, and Troyanskaya Olga G. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nature Genetics, 50(8):1171–1179, 2018.
    3. Agarwal Vikram and Shendure Jay. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Reports, 31(7), 2020.
    4. Kelley David R. Cross-species regulatory sequence activity prediction. PLOS Computational Biology, 16(7):e1008050, July 2020. doi: 10.1371/journal.pcbi.1008050.
    5. Avsec Žiga, Agarwal Vikram, Visentin Daniel, Ledsam Joseph R., Grabska-Barwinska Agnieszka, Taylor Kyle R., Assael Yannis, Jumper John, Kohli Pushmeet, and Kelley David R. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10):1196–1203, October 2021. doi: 10.1038/s41592-021-01252-x.
