Nucleic Acids Res. 2025 Sep 23;53(18):gkaf940.

doi: 10.1093/nar/gkaf940.

Ultra-fast variant effect prediction using biophysical transcription factor binding models

Rezwan Hosseini¹, Ali Tugrul Balci², Dennis Kostka¹, Nathan Clark³, Maria Chikina¹

Affiliations

¹ Department of Computation and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States.
² Genentech Inc., Computational Biology and Translation, South San Francisco, CA 94080, United States.
³ Department of Biological Science, University of Pittsburgh, Pittsburgh, PA 15260, United States.

PMID: 41063341
PMCID: PMC12507518
DOI: 10.1093/nar/gkaf940

Rezwan Hosseini et al. Nucleic Acids Res. 2025.

Abstract

Sequence variation within transcription factor (TF)-binding sites can significantly affect TF-DNA interactions, influencing gene expression and contributing to disease susceptibility or phenotypic traits. Despite recent progress in deep sequence-to-function models that predict functional output from sequence data, these methods perform inadequately on some variant effect prediction tasks, especially for common genetic variants. This limitation underscores the importance of leveraging biophysical models of TF binding to enhance the interpretability of variant effect scores and facilitate mechanistic insights. We introduce motifDiff, a novel computational tool designed to quantify variant effects using mono- and dinucleotide position weight matrices. motifDiff offers several key advantages, including scalability to score millions of variants within minutes, implementation of a statistically rigorous normalization strategy that is critical for optimal performance, and support for both dinucleotide and mononucleotide models. We demonstrate motifDiff's efficacy by evaluating it across diverse ground truth datasets that quantify the effects of common variants in vivo, thereby establishing robust benchmarks for the predictive value of variant effect calculations. Finally, we show that our tool provides unique insights when interpreting human accelerated regions. motifDiff is available as a standalone Python application at https://github.com/rezwanhosseini/MotifDiff.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
(A) Overview of *motifDiff*, which comprises four steps: (1) scanning the one-hot-encoded sequence (REF/ALT) with the log-transformed PPM of a TFBS by computing the convolution between them; (2) normalization, which maps the convolution scores (x-axis) to probabilities (y-axis) using the cumulative distribution function (c.d.f.) of the motif; (3) pooling the mapped scores by taking either the maximum (the part of the sequence that best matches the motif) or the average (the average occupancy of the sequence in the motif distribution); and (4) taking the difference between the pooled REF and ALT scores to determine whether the variant increases (constructive) or decreases (destructive) the best match or average occupancy. (B) The second phase of the analysis, used for benchmarking, takes a set of variants and a set of motif PWMs and calculates the difference in binding between the REF and ALT sequences surrounding each variant. The resulting Diffs (single-motif binding effects) are then used as features to train an Elastic Net regression model for fine-tuned prediction of the variant effect.
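
The four steps in panel (A) can be illustrated with a minimal pure-Python sketch. This is not the motifDiff implementation: the function names, the uniform 0.25 background, the toy 3-bp motif, and the exact enumeration of all k-mers to obtain the c.d.f. are assumptions made for illustration, and only the forward strand is scanned.

```python
import math
from bisect import bisect_right
from itertools import product

BASES = "ACGT"

def log_odds(ppm, background=0.25):
    """Step 1a: log-transform a PPM given as a list of {base: probability}."""
    return [{b: math.log(col[b] / background) for b in BASES} for col in ppm]

def window_score(kmer, lom):
    """Summed log-odds of one window (rounded so tied scores compare equal)."""
    return round(sum(lom[i][b] for i, b in enumerate(kmer)), 9)

def scan(seq, lom):
    """Step 1b: slide the motif along the sequence, one score per window."""
    width = len(lom)
    return [window_score(seq[s:s + width], lom)
            for s in range(len(seq) - width + 1)]

def null_scores(lom):
    """Sorted scores of every possible k-mer: an exact null for short motifs."""
    return sorted(window_score(kmer, lom)
                  for kmer in product(BASES, repeat=len(lom)))

def to_probability(score, sorted_null):
    """Step 2: empirical c.d.f., the fraction of null k-mers scoring <= score."""
    return bisect_right(sorted_null, score) / len(sorted_null)

def motif_diff(ref, alt, ppm, pooling="max"):
    """Steps 3 and 4: pool the normalized scores, then return ALT minus REF."""
    lom = log_odds(ppm)
    null = null_scores(lom)
    pool = max if pooling == "max" else (lambda xs: sum(xs) / len(xs))
    pooled_ref, pooled_alt = (
        pool([to_probability(s, null) for s in scan(seq, lom)])
        for seq in (ref, alt)
    )
    return pooled_alt - pooled_ref

# Hypothetical toy motif preferring "TGA" (no zero probabilities, so logs are finite).
TOY_PPM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
REF = "CCTGACC"  # contains a perfect match
ALT = "CCTCACC"  # G>C substitution in the middle of the match

print(motif_diff(REF, ALT, TOY_PPM, pooling="max"))
print(motif_diff(REF, ALT, TOY_PPM, pooling="avg"))
```

Both calls return a negative value for this toy variant, the destructive direction in the caption's terms. Enumerating all 4^k k-mers is only feasible for short motifs; it is used here because it keeps the normalization step exact and easy to inspect.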

**Figure 2.**
Performance of fine-tuned Elastic Net predictions built from single-motif predictions calculated by four methods: No-Normalization (raw log-odds diff), FABIAN, probNorm-AvgPooled, and probNorm-MaxPooled, for variants in the ADASTRA dataset [13] separated by target TF. The purple bar shows the fine-tuned predictions from single-motif variant effect features produced by the deep learning model Sei [2] and is included for reference as an upper bound on predictive performance. Panel (A) shows the correlation between the true and predicted values from the regression task, and panel (B) shows the AUROC between the true and predicted direction of effect from the classification task.
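
The two metrics named in this caption can be sketched in plain Python. The helper names and the toy effect sizes below are hypothetical, and defining the "direction of effect" label as the sign of the true effect is an assumption; the Elastic Net fitting itself is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation between true and predicted effect sizes."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true and predicted variant effects (made-up numbers).
true_effect = [1.2, -0.8, 0.5, -1.5, 0.3, -0.2]
predicted = [0.9, -0.4, 0.1, -1.0, -0.1, 0.2]

# Regression task: agreement of effect sizes.
r = pearson(true_effect, predicted)

# Classification task: does the prediction rank positive-effect variants higher?
direction = [t > 0 for t in true_effect]
auc = auroc(direction, predicted)

print(round(r, 3), round(auc, 3))
```

The pairwise AUROC is quadratic in the number of variants, which is fine for a sketch; a rank-based formulation would be preferable for large variant sets.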

**Figure 3.**
(A) Correlation between the z-score-like values from ADASTRA and the single-motif binding effects calculated by the four previously introduced methods, plus motifbreakR [8]. For each variant set, only the column corresponding to the motif of the target TF is used to calculate the correlation with the z-score-like value. The correlation between the logFC reported by ADASTRA and the z-score-like value is shown in gray. The correlations from Sei fine-tuned predictions are included for reference. (B) The gap between single-motif (referred to as Diff) and fine-tuned (referred to as ENet) predictions. (C) Single-motif effects selected as features by the elastic-net model in the variant set targeting IKZF1 binding, with their weights on the y-axis. Note that the IKZF1 motif itself is not selected, which explains the large performance gain after fine-tuning.

**Figure 4.**
(A) Correlation and (B) AUROC of the fine-tuned elastic-net predictions on the caQTL, bQTL, and UDACHA variant sets. In the caQTL dataset, only variants with concordant effects across populations were considered. In the bQTL dataset, variants are separated by their target TF and histone modification. The variants in the UDACHA datasets are separated by experiment and were filtered to include only those with P-value ≤ 0.001.

**Figure 5.**
Average pooling improves effect size value prediction by up to 40% in some cases (A) and yields a small but consistent improvement on all effect size sign prediction tasks (B).

**Figure 6.**
Comparing mono- and dinucleotide models on ADASTRA TF-binding predictions. Each point represents the performance (Pearson correlation) of a specific method on a specific TF-associated variant set. Panel (A) colors the points by method, showing variation across the four tested approaches (FABIAN-MaxPooled, No-Normalization, probNorm-AvgPooled, and probNorm-MaxPooled). Panel (B) colors the same points by motif, corresponding to the ten ADASTRA variant sets labeled by TF (e.g. FOXA1 and CTCF). Thus, each variant set appears as four points (one per method), and each method appears across ten variant sets.
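
To make the mono- versus dinucleotide distinction concrete, the sketch below scores a window under both kinds of model. How motifDiff parametrizes its dinucleotide models is not described here, so the adjacent-pair weight layout, the function names, and the toy weights are all assumptions. The sketch shows that any mononucleotide model can be rewritten as a dinucleotide model with identical scores, while a dinucleotide model can add a dependency between neighbouring bases that no mononucleotide model can express.

```python
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(pair) for pair in product(BASES, repeat=2)]  # 16 dinucleotides

def mono_score(window, mono):
    """Mononucleotide model: each position contributes independently."""
    return sum(mono[i][base] for i, base in enumerate(window))

def dinuc_score(window, dinuc):
    """Dinucleotide model: each adjacent pair of bases contributes one weight."""
    return sum(dinuc[i][window[i:i + 2]] for i in range(len(window) - 1))

def mono_to_dinuc(mono):
    """Rewrite a mononucleotide model in dinucleotide form with identical scores."""
    last_pair = len(mono) - 2
    dinuc = []
    for i in range(last_pair + 1):
        column = {}
        for pair in DINUCS:
            weight = mono[i][pair[0]]
            if i == last_pair:  # the final pair also accounts for the last base
                weight += mono[i + 1][pair[1]]
            column[pair] = weight
        dinuc.append(column)
    return dinuc

# Hypothetical 3-bp mononucleotide weights (made-up, log-odds-like values).
TOY_MONO = [
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": 0.5, "C": -0.5, "G": -0.5, "T": 0.5},
]

EQUIVALENT = mono_to_dinuc(TOY_MONO)      # same scores as TOY_MONO
WITH_DEPENDENCY = mono_to_dinuc(TOY_MONO)
WITH_DEPENDENCY[0]["CG"] += 2.0           # bonus only when C and G co-occur

print(mono_score("CGA", TOY_MONO), dinuc_score("CGA", EQUIVALENT),
      dinuc_score("CGA", WITH_DEPENDENCY))
```

The extra `CG` weight changes the score only for windows that start with that exact pair, which is the kind of neighbour dependency the dinucleotide models in this comparison are able to capture.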

**Figure 7.**
Runtimes for variant effect prediction methods plotted on the log-scale (left panel) and linear scale (right panel, excluding Sei).

**Figure 8.**
probNorm Diff scores show stronger gain signals than Diff scores from Sei. All panels show, for each TF, the positive distribution shift of Diff scores from a Normal distribution on the x-axis and the significance of that shift on the y-axis. Because the EPPS equals 0.5 under the null hypothesis, we centered it by subtracting 0.5 so that binding gains appear clearly as positive values. (A) probNorm Diffs normalized with Max Pooling, (B) raw log-odds Diffs with no normalization, and (C) Sei Diffs on only the TF tracks extracted from Chromatin Profiles.

References

1. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015; 12:931–4. 10.1038/nmeth.3547.
2. Chen KM, Wong AK, Troyanskaya OG et al. A sequence-based global map of regulatory activity for deciphering human genetics. Nat Genet. 2022; 54:940–9. 10.1038/s41588-022-01102-2.
3. Kelley DR, Reshef YA, Bileschi M et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 2018; 28:739. 10.1101/gr.227819.117.
4. Avsec Ž, Agarwal V, Visentin D et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021; 18:1196–203. 10.1038/s41592-021-01252-x.
5. Sasse A, Ng B, Spiro AE et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat Genet. 2023; 55:2060–4. 10.1038/s41588-023-01524-6.

Grants and funding

University of Pittsburgh
