This is a preprint.
"Frustratingly easy" domain adaptation for cross-species transcription factor binding prediction
- PMID: 40501927
- PMCID: PMC12154900
- DOI: 10.1101/2025.05.21.655414
"Frustratingly easy" domain adaptation for cross-species transcription factor binding prediction
Abstract
Motivation: Understanding how DNA sequence encodes gene regulation remains a central challenge in genomics. While deep learning models can predict regulatory activity from sequence with high accuracy, their generalizability across species, and thus their ability to capture fundamental biological principles, remains limited. Cross-species prediction provides a powerful test of model robustness and offers a window into conserved regulatory logic, but effectively bridging species-specific genomic differences remains a major barrier.
Results: We present MORALE, a novel and scalable domain adaptation framework that significantly advances cross-species prediction of transcription factor (TF) binding. By aligning statistical moments of sequence embeddings across species, MORALE enables deep learning models to learn species-invariant regulatory features without requiring adversarial training or complex architectures. Applied to multi-species TF ChIP-seq datasets, MORALE achieves state-of-the-art performance, outperforming both baseline and adversarial approaches across all TFs, while preserving model interpretability and recovering canonical motifs with greater precision. In the five-species transfer setting, MORALE not only improves human prediction accuracy beyond human-only training but also reveals regulatory features conserved across mammals. These results highlight the potential of simple yet powerful domain adaptation techniques to drive generalization and discovery in regulatory genomics. Crucially, MORALE is architecture-agnostic and can be seamlessly integrated into any embedding-based sequence model.
Availability: Code is available at https://github.com/loudrxiv/frustrating.
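As a rough illustration of the moment alignment described in Results, the sketch below penalizes differences between the mean and covariance of embeddings drawn from two species. The abstract does not say which moments are matched or how they are weighted, so the choice of first and second moments, the covariance scaling, and the function name `moment_alignment_loss` are all assumptions made for illustration. This is not MORALE's implementation.

```python
import numpy as np


def moment_alignment_loss(source_emb, target_emb):
    """Hypothetical moment-matching penalty between two embedding batches.

    Each input has shape (n_sequences, embedding_dim), with embedding_dim > 1.
    Assumption: only the first and second moments are aligned.
    """
    # First moment: gap between per-feature means.
    mean_gap = source_emb.mean(axis=0) - target_emb.mean(axis=0)
    # Second moment: gap between feature covariance matrices.
    cov_gap = np.cov(source_emb, rowvar=False) - np.cov(target_emb, rowvar=False)
    d = source_emb.shape[1]
    # Squared mean gap plus squared Frobenius norm of the covariance gap,
    # scaled by 1 / (4 d^2) (an assumed normalization).
    return float(mean_gap @ mean_gap + (cov_gap ** 2).sum() / (4 * d * d))


# Toy embeddings standing in for two species (random data, illustration only).
rng = np.random.default_rng(0)
human_like = rng.normal(size=(64, 8))
mouse_like = rng.normal(loc=0.5, size=(64, 8))
penalty = moment_alignment_loss(human_like, mouse_like)
```

In training, a penalty of this kind would typically be added to the binding-prediction loss so that the embedding layer is pushed toward species-invariant statistics. Because it only needs batches of embeddings, it does not depend on the architecture that produced them.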
Conflict of interest statement
Competing interests: No competing interest is declared.