[Preprint]. 2025 Aug 5:2025.05.21.655414.
doi: 10.1101/2025.05.21.655414.

"Frustratingly easy" domain adaptation for cross-species transcription factor binding prediction

Mark Maher Ebeid et al. bioRxiv.

Abstract

Motivation: Understanding how DNA sequence encodes gene regulation remains a central challenge in genomics. While deep learning models can predict regulatory activity from sequence with high accuracy, their generalizability across species, and thus their ability to capture fundamental biological principles, remains limited. Cross-species prediction provides a powerful test of model robustness and offers a window into conserved regulatory logic, but effectively bridging species-specific genomic differences remains a major barrier.

Results: We present MORALE, a novel and scalable domain adaptation framework that significantly advances cross-species prediction of transcription factor (TF) binding. By aligning statistical moments of sequence embeddings across species, MORALE enables deep learning models to learn species-invariant regulatory features without requiring adversarial training or complex architectures. Applied to multi-species TF ChIP-seq datasets, MORALE achieves state-of-the-art performance, outperforming both baseline and adversarial approaches across all TFs, while preserving model interpretability and recovering canonical motifs with greater precision. In the five-species transfer setting, MORALE not only improves human prediction accuracy beyond human-only training but also reveals regulatory features conserved across mammals. These results highlight the potential of simple yet powerful domain adaptation techniques to drive generalization and discovery in regulatory genomics. Crucially, MORALE is architecture-agnostic and can be seamlessly integrated into any embedding-based sequence model.
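The abstract does not spell out MORALE's exact alignment objective, only that it matches statistical moments of embeddings across species. The sketch below assumes a CORAL-style penalty on the first two moments (batch mean and covariance); the function name, shapes, and weighting are illustrative, not taken from the paper's code.

```python
import numpy as np

def moment_alignment_loss(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Penalize the gap between the first moment (mean) and second
    moment (covariance) of source- and target-domain embedding batches.
    Both inputs have shape (batch_size, embed_dim)."""
    # First-moment gap: squared distance between the batch means.
    mean_gap = float(np.sum((src_emb.mean(axis=0) - tgt_emb.mean(axis=0)) ** 2))
    # Second-moment gap: squared Frobenius distance between covariances.
    cov_gap = float(np.sum((np.cov(src_emb, rowvar=False)
                            - np.cov(tgt_emb, rowvar=False)) ** 2))
    return mean_gap + cov_gap
```

In training, a term like this would be added to the classification loss (with some weighting), so that gradients pull the source and target embedding distributions toward each other without any adversarial component.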

Availability: Code is available at https://github.com/loudrxiv/frustrating.


Conflict of interest statement

Competing interests: No competing interest is declared.

Figures

Figure 1: Schematic overview of the domain adaptation framework with MORALE compared to leading alternatives.
(A) We one-hot encode input sequences from the source and target domains (e.g., human and mouse) and create a latent embedding via a stack of convolutional layers, pooling, and autoregressive components. This embedding feeds all downstream tasks (classification and domain adaptation). We evaluate three procedures: (1) training a naive model on source data only and predicting in the target domain; (2) an adversarial, discriminator-based approach that reverses the gradients of a domain classifier; and (3) a loss that aligns the moments of the embeddings across domains and is added to the total loss. (B) The setup of each approach in more detail, including MORALE: (1) In the naive, source-only model, the embedded features from the source domain are passed to a classifier head that predicts the label for the corresponding sites; after training, the model is evaluated on the target domain. (2) The GRL approach adds a separate branch to the overall scheme. Labels are still predicted from the source data during training, but both source and target examples in the batch are also fed to a classifier that predicts the domain (U) each example comes from. The gradient from this domain classifier is reversed, encouraging a domain-invariant representation for downstream target evaluation. (3) Finally, MORALE uses one branch to predict labels from the source features during training and, in an intermediary stage, aligns the moments of the source and target embeddings; this moment-alignment loss is added to the overall model loss during training.
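The one-hot encoding step in panel (A) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the A/C/G/T column order and the all-zero handling of ambiguous bases are common conventions I am assuming here.

```python
import numpy as np

def one_hot_encode(seq: str) -> np.ndarray:
    """One-hot encode a DNA sequence into a (length, 4) array with
    columns ordered A, C, G, T. Ambiguous bases (e.g. N) map to an
    all-zero row rather than raising an error."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded
```

The resulting (length, 4) arrays are what the convolutional layers consume to produce the shared latent embedding used by all three procedures.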
Figure 2: Moment alignment improves cross-species TF binding site predictions.
For four TFs with human as the target species, prediction performance is shown for four models: (1) source-trained, evaluated on target (red); (2) gradient reversal (green); (3) moment alignment (blue); and (4) target-trained, evaluated on target (purple). (A) Results across the four transcription factors when adapting mouse-trained models to human data. MORALE matches or outperforms the GRL in each case without suffering performance degradation. (B) The same as in (A), but in the other adaptation direction: human-trained models adapted to mouse. The degradation persists in this setting when using gradient reversal, whereas MORALE at least meets the source-trained baseline or outperforms the GRL.
Figure 3:
Pearson correlation coefficients between MORALE’s importance scores and the “target” model (x-axes), and between GRL’s importance scores and the “target” model (y-axes), for differentially false positive sites (dFPs) and differentially false negative sites (dFNs); see text for details. Panel A shows the analyses for mouse-to-human and panel B for human-to-mouse. P-values for a one-sided Wilcoxon rank-sum test are indicated for each plot.
Figure 4: MORALE discovers de novo motifs more similar to CTCF.
Using attribution scores computed for the source, source-adapted, and target models, we compare the de novo motifs found across 2,000 randomly sampled CTCF-bound sites. We display the output of TF-MoDISco in the following format. (A) The proportions of all motifs found across the four models; notably, the target model (human-on-human) finds only CTCF (and its paralog, CTCFL). The source-trained model and the source-adapted models primarily find CTCF, but the source-trained and GRL models both report other motifs in these CTCF-bound sites at a higher proportion than MORALE, which reports almost exclusively CTCF. (B) The top 5 de novo motifs found by the source-trained model, with annotated p-values and the corresponding TomTom matches on the right y-axes; the same is shown in (C), (D), and (E). MORALE's top match strongly resembles the established CTCF motif with a significant q-value.
Figure 5: MORALE attains higher performance when training on multiple source species and predicting in human as the target.
Across the four transcription factors tested, MORALE consistently increases performance compared to no domain adaptation. Notably, unlike in the two-species case, training on multiple source species allows the model to outperform the human-on-human model, which previously acted as an unattained upper bound.
Figure 6: Species contribute differentially to the overall boost in multi-species results, but each ultimately provides benefit.
We quantify the effect of holding out species (1) individually and (2) in groups, to understand how model performance changes when using MORALE for domain adaptation. (A) Per-species holdout performance. ‘No knockout’ denotes model performance when all source species are used, with human as the target species. From left to right, we hold out a single species from the overall source set to determine the effect of its removal on performance. (B) Results for holding out groups of species at a time. From left to right, we display a monotonically decreasing set of source species, from all included (leftmost) to a single species (rightmost). This shows that the number of species included in the training set aids overall model performance.
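The two ablation schemes in Figure 6 amount to enumerating reduced source sets. The helper below sketches that bookkeeping; the species list is hypothetical (the figure does not name the sources here), and the function is illustrative rather than the paper's code.

```python
def holdout_configs(source_species):
    """Enumerate the two ablation schemes: (1) knock out one species at
    a time, and (2) shrink the source set one species at a time, from
    the full set down to a single species."""
    single_knockouts = {
        held: [s for s in source_species if s != held]
        for held in source_species
    }
    nested_subsets = [source_species[:k]
                      for k in range(len(source_species), 0, -1)]
    return single_knockouts, nested_subsets

# Hypothetical source set; human is the held-out target throughout.
SOURCES = ["mouse", "rat", "dog", "cow"]
```

Each returned configuration would be used to retrain the MORALE model and evaluate on human, yielding the per-knockout and per-subset bars in panels (A) and (B).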

