Fine-tuning a sentence transformer for DNA
- PMID: 41162866
- PMCID: PMC12574151
- DOI: 10.1186/s12859-025-06291-1
Abstract
Background: Sentence-transformers is a library that provides easy methods for generating embeddings of sentences, paragraphs, and images. Embedding texts in a vector space where similar texts lie close to one another enables applications such as sentiment analysis, retrieval, and clustering. This study fine-tunes a sentence transformer model designed for natural language on DNA text and evaluates it across eight benchmark tasks. The objective is to assess how effective this model is compared with domain-specific DNA transformers such as DNABERT and the Nucleotide Transformer.
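The sketch below shows one way such fine-tuning can be set up with the sentence-transformers library, using the unsupervised SimCSE recipe named in the keywords (each sequence paired with itself, with dropout providing two views and in-batch negatives supplying the contrast). The base checkpoint, the 6-mer rendering of DNA into space-separated "sentences", and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the paper's exact pipeline): SimCSE-style unsupervised
# fine-tuning of a general-purpose sentence transformer on DNA "sentences".
# Assumptions: 6-mer tokenization, the "all-MiniLM-L6-v2" base checkpoint,
# and the tiny toy corpus below are all illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses


def to_kmer_sentence(seq: str, k: int = 6) -> str:
    """Turn a raw DNA string into a space-separated k-mer 'sentence'."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy corpus; in practice this would be a large collection of genomic fragments.
dna_sequences = [
    "ATGCGTACGTTAGCCTAGGATCCGATCGATCGTTACG",
    "GGCTAACGTTAGGCATCGATCGGATCCATGCGTACGA",
]
train_sentences = [to_kmer_sentence(s) for s in dna_sequences]

# SimCSE: identical text pairs; dropout inside the encoder yields two
# different views, and other sequences in the batch act as negatives.
train_examples = [InputExample(texts=[s, s]) for s in train_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    show_progress_bar=False,
)
model.save("dna-sentence-transformer")  # hypothetical local output path
```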
Results: The fine-tuned model generated DNA embeddings that outperformed DNABERT on multiple tasks, although it did not surpass the Nucleotide Transformer in raw classification accuracy. The Nucleotide Transformer excelled on most tasks, but this superiority came at a substantial computational cost, rendering it impractical for resource-constrained environments such as low- and middle-income countries (LMICs). The Nucleotide Transformer also performed worse on retrieval tasks and required longer embedding extraction times. Consequently, the proposed model presents a viable alternative that balances classification accuracy against computational cost.
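As an illustration of the retrieval and embedding-extraction setting described above, the following sketch encodes a small DNA corpus with a fine-tuned checkpoint and ranks it against a query by cosine similarity. The checkpoint path, sequences, and k-mer representation are assumptions carried over from the previous sketch, not the paper's benchmark code.

```python
# Minimal sketch (illustrative, not the paper's evaluation code): embedding
# extraction plus cosine-similarity retrieval with a fine-tuned model.
from sentence_transformers import SentenceTransformer, util


def to_kmer_sentence(seq: str, k: int = 6) -> str:
    """Turn a raw DNA string into a space-separated k-mer 'sentence'."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical checkpoint saved by the fine-tuning sketch above.
model = SentenceTransformer("dna-sentence-transformer")

corpus = [to_kmer_sentence(s) for s in [
    "ATGCGTACGTTAGCCTAGGATCCGATCGATCGTTACG",
    "GGCTAACGTTAGGCATCGATCGGATCCATGCGTACGA",
    "TTTACGGATCCTAGGCTAACGTACGCATGGATCGATC",
]]
query = to_kmer_sentence("ATGCGTACGTTAGCCTAGGATCC")

# Embedding extraction: one forward pass per sequence.
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus sequences by cosine similarity to the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"corpus[{hit['corpus_id']}] score={hit['score']:.3f}")
```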
Keywords: BERT; DNABERT; Sentence transformers; SimCSE; The nucleotide transformer.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: T1 dataset: Ethics approval for the T1 data was granted by the University of Pretoria EBIT Research Ethics Committee (EBIT/139/2020), together with the South Eastern Sydney Local Health District Human Research Ethics Committee (approval numbers H00/022 and 00113), and all participants provided written informed consent. T2 dataset: The EBIT Research Ethics Committee at the University of Pretoria, South Africa, granted ethical approval (Ethics Reference No. 43/2010; 11 August 2020) for the use of blood BRCA1 DNA sequences from twelve patients with a histopathological ISUP Grade Group of 1 (representing low-risk prostate cancer) or 5 (representing high-risk prostate cancer). The patients were enrolled and provided consent in accordance with the approval obtained from the University of Pretoria Faculty of Health Sciences Research Ethics Committee (43/2010) in South Africa, and the DNA sequencing was conducted with approval from the St. Vincent’s Hospital Human Research Ethics Committee (HREC) SVH/15/227 in Sydney, Australia. T3 dataset: This dataset was obtained from ENCODE and other public genome annotation repositories [23]; these repositories handle ethics approvals and informed consent at the stage of data collection. T4–T7 datasets: These datasets were drawn from public biology/genomics repositories (e.g., ENCODE, human genomes) [24]; these repositories likewise handle ethics approvals and informed consent at the stage of data collection. Accordingly, all datasets used in this study were obtained and used in full compliance with the Declaration of Helsinki. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.
References
- Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv. 2021;56:1–40.
- Wang H, Li J, Wu H, Hovy E, Sun Y. Pre-trained language models and their applications. Engineering. 2022;25:51–65.
- Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding; 2018. arXiv preprint arXiv:1810.04805.
- Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R. Transformer-XL: attentive language models beyond a fixed-length context; 2019. arXiv preprint arXiv:1901.02860.
- Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: generalized autoregressive pretraining for language understanding. Adv Neural Inf Process Syst. 2019;32:64.