BMC Bioinformatics. 2025 Oct 29;26(1):267.
doi: 10.1186/s12859-025-06291-1.

Fine-tuning a sentence transformer for DNA


Mpho Mokoatle et al. BMC Bioinformatics. 2025.

Abstract

Background: Sentence-Transformers is a library that provides easy methods for generating embeddings for sentences, paragraphs, and images. Embedding texts in a vector space where similar texts lie close to one another enables applications such as sentiment analysis, retrieval, and clustering. This study fine-tunes a sentence transformer model designed for natural language on DNA text and then evaluates it across eight benchmark tasks. The objective is to assess the efficacy of this transformer in comparison to domain-specific DNA transformers such as DNABERT and the Nucleotide Transformer.
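For orientation, the following minimal sketch illustrates the embedding workflow the Background describes, using the sentence-transformers API with a generic public checkpoint chosen purely for illustration (it is not the model fine-tuned in this study):

    # Minimal sketch of the sentence-transformers workflow: encode texts into a
    # shared vector space and compare them by cosine similarity. The checkpoint
    # name below is an illustrative public model, not the model from this study.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = [
        "The embeddings of similar texts lie close together.",
        "Similar sentences end up near each other in the vector space.",
        "DNA can be tokenized into k-mers before embedding.",
    ]

    embeddings = model.encode(sentences)                 # shape: (3, embedding_dim)
    similarities = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarity
    print(similarities)  # the first two sentences should score highest against each other

The same encode-then-compare pattern underlies the classification, retrieval, and clustering uses mentioned above; only the downstream consumer of the embeddings changes.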

Results: The findings indicated that the proposed fine-tuned model generated DNA embeddings that outperformed DNABERT on multiple tasks. However, the proposed model did not surpass the Nucleotide Transformer in raw classification accuracy. The Nucleotide Transformer excelled on most tasks, but this superiority came at significant computational expense, rendering it impractical for resource-constrained environments such as low- and middle-income countries (LMICs). The Nucleotide Transformer also performed worse on retrieval tasks and required more time for embedding extraction. Consequently, the proposed model presents a viable option that balances computational cost and accuracy.

Keywords: BERT; DNABERT; Sentence transformers; SimCSE; The nucleotide transformer.


Conflict of interest statement

Declarations. Ethics approval and consent to participate: T1 dataset: Ethics approval for the T1 data was granted by the University of Pretoria EBIT Research Ethics Committee (EBIT/139/2020), together with the South Eastern Sydney Local Health District Human Research Ethics Committee (approval numbers H00/022 and 00113), and all participants provided written informed consent. T2 dataset: The EBIT Research Ethics Committee at the University of Pretoria, South Africa, granted ethical approval (Ethics Reference No: 43/2010; 11 August 2020) for the use of blood BRCA1 DNA sequences from twelve patients with histopathological ISUP Grade Group 1 (representing low-risk prostate cancer) or 5 (representing high-risk prostate cancer). The patients were enrolled and provided consent in accordance with the approval obtained from the University of Pretoria Faculty of Health Sciences Research Ethics Committee (43/2010) in South Africa, and the DNA sequencing was conducted with approval from the St. Vincent’s Hospital Human Research Ethics Committee (HREC) SVH/15/227 in Sydney, Australia. T3 dataset: This dataset was obtained from ENCODE and other public genome annotation repositories [23]; ethics approvals and informed consent are handled by the repositories at the stage of data collection. T4–T7 datasets: These datasets were drawn from public biology/genomics repositories (e.g., ENCODE, human genomes) [24]; ethics approvals and informed consent are likewise handled by the repositories at the stage of data collection. Accordingly, all datasets used in this study were obtained and used in full compliance with the Declaration of Helsinki. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
The proposed model fine-tunes a pretrained checkpoint of the unsupervised SimCSE model from Hugging Face [18] using a modified training script [20]. The model was trained on k-mer DNA sequences sampled from the human reference genome for 1 epoch with a batch size of 16 and a maximum sequence length of 312. The fine-tuned model was then used to create sentence embeddings for DNA tasks. Finally, the generated sentence embeddings served as input to machine learning algorithms for the classification of the eight DNA tasks described in Table 2. (An illustrative training sketch follows the figure list.)
Fig. 2
Embedding extraction time by model
Fig. 3
Retrieval benchmark
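
The figure list above summarizes the fine-tuning recipe (Fig. 1). The sketch below is a rough, hedged illustration of how that recipe could be reproduced with the sentence-transformers library: the SimCSE checkpoint name, the k-mer size, and the input file are illustrative assumptions, while the epoch count (1), batch size (16), and maximum sequence length (312) follow the Fig. 1 caption. Unsupervised SimCSE is approximated here by pairing each sentence with itself so that dropout yields two distinct views for the contrastive loss.

    # Hedged sketch: SimCSE-style fine-tuning of a sentence transformer on DNA k-mers.
    # Assumptions: the checkpoint name, k = 6, and the input file are illustrative only.
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader

    def to_kmer_sentence(seq: str, k: int = 6) -> str:
        """Turn a raw DNA string into a space-separated k-mer 'sentence'."""
        return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Assumed unsupervised SimCSE checkpoint from Hugging Face.
    model = SentenceTransformer("princeton-nlp/unsup-simcse-bert-base-uncased")
    model.max_seq_length = 312  # per the Fig. 1 caption

    # dna_sequences.txt: one raw DNA window per line (hypothetical input file
    # sampled from the human reference genome).
    with open("dna_sequences.txt") as f:
        sentences = [to_kmer_sentence(line.strip()) for line in f if line.strip()]

    # Unsupervised SimCSE: each sentence is paired with itself; dropout produces two
    # different embeddings per pass, which act as a positive pair for the loss.
    train_examples = [InputExample(texts=[s, s]) for s in sentences]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
    model.save("dna-sentence-transformer")

The saved model can then be used, as in the earlier sketch, to encode DNA sequences into embeddings that feed downstream classifiers for the eight benchmark tasks.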

References

    1. Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv. 2021;56:1–40.
    2. Wang H, Li J, Wu H, Hovy E, Sun Y. Pre-trained language models and their applications. Engineering. 2022;25:51–65.
    3. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding; 2018. arXiv preprint arXiv:1810.04805.
    4. Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R. Transformer-XL: attentive language models beyond a fixed-length context; 2019. arXiv preprint arXiv:1901.02860.
    5. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: generalized autoregressive pretraining for language understanding. Adv Neural Inf Process Syst. 2019;32:64.
