Fine-tuning a sentence transformer for DNA
- PMID: 41162866
- PMCID: PMC12574151
- DOI: 10.1186/s12859-025-06291-1
Abstract
Background: Sentence-transformers is a library that provides easy methods for generating embeddings of sentences, paragraphs, and images. Embedding texts in a vector space where similar texts lie close to one another enables applications such as sentiment analysis, retrieval, and clustering. This study fine-tunes a sentence transformer model designed for natural language on DNA text and evaluates it across eight benchmark tasks. The objective is to assess how effective this model is compared with domain-specific DNA transformers such as DNABERT and the Nucleotide Transformer.
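The sketch below shows one way such fine-tuning can be set up with the sentence-transformers library, using the unsupervised SimCSE recipe named in the keywords (each sequence paired with itself, with dropout providing two views and in-batch negatives supplying the contrast). The base checkpoint, the 6-mer rendering of DNA into space-separated "sentences", and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the paper's exact pipeline): SimCSE-style unsupervised
# fine-tuning of a general-purpose sentence transformer on DNA "sentences".
# Assumptions: 6-mer tokenization, the "all-MiniLM-L6-v2" base checkpoint,
# and the tiny toy corpus below are all illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses


def to_kmer_sentence(seq: str, k: int = 6) -> str:
    """Turn a raw DNA string into a space-separated k-mer 'sentence'."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy corpus; in practice this would be a large collection of genomic fragments.
dna_sequences = [
    "ATGCGTACGTTAGCCTAGGATCCGATCGATCGTTACG",
    "GGCTAACGTTAGGCATCGATCGGATCCATGCGTACGA",
]
train_sentences = [to_kmer_sentence(s) for s in dna_sequences]

# SimCSE: identical text pairs; dropout inside the encoder yields two
# different views, and other sequences in the batch act as negatives.
train_examples = [InputExample(texts=[s, s]) for s in train_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    show_progress_bar=False,
)
model.save("dna-sentence-transformer")  # hypothetical local output path
```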
Results: The fine-tuned model generated DNA embeddings that outperformed DNABERT on multiple tasks, although it did not surpass the Nucleotide Transformer in raw classification accuracy. The Nucleotide Transformer excelled on most tasks, but this superiority came at a substantial computational cost, rendering it impractical for resource-constrained environments such as low- and middle-income countries (LMICs). The Nucleotide Transformer also performed worse on retrieval tasks and required longer embedding extraction times. Consequently, the proposed model presents a viable alternative that balances classification accuracy against computational cost.
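As an illustration of the retrieval and embedding-extraction setting described above, the following sketch encodes a small DNA corpus with a fine-tuned checkpoint and ranks it against a query by cosine similarity. The checkpoint path, sequences, and k-mer representation are assumptions carried over from the previous sketch, not the paper's benchmark code.

```python
# Minimal sketch (illustrative, not the paper's evaluation code): embedding
# extraction plus cosine-similarity retrieval with a fine-tuned model.
from sentence_transformers import SentenceTransformer, util


def to_kmer_sentence(seq: str, k: int = 6) -> str:
    """Turn a raw DNA string into a space-separated k-mer 'sentence'."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical checkpoint saved by the fine-tuning sketch above.
model = SentenceTransformer("dna-sentence-transformer")

corpus = [to_kmer_sentence(s) for s in [
    "ATGCGTACGTTAGCCTAGGATCCGATCGATCGTTACG",
    "GGCTAACGTTAGGCATCGATCGGATCCATGCGTACGA",
    "TTTACGGATCCTAGGCTAACGTACGCATGGATCGATC",
]]
query = to_kmer_sentence("ATGCGTACGTTAGCCTAGGATCC")

# Embedding extraction: one forward pass per sequence.
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus sequences by cosine similarity to the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"corpus[{hit['corpus_id']}] score={hit['score']:.3f}")
```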
Keywords: BERT; DNABERT; Sentence transformers; SimCSE; The nucleotide transformer.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: T1 dataset: Ethics approval for the T1 data was granted by the University of Pretoria EBIT Research Ethics Committee (EBIT/139/2020), together with the South Eastern Sydney Local Health District Human Research Ethics Committee (approval numbers H00/022 and 00113), and all participants provided written informed consent. T2 dataset: The EBIT Research Ethics Committee at the University of Pretoria, South Africa, granted ethical approval (Ethics Reference No. 43/2010; 11 August 2020) for the use of blood BRCA1 DNA sequences from twelve patients with a histopathological ISUP Grade Group of 1 (representing low-risk prostate cancer) or 5 (representing high-risk prostate cancer). The patients were enrolled and provided consent in accordance with the approval obtained from the University of Pretoria Faculty of Health Sciences Research Ethics Committee (43/2010) in South Africa, and the DNA sequencing was conducted with approval from the St. Vincent’s Hospital Human Research Ethics Committee (HREC) SVH/15/227 in Sydney, Australia. T3 dataset: This dataset was obtained from ENCODE and other public genome annotation repositories [23]; these repositories handle ethics approvals and informed consent at the stage of data collection. T4–T7 datasets: These datasets were drawn from public biology/genomics repositories (e.g., ENCODE, human genomes) [24]; these repositories likewise handle ethics approvals and informed consent at the stage of data collection. Accordingly, all datasets used in this study were obtained and used in full compliance with the Declaration of Helsinki. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.
References
- Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv. 2021;56:1–40.
- Wang H, Li J, Wu H, Hovy E, Sun Y. Pre-trained language models and their applications. Engineering. 2022;25:51–65.
- Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding; 2018. arXiv preprint arXiv:1810.04805.
- Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R. Transformer-XL: attentive language models beyond a fixed-length context; 2019. arXiv preprint arXiv:1901.02860.
- Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: generalized autoregressive pretraining for language understanding. Adv Neural Inf Process Syst. 2019;32:64.