Review
Brief Bioinform. 2025 Jul 2;26(4):bbaf357. doi: 10.1093/bib/bbaf357.

Bridging artificial intelligence and biological sciences: a comprehensive review of large language models in bioinformatics

Anqi Lin et al.

Abstract

Large language models (LLMs) represent a breakthrough in artificial intelligence and have demonstrated substantial application value and development potential in bioinformatics, particularly in the processing and analysis of complex biological data. This review systematically examines the development and applications of LLMs in bioinformatics, with emphasis on their advances in protein and nucleic acid structure prediction, omics analysis, drug design and screening, and biomedical literature mining. It highlights the distinctive capabilities of LLMs in end-to-end learning and knowledge transfer, and discusses the major challenges confronting current applications, including model interpretability and data bias. The review further explores the potential of LLMs in cross-modal learning and interdisciplinary development. In conclusion, this paper summarizes the current state of LLM research in bioinformatics, objectively evaluates its advantages and limitations, and offers recommendations for future research directions, positioning LLMs as essential tools in bioinformatics research and fostering innovation in the biomedical field.

Keywords: LLMs; artificial intelligence; bioinformatics; large language models.


Figures

Figure 1
Contemporary applications and advances of LLMs in bioinformatics. This figure categorizes computational tools and models in bioinformatics and biomedicine into five main domains:
1. DNA/RNA sequence analysis, functional and structure prediction: includes tools for sequence analysis (e.g. HyenaDNA, DNAGPT), functional prediction (e.g. BERT-enhancer, DNABERT), and structure-focused methods (e.g. RNABERT, GeoBoost2).
2. Protein sequence analysis, functional and structure prediction: covers protein sequence modeling (e.g. ESM, ProtGPT), post-translational modification prediction (e.g. EpiBERTope, TransPPMP), and structural analysis (e.g. ProteinBERT, MSA Transformer).
3. Multi-omics data analysis: features tools for genomics (e.g. scGPT, iDNA-ABT), epigenomics (e.g. scELMo, Mulan-Methyl), and integrative omics approaches (e.g. DeepGene Transformer, POOE).
4. Computational drug discovery and design: includes models for molecular design (e.g. MolGPT, ChemBERTa), drug–target interaction (e.g. DTI-BERT, TransDTI), and pharmaceutical applications (e.g. PharmBERT).
5. Biomedical literature mining: lists NLP models for biomedical text analysis (e.g. BioBERT, ClinicalBERT, Galactica).
This figure was created based on the tools provided by Biorender.com (accessed on 15 May 2025).
Figure 2
Key advantages of large language models (LLMs) in bioinformatics research. LLMs demonstrate several distinct advantages: the capability to process extended sequences and high-dimensional data, capture complex semantic and contextual information, perform cross-modal learning and knowledge transfer, reduce manual feature engineering through end-to-end learning, and leverage massive unlabeled data through self-supervised learning.

Processing of extended sequences and high-dimensional data is facilitated by advanced sequence tokenization techniques, integrated dimensionality reduction technologies, autoencoder architectures, and multi-head attention mechanisms. Through self-supervised learning and transformer-based architectures, particularly bidirectional encoder representations from transformers (BERT), LLMs capture intricate semantic relationships and contextual information. LLMs also integrate and process multimodal data, including textual, visual, and audio inputs, while achieving efficient cross-corpus transfer. The self-supervised learning paradigm leverages vast quantities of unlabeled data through multilayer neural network architectures to automatically process complex biological data. Finally, end-to-end learning significantly reduces the need for manual feature engineering, addressing traditional supervised learning's dependence on manually annotated data, particularly in applications such as protein sequence prediction and nucleotide sequence analysis.

This figure was created based on the tools provided by Biorender.com (accessed on 15 May 2025). LLMs, large language models; BERT, bidirectional encoder representations from transformers.
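The sequence tokenization and self-supervised learning described in this caption can be illustrated with a minimal sketch: overlapping k-mer tokenization of a DNA sequence (the scheme popularized by DNABERT-style models) followed by random masking for a BERT-style masked-language-model objective. The function names and parameters here are illustrative, not taken from any specific tool in the review.

```python
import random

def kmer_tokenize(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, a common
    tokenization scheme for DNA language models."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask a fraction of tokens, producing (inputs, labels)
    for a masked-language-model objective: labels hold the original
    token at masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# An 8-base sequence yields six overlapping 3-mers.
tokens = kmer_tokenize("ATGCGTAC", k=3)
masked, labels = mask_tokens(tokens, mask_rate=0.3)
```

During pretraining, a model would be trained to recover the original k-mer at each masked position from its surrounding context, which is what lets LLMs exploit massive unlabeled sequence collections without manual annotation.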
Figure 3
Future development directions of large language models (LLMs) in bioinformatics. Key areas include multimodal fusion learning, knowledge-guided architectural design, model optimization for lightweight deployment and efficient inference, development of explainable artificial intelligence (AI) systems, deep integration with experimental biology, enhancement of ethical and privacy protection mechanisms, and promotion of interdisciplinary collaboration and open science.

Multimodal fusion learning can be advanced through multidimensional deep analysis, systematic integration of multi-omics data, and enhancement of model generalization. Knowledge-guided architectural design calls for integrating biomedical ontologies and knowledge graphs into model frameworks, alongside knowledge distillation techniques for building adaptive learning systems with automatic knowledge base updating. Model optimization and efficient inference can be achieved through specialized attention mechanisms, model distillation, and federated learning strategies. For explainable AI, advanced visualization techniques and counterfactual explanation methods warrant investigation, coupled with interactive interpretation systems that enhance model transparency. Deep integration between LLMs and experimental biology will require intelligent experimental design systems incorporating experimental feedback mechanisms and evaluation frameworks that bridge computational predictions with experimental outcomes.

To strengthen ethical and privacy protection, technologies including federated learning, bias mitigation, and differential privacy should be explored, together with robust ethical review systems and standardized regulatory frameworks at the institutional level. For interdisciplinary collaboration and open science, priorities include cross-disciplinary research tools, open-access biological databases, standardized evaluation benchmarks, and comprehensive open-source platforms. This figure was created based on the tools provided by Biorender.com (accessed on 15 May 2025).
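Of the model-compression techniques named in the Figure 3 caption, knowledge distillation has a compact mathematical core: a small student model is trained to match the temperature-softened output distribution of a large teacher. The sketch below shows that loss term only (KL divergence between softened distributions); it is a minimal illustration with hypothetical function names, not the implementation of any model cited in the review.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature
    softens the distribution, exposing 'dark knowledge' in the
    teacher's near-miss classes."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence KL(p_teacher || p_student) between the
    temperature-softened distributions -- the core objective used
    to compress large models for lightweight deployment."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge; in practice it is combined with a standard supervised loss on hard labels.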
