Review
Brief Bioinform. 2024 Sep 23;25(6):bbae548.
doi: 10.1093/bib/bbae548.

Progress and opportunities of foundation models in bioinformatics

Qing Li et al. Brief Bioinform.

Abstract

Bioinformatics has undergone a paradigm shift driven by artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in the field such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to investigate and summarize FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. We focus on the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks such as sequence analysis, structure prediction, and function annotation. Each section examines the targeted challenges in detail, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination serves not only as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.

Keywords: artificial intelligence; bioinformatics; foundation models; large language models.

Figures

Figure 1
FMs in artificial intelligence (AI) and bioinformatics. (i) FMs in AI. General FMs are predominantly pretrained on diverse digital data and fine-tuned for various computer applications, such as question-answering systems, image design, and computer games. (ii) FMs in bioinformatics. FMs in bioinformatics primarily focus on core biological problems including biological sequence analysis, biological structure construction, and biological function prediction, encompassing both annotated and unannotated biological datasets. They can undergo pretraining on multiple phases of biological data for diverse downstream tasks. Based on the pretraining architectures of the foundation model, they can be classified into discriminative FMs, which capture complex patterns and relationships within annotated data through masking strategies for classification or regression tasks, and generative FMs, which focus on generating semantic features and context from novel data or predicting their associations. (iii) Deep learning modules. Deep learning modules are the cornerstone of building encoders and decoders in FMs. Commonly used modules include MLP, CNN, AutoEncoder (input is consistent with output), GCN (input with graph structure), and transformer (rectangle represents attention mechanism). All these deep learning modules can be trained in an end-to-end manner, enhancing computational efficiency through parallel processing mechanisms.
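The masking strategy mentioned above for discriminative FMs can be illustrated with a minimal sketch. The caption does not specify a tokenizer, a masking rate, or a mask symbol, so the non-overlapping 3-mer tokenization, the 15% rate, the `[MASK]` token, and the function names `kmer_tokenize` and `mask_tokens` are all assumptions chosen for illustration, not the method of any particular model.

```python
import random

MASK = "[MASK]"  # assumed placeholder token; real models define their own


def kmer_tokenize(seq, k=3):
    """Split a sequence into non-overlapping k-mers (an assumed tokenization)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (corrupted, targets), where targets maps each masked position
    to the original token a model would be trained to recover.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token, and never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = corrupted[i]
        corrupted[i] = MASK
    return corrupted, targets


tokens = kmer_tokenize("ATGGCGTACGTTAGC")
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

During pretraining, a model would receive `corrupted` as input and be scored on how well it predicts the entries of `targets`; no labels beyond the sequence itself are needed.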
Figure 2
Timeline of FMs in bioinformatics and their background in deep learning. The emergence of FMs in bioinformatics coincided with the rise of deep learning and gained significant momentum as these models showed remarkable advances in the era of big data. Landmark achievements such as AlphaGo, the first program to defeat top professional human Go players, significantly enriched the landscape of deep learning. Subsequent developments, exemplified by AlphaFold and AlphaFold2, revolutionized protein structure prediction from biological sequences. The introduction of GPT-4 marked a pivotal moment, catalyzing a surge in the application of FMs. These advances enabled FMs in bioinformatics (both discriminative and generative) to extract salient information for practical applications in biology.
Figure 3
Challenges and opportunities in applying FMs for biological problems. FMs for addressing biological problems face hurdles related to biological data, model structures, and social influence. At the same time, the increasing availability of biological data, advancements in FMs, and their versatile real-world applications create opportunities in bioinformatics. The top half of this figure outlines data-related challenges such as noise and sparsity, increasing diversity, long sequence length, and multimodality in biological data collection. It also depicts challenges in model design and construction: training efficiency, model explainability, and the establishment of evaluation standards. Social influences, including ethics and fairness, privacy concerns, potential misuse, and social bias, further compound these challenges. Conversely, the bottom half of the figure illustrates emerging opportunities driven by the proliferation of diverse biological data types and volumes, including RNA, DNA, scMultiomics, proteins, and knowledge graphs/networks. The enhancement of FMs, particularly through pretraining mechanisms, presents another avenue for progress. Moreover, a wide range of applications, spanning surgery, hormonal therapy, immunotherapy, radiotherapy, personalized therapy, chemotherapy, bone marrow transplant, drug discovery, and online healthcare, underscores the potential impact of FMs in bioinformatics. These developments signal a promising trajectory for the application of FMs in addressing biological complexities.
