Brief Bioinform. 2025 Mar 4;26(2):bbaf149. doi: 10.1093/bib/bbaf149.

FGeneBERT: function-driven pre-trained gene language model for metagenomics


Chenrui Duan et al. Brief Bioinform.

Abstract

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils and significantly impact human health and ecological functions. However, current research relies on K-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet-enhanced metagenomic contrastive learning to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels (gene, functional, bacterial, and environmental), with inputs ranging from 1 to 213k sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.
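The triplet-enhanced contrastive objective mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the margin value, the use of Euclidean distance, and the function name are assumptions, and the embeddings would in practice come from the pre-trained gene encoder.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss over batches of embeddings.

    Pulls each anchor toward its positive sample (e.g. an augmented
    view of the same sequence) and pushes it away from its negative
    sample by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor-negative distance
    # Hinge: zero loss once the negative is margin farther than the positive
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

With identical anchor and positive embeddings and a distant negative, the loss is zero; when the negative collapses onto the anchor, the loss equals the margin.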

Keywords: DNA; metagenomics; pre-trained language model; transformer.


Figures

Figure 1
Motivation. Two types of complex relationships between gene sequences and functions in metagenomics. The one-to-many problem means that the same gene may display different functions depending on genomic context; for example, ATP synthase works differently in plants, heterotrophic bacteria, and humans. The many-to-one problem means that multiple genes may perform the same function; for instance, different genes from different bacteria, e.g., Cpf1 and Cas1, produce the same resistance function within the CRISPR immune system.
Figure 2
Overview of FGeneBERT. A metagenomic sequence formula image is converted into ordered protein-based gene representations formula image via a Context-Aware Tokenizer. Next, we pre-train a Gene Encoder with formula image; 15% of these tokens are masked to predict labels formula image. Meanwhile, we introduce formula image to distinguish gene sequences. The data augmentation and negative sampling modules generate positive samples formula image and negative samples formula image, respectively. Finally, after fine-tuning, FGeneBERT can handle various downstream tasks.
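The 15% masking step in this pipeline can be sketched roughly as follows. This is a hypothetical helper, not the paper's exact recipe: the mask token id, the -100 ignore index, and uniform sampling without replacement are all assumptions borrowed from common masked-language-modeling practice.

```python
import numpy as np

def mask_gene_tokens(token_ids, mask_id, mask_ratio=0.15, seed=0):
    """Corrupt a fraction of gene tokens for masked gene modeling.

    Returns (corrupted ids, per-position labels); unmasked positions
    carry -100 so a cross-entropy loss can ignore them.
    """
    rng = np.random.default_rng(seed)
    ids = np.asarray(token_ids).copy()
    n_mask = max(1, int(round(len(ids) * mask_ratio)))
    picked = rng.choice(len(ids), size=n_mask, replace=False)
    labels = np.full(len(ids), -100)
    labels[picked] = ids[picked]   # original ids become prediction targets
    ids[picked] = mask_id          # corrupt the input at the chosen positions
    return ids, labels
```

For a 20-token sequence at a 15% ratio, exactly three positions are replaced by the mask id, and only those three positions carry non-ignored labels.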
Figure 3
Framework of the Context-Aware Tokenizer. Gene sequences formula image extracted from a metagenomic sequence formula image are translated into amino acid sequences and encoded into ESM-2 representations formula image. These representations are then concatenated sequentially, every formula image representations forming one gene group formula image.
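The grouping step at the end of this caption can be sketched as below. The per-gene ESM-2 embeddings are stood in for by a plain array (running ESM-2 itself is out of scope here), and the group size parameter `k` and the remainder-dropping behavior are assumptions for illustration.

```python
import numpy as np

def group_gene_embeddings(gene_embs, k):
    """Concatenate every k consecutive per-gene embeddings into one group.

    gene_embs: array of shape (n_genes, d), e.g. pooled ESM-2 vectors
    (stubbed here). Returns shape (n_genes // k, k * d), dropping any
    trailing genes that do not fill a complete group.
    """
    n, d = gene_embs.shape
    n_groups = n // k
    # Keep only full groups, then fold each run of k vectors into one row
    return gene_embs[: n_groups * k].reshape(n_groups, k * d)
```

For example, 10 genes with 8-dimensional embeddings and k = 3 yield 3 group tokens of dimension 24, with the last gene discarded.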
Figure 4
Visualization of attention. The attention value is from the first head of the last (19th) attention layer. Darker shading indicates higher attention weight.
Figure 5
Ablation studies of our proposed modules on four downstream tasks.
Figure 6
t-SNE visualization of different embeddings for ATP synthases. Each dot denotes a sequence and is grouped according to function.
Figure 7
Comparative analysis of tokenization efficiency: time (s) vs. memory (MB). Each point denotes a specific dataset, with its size indicating the dataset's scale.
Figure 8
Sensitivity w.r.t. hyper-parameter formula image on the CARD dataset (AMR gene family).

