Brief Bioinform. 2025 Mar 4;26(2):bbaf149. doi: 10.1093/bib/bbaf149.

FGeneBERT: function-driven pre-trained gene language model for metagenomics


Chenrui Duan et al. Brief Bioinform.

Abstract

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils and significantly impact human health and ecological functions. However, current research relies on K-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle to encode biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet-enhanced metagenomic contrastive learning to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels (gene, functional, bacterial, and environmental), with inputs ranging from 1 to 213k sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.
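The triplet-enhanced contrastive objective mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the margin value, the use of Euclidean distance, and the function name are assumptions, and the embeddings would in practice come from the pre-trained gene encoder.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss over batches of embeddings.

    Pulls each anchor toward its positive sample (e.g. an augmented
    view of the same sequence) and pushes it away from its negative
    sample by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor-negative distance
    # Hinge: zero loss once the negative is margin farther than the positive
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

With identical anchor and positive embeddings and a distant negative, the loss is zero; when the negative collapses onto the anchor, the loss equals the margin.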

Keywords: DNA; metagenomics; pre-trained language model; transformer.


Figures

Figure 1
Motivation. Two types of complex relationships between gene sequences and functions in metagenomics. The one-to-many problem means that the same gene may display different functions depending on genomic context; for example, ATP synthase works differently in plants, heterotrophic bacteria, and humans. The many-to-one problem means that multiple genes may perform the same function; for instance, different genes from different bacteria, e.g., Cpf1 and Cas1, produce the same resistance function within the CRISPR immune system.
Figure 2
Overview of FGeneBERT. A metagenomic sequence formula image is converted into ordered protein-based gene representations formula image via a Context-Aware Tokenizer. Next, we pre-train a Gene Encoder with formula image; 15% of these tokens are masked to predict labels formula image. Meanwhile, we introduce formula image to distinguish gene sequences. The data augmentation and negative sampling modules generate positive samples formula image and negative samples formula image, respectively. Finally, after fine-tuning, FGeneBERT can handle various downstream tasks.
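The 15% masking step in this pipeline can be sketched roughly as follows. This is a hypothetical helper, not the paper's exact recipe: the mask token id, the -100 ignore index, and uniform sampling without replacement are all assumptions borrowed from common masked-language-modeling practice.

```python
import numpy as np

def mask_gene_tokens(token_ids, mask_id, mask_ratio=0.15, seed=0):
    """Corrupt a fraction of gene tokens for masked gene modeling.

    Returns (corrupted ids, per-position labels); unmasked positions
    carry -100 so a cross-entropy loss can ignore them.
    """
    rng = np.random.default_rng(seed)
    ids = np.asarray(token_ids).copy()
    n_mask = max(1, int(round(len(ids) * mask_ratio)))
    picked = rng.choice(len(ids), size=n_mask, replace=False)
    labels = np.full(len(ids), -100)
    labels[picked] = ids[picked]   # original ids become prediction targets
    ids[picked] = mask_id          # corrupt the input at the chosen positions
    return ids, labels
```

For a 20-token sequence at a 15% ratio, exactly three positions are replaced by the mask id, and only those three positions carry non-ignored labels.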
Figure 3
Framework of the Context-Aware Tokenizer. Gene sequences formula image extracted from a metagenomic sequence formula image are translated into amino acid sequences and encoded into ESM-2 representations formula image. These representations are then concatenated sequentially, every formula image representations forming one gene group formula image.
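The grouping step at the end of this caption can be sketched as below. The per-gene ESM-2 embeddings are stood in for by a plain array (running ESM-2 itself is out of scope here), and the group size parameter `k` and the remainder-dropping behavior are assumptions for illustration.

```python
import numpy as np

def group_gene_embeddings(gene_embs, k):
    """Concatenate every k consecutive per-gene embeddings into one group.

    gene_embs: array of shape (n_genes, d), e.g. pooled ESM-2 vectors
    (stubbed here). Returns shape (n_genes // k, k * d), dropping any
    trailing genes that do not fill a complete group.
    """
    n, d = gene_embs.shape
    n_groups = n // k
    # Keep only full groups, then fold each run of k vectors into one row
    return gene_embs[: n_groups * k].reshape(n_groups, k * d)
```

For example, 10 genes with 8-dimensional embeddings and k = 3 yield 3 group tokens of dimension 24, with the last gene discarded.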
Figure 4
Visualization of attention. The attention value is from the first head of the last (19th) attention layer. Darker shading indicates higher attention weight.
Figure 5
Ablation studies of our proposed modules on four downstream tasks.
Figure 6
t-SNE visualization of different embeddings for ATP synthases. Each dot denotes a sequence and is grouped according to function.
Figure 7
Comparative analysis of tokenization efficiency: time (s) vs. memory (MB). Each point denotes a specific dataset, with its size indicating the dataset's scale.
Figure 8
Sensitivity w.r.t. hyper-parameter formula image on the CARD dataset (AMR gene family).

