BMC Bioinformatics. 2025 Nov 26;26(1):285.
doi: 10.1186/s12859-025-06286-y.

Protein language models uncover carbohydrate-active enzyme function in metagenomics

Kumar Thurimella et al. BMC Bioinformatics.

Abstract

Background: The functional annotation of uncharacterized microbial enzymes from metagenomic data remains a significant challenge, limiting our understanding of microbial metabolic dynamics. Traditional annotation methods often rely on sequence homology, which can fail to identify remote homologs or enzymes with structural rather than sequence conservation. To address this gap, we developed CAZyLingua, the first annotation tool to use protein language models (pLMs) for the accurate classification of carbohydrate-active enzyme (CAZyme) families and subfamilies.

Results: CAZyLingua demonstrated high performance, maintaining precision and recall comparable to state-of-the-art hidden Markov model-based methods while outperforming purely sequence-based approaches. When applied to a metagenomic gene catalog from mother/infant pairs, CAZyLingua identified over 27,000 putative CAZymes missed by other tools, including horizontally transferred enzymes implicated in infant microbiome development. In datasets from patients with Crohn's disease and IgG4-related disease, CAZyLingua uncovered disease-associated CAZymes, highlighting an expansion of carbohydrate esterases (CEs) in IgG4-related disease. A CE17 enzyme predicted to be overabundant in Crohn's disease was functionally validated, confirming its catalytic activity on acetylated manno-oligosaccharides.

Conclusions: CAZyLingua is a powerful tool that effectively augments existing functional annotation pipelines for CAZymes. By leveraging the deep contextual information captured by pLMs, our method can uncover novel CAZyme diversity and reveal enzymatic functions relevant to health and disease, contributing to a further understanding of biological processes related to host health and nutrition.

Keywords: CAZymes; Crohn’s disease; Deep learning; Fibrosis; IgG4-related disease; Protein language models; Systemic sclerosis.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: R.J.X. is a co-founder of Jnana Therapeutics and Convergence Bio, scientific advisory board member at Nestlé, Magnet BioMedicine, and Arena BioWorks, and board director at MoonLake Immunotherapeutics. D.R.P. is an employee of Novozymes A/S, Denmark. These organizations had no roles in this study.

Figures

Fig. 1
CAZyLingua: a deep learning model for classifying proteins as CAZymes. a The CAZyLingua workflow starts with raw ProtT5 embeddings, which are passed through two classifiers that determine (1) whether the embedded protein is a CAZyme and, if so, (2) to which CAZyme family it belongs. b Training began with clustering the CAZy database at 60% sequence identity to remove redundancy, so that the model was trained on distinct CAZymes. Training used a cross-entropy loss that included a weighted balancing function to proportionally sample the representative sequences of each CAZyme class/family/subfamily in the database, so that highly represented families were not oversampled
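The two-stage workflow and the class-balanced loss described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0.5 threshold, and the inverse-frequency weighting scheme are all hypothetical, and the callables passed to `two_stage_predict` stand in for classifiers trained on ProtT5 embeddings.

```python
import math
from collections import Counter

def two_stage_predict(embedding, cazyme_prob, family_of, threshold=0.5):
    """Stage 1 gates on the CAZyme probability; stage 2 assigns a family.

    `cazyme_prob` and `family_of` are hypothetical callables standing in
    for the two trained classifiers; `threshold` is an assumed cutoff.
    """
    if cazyme_prob(embedding) < threshold:
        return None  # not predicted to be a CAZyme
    return family_of(embedding)

def inverse_frequency_weights(labels):
    """Assumed balancing scheme: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the total weight."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Toy data: GH13 is over-represented relative to CE17.
labels = ["GH13", "GH13", "GH13", "CE17"]
weights = inverse_frequency_weights(labels)
probs = [
    {"GH13": 0.9, "CE17": 0.1},
    {"GH13": 0.8, "CE17": 0.2},
    {"GH13": 0.7, "CE17": 0.3},
    {"GH13": 0.6, "CE17": 0.4},  # the rare class is misclassified
]
loss = weighted_cross_entropy(probs, labels, weights)
uniform = weighted_cross_entropy(probs, labels, {"GH13": 1.0, "CE17": 1.0})
```

In this toy set the single CE17 example carries three times the weight of each GH13 example, so its misclassification raises the weighted loss above the unweighted one.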
Fig. 2
CAZyLingua performance relative to the BLAST-based CAZyme annotation tool dbCAN2. The final CAZyLingua Random Forest (RF) model was benchmarked using gold-standard annotated genomes. a Head-to-head comparison of F1 scores for CAZyLingua (RF), dbCAN2 (HMM and DIAMOND), and CUPP on three initial test genomes. b ROC (top) and precision-recall (bottom) curves comparing the CAZyLingua RF classifier against the dbCAN2 HMM on predictions from the three benchmarked genomes, with the genomes held out of the training data. The RF model’s prediction probabilities and the dbCAN2 DIAMOND e-values were used as the respective target scores for generating the curves. c Performance of the CAZyLingua RF model across an expanded set of 12 gold-standard bacterial genomes, demonstrating robust precision, recall, and F1 scores across diverse taxa
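For reference, the precision, recall, and F1 scores used in this benchmark can be computed from predicted and gold-standard annotations as in the sketch below. The gene identifiers and family labels are invented for illustration, and treating each annotation as a (gene, family) pair is an assumption about how matches are counted.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted (gene, family) pairs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations, not taken from the benchmark genomes.
gold = {("g1", "GH13"), ("g2", "CE17"), ("g3", "GH33"), ("g4", "GT2")}
predicted = {("g1", "GH13"), ("g2", "CE17"), ("g5", "GH10")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
# Two of three predictions are correct and two of four gold
# annotations are recovered, giving F1 = 4/7.
```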
Fig. 3
Application of CAZyLingua to metagenomes from paired mothers and infants. a Comparison of CAZyLingua to eggNOG and dbCAN2 on a large metagenomics gene catalog from mothers and their infants. Time of the sample is in months relative to childbirth (month 0). Dotted lines represent no fold change. b CAZyLingua predicted 27,891 genes that dbCAN2 did not, shown by CAZy class for all infant and maternal samples at each sample month. Boxplots in a and b show medians and interquartile ranges (IQRs), with whiskers showing ± 1.5 IQR. c Predicted structures of proteins from CAZyLingua (red) and the protein embedding nearest neighbor (grey) structurally aligned, with TM scores and BLAST metrics, for GH88, GH10, and GH63
Fig. 4
CAZyLingua distinguishes GH33 CAZyme from nearest neighbors of raw ProtT5 embeddings. a tSNE of (left) ProtT5 embeddings from the GH33 and GH43_18 families and the CAZyme predicted by CAZyLingua (GH unknown) and (right) a segment of the last layer of CAZyLingua. b GH33 protein residues were mutated in a sliding window of ten residues over the entire sequence, and ProtT5 embeddings were generated for each sliding window mutation. Known features were overlaid along sections of the sequence. The probability of the CAZyLingua-predicted classification being a GH33 was calculated for each sliding window mutation (top). The predicted GH mapped to a PUL containing several regulatory elements consistent with a CAZyme (bottom left). BLAST metrics on the predicted GH signal peptide compared with GH33 and GH43_18 sequences (bottom right). c Overlays of the predicted GH protein structure generated using ColabFold with a sialidase (top) and a neuraminidase (bottom)
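The sliding-window mutation scan in panel b can be sketched as below. The caption does not state which residue was substituted or the step size, so the alanine replacement and the one-residue step are assumptions, and the sequence is a made-up placeholder. In the study, each mutant would then be embedded with ProtT5 and scored by the classifier, a step omitted here.

```python
def sliding_window_mutants(sequence, window=10, step=1, replacement="A"):
    """Yield (start, mutant) pairs with one window of residues replaced.

    `replacement` (alanine) and `step` are assumed; the ten-residue
    window follows the figure caption.
    """
    for start in range(0, len(sequence) - window + 1, step):
        mutant = (
            sequence[:start]
            + replacement * window
            + sequence[start + window:]
        )
        yield start, mutant

# Placeholder 25-residue sequence, not a real GH33 protein.
sequence = "MKTLLVGSWDERNQHYFPCIMKTLV"
mutants = list(sliding_window_mutants(sequence))
```

Each mutant keeps the original length, so the classifier's per-window probability can be plotted against the window start position.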
Fig. 5
Application of CAZyLingua to CAZymes in metagenomes of patients with inflammatory and fibrosis-prone diseases. a Genes enriched and depleted in the gene catalogs of patients with IgG4-RD selected on the fringe of the volcano plot (see Methods for labeling criteria). b Predicted CEs in the enriched IgG4-RD gene set, stratified to analyze only the genes CAZyLingua predicted. c Genes enriched and depleted in the gene catalogs of patients with CD selected on the fringe of the volcano plot (see Methods for labeling criteria). CE17 is highlighted in the circle. d The enriched genes in CD predicted by CAZyLingua only were prioritized based on a combination of the log fold change and the probability of the CAZyme annotation from CAZyLingua. The plot is ordered from the highest fold change and CAZyLingua prediction probability (red) to the lowest fold change and prediction probability (blue). CE17 is highlighted in bold. e Functional characterization of CE17 using MALDI-ToF MS. Peaks are labeled by degree of polymerization (DP) and number of acetyl (Ac) groups. The annotated m/z values indicate sodium adducts. Intensity is shown in arbitrary units (a.u.). Both the KTCE17 enzyme (middle) and a previously validated CE17, FpCE17 (bottom, [64]) showed the same activity on a RiGH26-pretreated β-mannan substrate, with disappearance of peaks signifying double and triple acetylated oligosaccharides, and decrease in the intensities of peaks signifying mono-acetylated oligosaccharides (containing 3-O-acetylations) and accumulation of deacetylated oligosaccharides
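The peak labels in panel e combine a degree of polymerization (DP) with a number of acetyl groups (Ac), and the expected m/z of each sodium adduct follows from simple mass arithmetic. The sketch below assumes singly charged ions and oligosaccharides built only from hexose units (as in a mannan backbone); it uses standard monoisotopic masses and is not taken from the paper's methods.

```python
# Monoisotopic masses in daltons.
HEXOSE_RESIDUE = 162.052825  # C6H10O5, one anhydrohexose unit
WATER = 18.010565            # H2O added once for the chain ends
ACETYL_SHIFT = 42.010565     # C2H2O gained per acetyl group
SODIUM_CATION = 22.989221    # Na minus one electron

def sodium_adduct_mz(dp, n_acetyl=0):
    """Expected m/z of [M+Na]+ for a hexose oligosaccharide.

    Assumes a singly charged ion and hexose-only composition.
    """
    if dp < 1 or n_acetyl < 0:
        raise ValueError("dp must be >= 1 and n_acetyl must be >= 0")
    neutral_mass = dp * HEXOSE_RESIDUE + WATER + n_acetyl * ACETYL_SHIFT
    return neutral_mass + SODIUM_CATION

# A DP3 oligosaccharide without acetyl groups gives m/z of about 527.16;
# each acetyl group shifts the peak by about 42.01.
mz_dp3 = sodium_adduct_mz(3)
mz_dp3_ac1 = sodium_adduct_mz(3, n_acetyl=1)
```

Under these assumptions, deacetylation appears in the spectrum as peaks moving down in steps of about 42.01 m/z within each DP series.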

References

    1. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
    2. Almeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39(1):105–14.
    3. Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell. 2019;176(3):649-662.e20.
    4. Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng JF, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499(7459):431–7.
    5. Perdigão N, Heinrich J, Stolte C, Sabir KS, Buckley MJ, Tabor B, et al. Unexpected features of the dark proteome. Proc Natl Acad Sci USA. 2015;112(52):15898–903.
