This is a preprint.
Protein Language Models Uncover Carbohydrate-Active Enzyme Function in Metagenomics
- PMID: 37961379
- PMCID: PMC10634757
- DOI: 10.1101/2023.10.23.563620
Update in
- Protein language models uncover carbohydrate-active enzyme function in metagenomics. BMC Bioinformatics. 2025 Nov 26;26(1):285. doi: 10.1186/s12859-025-06286-y. PMID: 41299229. Free PMC article.
Abstract
In metagenomics, the pool of uncharacterized microbial enzymes presents a challenge for functional annotation. Among these, carbohydrate-active enzymes (CAZymes) stand out due to their pivotal roles in various biological processes related to host health and nutrition. Here, we present CAZyLingua, the first tool that harnesses protein language model embeddings to build a deep learning framework that facilitates the annotation of CAZymes in metagenomic datasets. Our benchmarking results showed a higher average F1 score (the harmonic mean of precision and recall) on the annotated genomes of Bacteroides thetaiotaomicron, Eggerthella lenta and Ruminococcus gnavus than the traditional sequence homology-based method in dbCAN2. We applied our tool to a paired mother/infant longitudinal dataset and revealed unannotated CAZymes linked to microbial development during infancy. When applied to metagenomic datasets derived from patients affected by fibrosis-prone diseases such as Crohn's disease and IgG4-related disease, CAZyLingua uncovered CAZymes associated with disease and healthy states. In each of these metagenomic catalogs, CAZyLingua discovered new annotations that were previously overlooked by traditional sequence homology tools. Overall, the deep learning model CAZyLingua can be applied in combination with existing tools to unravel intricate CAZyme evolutionary profiles and patterns, contributing to a more comprehensive understanding of microbial metabolic dynamics.
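For readers unfamiliar with the benchmarking metric, the F1 score mentioned in the abstract is conventionally defined as the harmonic mean of precision and recall. The following is a minimal sketch of that standard definition only. The function name `f1_score` and the counts are hypothetical illustrations, not values or code from the paper or from CAZyLingua.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives and
    false negatives for one annotation task.
    """
    if tp == 0:
        # With no true positives, F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)  # fraction of predicted annotations that are correct
    recall = tp / (tp + fn)     # fraction of true annotations that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen for illustration; not taken from the paper.
example = f1_score(tp=80, fp=20, fn=10)
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a tool cannot score well by maximizing precision at the expense of recall or the reverse.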