[Preprint]. 2023 Oct 25:2023.10.23.563620.
doi: 10.1101/2023.10.23.563620.

Protein Language Models Uncover Carbohydrate-Active Enzyme Function in Metagenomics


Kumar Thurimella et al. bioRxiv.


Abstract

In metagenomics, the pool of uncharacterized microbial enzymes presents a challenge for functional annotation. Among these, carbohydrate-active enzymes (CAZymes) stand out due to their pivotal roles in various biological processes related to host health and nutrition. Here, we present CAZyLingua, the first tool that harnesses protein language model embeddings to build a deep learning framework that facilitates the annotation of CAZymes in metagenomic datasets. Our benchmarking results showed on average a higher F1 score (the harmonic mean of precision and recall) on the annotated genomes of Bacteroides thetaiotaomicron, Eggerthella lenta and Ruminococcus gnavus compared to the traditional sequence homology-based method in dbCAN2. We applied our tool to a paired mother/infant longitudinal dataset and revealed unannotated CAZymes linked to microbial development during infancy. When applied to metagenomic datasets derived from patients affected by fibrosis-prone diseases such as Crohn's disease and IgG4-related disease, CAZyLingua uncovered CAZymes associated with disease and healthy states. In each of these metagenomic catalogs, CAZyLingua discovered new annotations that were previously overlooked by traditional sequence homology tools. Overall, the deep learning model CAZyLingua can be applied in combination with existing tools to unravel intricate CAZyme evolutionary profiles and patterns, contributing to a more comprehensive understanding of microbial metabolic dynamics.


Figures

Extended Data Figure 1. Embedding weights from the first layer to the next show no interpretable chemical features.
We extracted the weight matrix W of the CAZyLingua multiclass classifier between the input layer and the first hidden layer, a matrix of dimension 1024×256. Multiplying W by its transpose, W · W^T, produced a symmetric matrix S of dimensions 1024×1024. Taking diag(S) yielded a vector of length 1024, the size of the original ProtT5 embedding. We plotted the values of this vector to visualize whether any features or positions in specific regions of the embedding are specific to CAZymes.
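The diagnostic described in this caption can be sketched in a few lines of NumPy. A random matrix stands in for the trained weights here (the real W would come from CAZyLingua's first linear layer), and note that diag(W · W^T) never requires materializing the full 1024×1024 matrix:

```python
import numpy as np

# Random stand-in for the trained 1024x256 weight matrix between the
# ProtT5 embedding (input) layer and the first hidden layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 256))

S = W @ W.T            # symmetric 1024x1024 matrix
per_dim = np.diag(S)   # length-1024 vector: one value per embedding position

# Equivalent, cheaper form: diag(W W^T)[i] is the squared L2 norm of row i,
# so S need not be built at all.
per_dim_fast = (W ** 2).sum(axis=1)
assert np.allclose(per_dim, per_dim_fast)
```

Plotting `per_dim` against position then shows whether any region of the 1024-dimensional embedding carries disproportionate weight, which is the check the caption describes.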
Extended Data Figure 2. Training runs for finding the best model.
RayTune ran 20 models in parallel over each epoch and pruned any models that began to stagnate or decline in training accuracy. Models were evaluated on the metric of minimizing training loss, and the model with the minimal loss was stored as a checkpoint. Training ran for 100 epochs, with metrics logged to TensorBoard, which produced these visualizations.
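The prune-the-stragglers strategy RayTune applies here can be illustrated with a minimal pure-Python stand-in (not the Ray API; the loss function, learning-rate range, and pruning schedule below are all invented for illustration):

```python
import random

def run_epoch(lr, epoch):
    # Hypothetical stand-in for one training epoch: a fake loss that
    # decays faster for better learning rates, plus a little noise.
    return 1.0 / (1.0 + lr * (epoch + 1)) + random.random() * 0.01

def tune(num_trials=20, num_epochs=100, keep_fraction=0.5, prune_every=10):
    random.seed(0)
    trials = [{"lr": random.uniform(0.01, 1.0), "loss": float("inf")}
              for _ in range(num_trials)]
    for epoch in range(num_epochs):
        for t in trials:
            t["loss"] = run_epoch(t["lr"], epoch)
        # Periodically drop the worse-performing half, as a pruning
        # scheduler would, so compute concentrates on promising trials.
        if epoch % prune_every == prune_every - 1 and len(trials) > 1:
            trials.sort(key=lambda t: t["loss"])
            trials = trials[: max(1, int(len(trials) * keep_fraction))]
    return min(trials, key=lambda t: t["loss"])  # best "checkpoint"

best = tune()
```

In the real run, each trial is a full model training job and the checkpoint with minimal training loss is the one kept, as the caption states.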
Figure 1. CAZyLingua: a deep learning model used for the classification of proteins as CAZymes.
a) The CAZyLingua workflow starts with raw embeddings from ProtT5, which are passed through two classifiers to distinguish 1) whether the embedded protein is a CAZyme and, if so, 2) which CAZyme family it belongs to. b) The training strategy for CAZyLingua began with clustering at 60% sequence identity to remove redundancy from the CAZy database, so that training used distinct CAZymes. A cross-entropy loss with a weighted balancing function was applied to proportionally sample the number of representative sequences per CAZyme class/family/subfamily in the database, so as not to oversample highly represented families.
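A class-weighted cross-entropy of the kind described in panel b can be sketched as follows; this is an illustrative NumPy re-implementation (the function name, the inverse-frequency weighting formula, and the toy counts are assumptions, not CAZyLingua's actual code):

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_counts):
    """Cross-entropy with per-class weights inversely proportional to
    class frequency, so rare CAZyme families are not drowned out by
    highly represented ones. Illustrative only."""
    counts = np.asarray(class_counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)  # rare class -> big weight
    # log-softmax, shifted for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -weights[labels] * log_probs[np.arange(len(labels)), labels]
    return per_sample.sum() / weights[labels].sum()

# Toy batch: two samples, three "families" with very skewed counts.
logits = np.array([[2.0, 0.1, -1.0], [0.2, 1.5, 0.0]])
labels = np.array([0, 1])
loss = weighted_cross_entropy(logits, labels, class_counts=[900, 90, 10])
```

With equal `class_counts` the weights are all 1 and this reduces to the plain mean cross-entropy, which is the property the balancing term is meant to restore for imbalanced families.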
Figure 2. CAZyLingua performance relative to the BLAST-based CAZyme annotation tool dbCAN2.
CAZyLingua was compared to the dbCAN2 DIAMOND+CAZy annotation tool option (benchmarked with an e-value < 1×10^−102). Following a procedure similar to dbCAN2's benchmark, three bacterial strains with manual annotations and varying CAZyme counts per strain were chosen. a) For predictions by CAZyLingua only, dbCAN2 only, and shared between the two methods, the proportion of correct predictions made by each method (left) and the proportion of true CAZymes recovered by each method (right) were calculated. b) F1 scores (harmonic means of precision and recall) of all CAZyLingua predictions, all dbCAN2 predictions, and all predictions combined, whether shared between the methods or not. c) Ground truth CAZymes were stratified by class, and the percentage of accurate predictions per CAZy class from our Quadratic Discriminant Analysis (QDA) binary classifier was calculated. d) Precision/recall (left) and ROC (right) curves comparing CAZyLingua to dbCAN2. The output of the decision function of the boundary that was trained for CAZyLingua and the e-value for dbCAN2 were used as target scores.
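The precision, recall, and F1 quantities in panels a, b, and d all follow from thresholding a target score (a decision-function value for CAZyLingua, an e-value for dbCAN2). A minimal sketch with invented scores and labels:

```python
import numpy as np

def pr_f1_at_threshold(scores, truth, threshold):
    """Precision, recall, and F1 for CAZyme calls whose score is at
    least `threshold`. Sweeping the threshold traces the panel-d
    precision/recall curve. Names and data are illustrative."""
    pred = scores >= threshold
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

scores = np.array([0.9, 0.8, 0.6, 0.4, 0.2])   # toy decision-function values
truth = np.array([True, True, False, True, False])
p, r, f1 = pr_f1_at_threshold(scores, truth, 0.5)  # p = r = f1 = 2/3
```

F1 being the harmonic mean means it is pulled toward the worse of precision and recall, which is why it is the headline metric when the two trade off against each other.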
Figure 3. Application of CAZyLingua to metagenomes in paired mothers and infants.
a) Comparison of CAZyLingua to eggNOG and dbCAN2 on a large metagenomics gene catalog from mothers and their infants. Time of the sample is in months relative to childbirth (month 0). Dotted lines represent no fold change. b) CAZyLingua predicted 27,133 genes that dbCAN2 did not, shown by CAZy class for all infant and maternal samples at each sample month. Boxplots in a and b show medians and interquartile ranges (IQRs), with whiskers showing ± 1.5 IQR. c) Predicted structures of proteins from CAZyLingua (red) and their protein-embedding nearest neighbors (grey) were structurally aligned, with TM scores and BLAST metrics shown for GH88, GH10, and GH63.
Figure 4. CAZyLingua distinguishes a GH33 CAZyme from nearest neighbors of raw ProtT5 embeddings.
a) tSNE of (left) ProtT5 embeddings from the GH33 and GH43_18 families and the CAZyme predicted by CAZyLingua (GH unknown) and (right) a segment of the last layer of CAZyLingua. b) GH33 protein residues were mutated in a sliding window of ten residues over the entire sequence, and ProtT5 embeddings were generated for each sliding-window mutant. Known features are overlaid along sections of the sequence. The probability of the CAZyLingua-predicted classification being a GH33 was calculated for each sliding-window mutant (top). The predicted GH mapped to a PUL containing several regulatory elements consistent with a CAZyme (bottom left). BLAST metrics compare the predicted GH signal peptide with GH33 and GH43_18 sequences (bottom right). c) Overlays of the predicted GH protein structure generated using ColabFold with a sialidase (top) and a neuraminidase (bottom).
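The sliding-window mutation scan in panel b amounts to generating one mutant sequence per window position, then re-embedding and re-scoring each. A minimal sketch of the mutant-generation step (the alanine replacement and the toy sequence are assumptions; the paper does not specify the substituted residue here):

```python
def sliding_window_mutants(seq, window=10, replacement="A"):
    """Yield (start, mutant) pairs where each ten-residue window of
    `seq` is replaced. In the real scan, each mutant would then be
    embedded with ProtT5 and scored by the CAZyLingua classifier to
    give the per-position GH33 probability trace."""
    for start in range(len(seq) - window + 1):
        yield start, seq[:start] + replacement * window + seq[start + window:]

seq = "MKTLLVLSALAGLAQAHSMT"  # toy 20-residue sequence
mutants = list(sliding_window_mutants(seq))
# a 20-residue sequence yields 20 - 10 + 1 = 11 mutants, all length 20
```

Dips in the classifier probability at particular window positions then point to sequence regions the model relies on, which is how the overlaid known features can be cross-checked.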
Figure 5. Application of CAZyLingua to CAZymes in metagenomes of patients with inflammatory and fibrosis-prone diseases.
Genes enriched and depleted in the gene catalogs of patients with a) CD and b) IgG4-RD selected on the fringe of the volcano plot (see Methods for labeling criteria). c) Predicted CEs in the enriched IgG4-RD gene set, stratified to analyze only the genes CAZyLingua predicted. d) The proportion of dbCAN2-predicted CAZymes also predicted by CAZyLingua as the CAZyme/non-CAZyme decision threshold of the QDA classifier in CAZyLingua was varied. The Venn diagram shows the numbers of CAZymes predicted by CAZyLingua, dbCAN2, and both on our current model benchmarks of the QDA.
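The threshold sweep in panel d can be sketched directly: for each candidate decision threshold, compute the fraction of dbCAN2's calls that the QDA decision function also accepts. This is an illustrative reconstruction with invented scores, not the paper's code:

```python
import numpy as np

def overlap_vs_threshold(scores, dbcan2_hits, thresholds):
    """For each threshold t, the fraction of dbCAN2-called CAZymes
    whose CAZyLingua decision-function score is >= t."""
    scores = np.asarray(scores, dtype=float)
    hits = np.asarray(dbcan2_hits, dtype=bool)
    return [float(np.mean(scores[hits] >= t)) for t in thresholds]

scores = np.array([2.1, 1.4, 0.3, -0.5, 1.9, -1.2])  # toy QDA scores
dbcan2 = np.array([True, True, True, False, True, False])
frac = overlap_vs_threshold(scores, dbcan2, thresholds=[-1.0, 0.0, 1.0, 2.0])
# fractions are non-increasing as the threshold rises
```

The curve this produces shows how aggressively the QDA boundary can be tightened before CAZyLingua starts dropping calls that dbCAN2 makes, which is what the Venn diagram summarizes at the chosen operating point.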
