J Biol Chem. 2021 Aug;297(2):100931.
doi: 10.1016/j.jbc.2021.100931. Epub 2021 Jul 1.

Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases

Japheth E Gado et al. J Biol Chem. 2021 Aug.

Abstract

Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and industrially. These enzymes are often bimodular, including a catalytic domain and carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-oligomers of up to ten glucosyl moieties. GH7 cellulases consist of two major subtypes: cellobiohydrolases (CBH) and endoglucanases (EG). Despite the critical importance of GH7 enzymes, there remain gaps in our understanding of how GH7 sequence and structure relate to function. Here, we employed machine learning to gain data-driven insights into relationships between sequence, structure, and function across the GH7 family. Machine-learning models, trained only on the number of residues in the active-site loops as features, were able to discriminate GH7 CBHs and EGs with up to 99% accuracy, demonstrating that the lengths of loops A4, B2, B3, and B4 strongly correlate with functional subtype across the GH7 family. Classification rules were derived such that specific residues at 42 different sequence positions each predicted the functional subtype with accuracies surpassing 87%. A random forest model trained on residues at 19 positions in the catalytic domain predicted the presence of a CBM with 89.5% accuracy. Our machine learning results recapitulate, as top-performing features, a substantial number of the sequence positions determined by previous experimental studies to play vital roles in GH7 activity. We surmise that the yet-to-be-explored sequence positions among the top-performing features also contribute to GH7 functional variation and may be exploited to understand and manipulate function.

Keywords: Trichoderma reesei; bioinformatics; cellulase; glycoside hydrolase; statistics; tryptophan.

Conflict of interest statement

The authors declare that they have no conflicts of interest with the contents of this article.

Figures

Figure 1
Structures of a typical GH7 CBH and EG in complex with a cellononaose ligand. A, the CBH (left), Trichoderma reesei Cel7A (TreCel7A, PDB code: 4C4C) (23), and the EG (right), Trichoderma reesei Cel7B (TreCel7B, PDB code: 1EG1) (26). The eight active-site loops (A1–A4 and B1–B4) are shown in red. In the CBH, the active site is tunnel-like, whereas in the EG it is more open and groove-like. B, glycosyl binding sites are numbered from the nonreducing end at the active-site tunnel entrance (–7) to the reducing end (+2), where the cellobiose product exits the active site. Bond cleavage occurs between the –1 and +1 subsites.
Figure 2
Discrimination of GH7 CBHs and EGs with hidden Markov models (HMMs). A, 5-fold cross-validation technique for evaluating the performance of the HMMs. The MSA is split into CBH and EG subalignments, and each subalignment is split into five folds. HMMs are repeatedly trained on four folds and then tested on the left-out fold. The predicted class (CBH or EG) of a sequence is the class that yields the highest HMM alignment score. B, performance of the HMMs on the dataset of 44 GH7s from the manually curated UniProtKB/SwissProt database. C, performance of the HMMs on the dataset of 427 GH7s from the NCBI nonredundant database. Only two EG sequences (GenBank accession codes: AGY80096.1 and AGY80097.1) were misclassified in the NCBI dataset. Note that in B and C, the assigned sequence numbers (x-axes) are arbitrary.
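The decision rule in Figure 2A (assign a sequence to the class whose model scores it highest) can be illustrated with a much simpler stand-in for a profile HMM. The sketch below is not the authors' method: it uses a per-column log-frequency profile with pseudocounts, has no insert or delete states, and assumes test sequences are already aligned to the same columns. The toy alignments and all function names are hypothetical.

```python
from math import log

# Assumption: 20 amino acids plus the gap symbol; real alignments may need more.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def train_profile(aligned, pseudocount=1.0):
    """Per-column log-probabilities of each symbol (a simplified stand-in for an HMM)."""
    profile = []
    for col in range(len(aligned[0])):
        counts = {sym: pseudocount for sym in ALPHABET}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append({sym: log(c / total) for sym, c in counts.items()})
    return profile

def score(profile, seq):
    """Sum of per-column log-probabilities for an aligned sequence."""
    return sum(col[res] for col, res in zip(profile, seq))

def classify(seq, profiles):
    """Return the class whose profile gives the highest score."""
    return max(profiles, key=lambda name: score(profiles[name], seq))

# Hypothetical toy subalignments standing in for the CBH and EG training folds.
profiles = {
    "CBH": train_profile(["WCA", "WCA", "WCG"]),
    "EG": train_profile(["ACA", "SCA", "ACG"]),
}
predicted = classify("WCA", profiles)
```

In the cross-validation scheme of the caption, the profiles would be rebuilt from four folds of each subalignment and `classify` applied to the sequences in the left-out fold.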
Figure 3
Generating features for discriminating GH7 CBHs and EGs with ML. A, segments of a selection of six well-studied GH7s from the structure-based sequence alignment of 1748 sequences showing the active-site loops. The sequences include the CBHs: Trichoderma reesei Cel7A (TreCel7A) (23), Penicillium funiculosum Cel7A (PfuCel7A) (20), and Phanerochaete chrysosporium Cel7D (PchCel7D) (19); and the EGs: Trichoderma reesei Cel7B (TreCel7B) (26), Fusarium oxysporum Cel7B (FoxCel7B) (24), and Humicola insolens Cel7B (HinCel7B) (25). B, the number of residues in the eight active-site loops as determined from the structure-based alignment. C, procedure for generating features for 1748 GH7s. First, the sequences are aligned as in (A). Then, a count of the number of residues in each loop is obtained. Residue counts are scaled to Z-scores before ML is applied.
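The final step of Figure 3C, scaling residue counts to Z-scores, can be sketched as follows. The loop lengths below are hypothetical values, not data from the paper, and the use of the population standard deviation is an assumption (the caption does not say which estimator was used).

```python
from statistics import mean, pstdev

LOOPS = ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"]

def zscore_columns(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column carries no information; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical residue counts: one row per sequence, one column per loop in LOOPS.
loop_lengths = [
    [10, 7, 12, 9, 8, 11, 14, 10],
    [10, 7, 12, 9, 8, 11, 14, 10],
    [9, 7, 10, 3, 8, 4, 6, 3],
    [11, 7, 10, 3, 8, 4, 6, 3],
]
features = zscore_columns(loop_lengths)
```

Each row of `features` is then an eight-element vector that could be passed to a classifier.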
Figure 4
Procedure for evaluating the performance of ML models using 100 repetitions of 5-fold cross-validation with undersampling. The dataset is reshuffled and resampled in each repetition.
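A minimal sketch of such an evaluation loop is given below. It assumes that undersampling reduces every class to the size of the smallest class before the folds are drawn, and that folds are not stratified; the caption does not specify either detail. The function names are hypothetical.

```python
import random

def undersample(labels, rng):
    """Return indices with every class randomly reduced to the minority-class size."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, n_min))
    return chosen

def repeated_kfold_undersampled(labels, n_repeats=100, k=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), resampling and reshuffling each repeat."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = undersample(labels, rng)
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield rep, f, train, folds[f]

# Class sizes taken from the Figure 5 caption (1306 CBHs, 442 EGs).
labels = ["CBH"] * 1306 + ["EG"] * 442
rep, fold, train_idx, test_idx = next(repeated_kfold_undersampled(labels))
```

A model would be fitted on `train_idx` and scored on `test_idx` for each of the 500 yielded splits, giving the distribution of scores summarized in the box plots.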
Figure 5
Predictive performance and variation of active-site loops in GH7s. A, Matthews’ correlation coefficient (MCC) values of four ML algorithms trained separately on the length of each active-site loop and on all eight loops together. The A4, B2, B3, and B4 loops achieve near-perfect performance in discriminating 1748 GH7 CBHs and EGs. Box and whisker plots indicate the distribution of MCC values over 100 repetitions of 5-fold cross-validation (center line: median, box limits: upper/lower quartiles, whiskers: full data range). B, distribution of the lengths of active-site loops in 1306 GH7 CBHs and 442 GH7 EGs. Box and whisker plots are as in (A). C, the relative standard deviation of the lengths of the eight active-site loops. Generally, variation in the length of a loop correlates with the predictive performance of the loop as an ML feature. D, rules derived from the single-node decision trees trained on the lengths of the A4, B2, B3, and B4 loops. The accuracies of the rules for CBHs and EGs (i.e., the sensitivity and specificity, respectively) are shown in brackets.
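The MCC and the single-node (one-threshold) rules of panel D can be illustrated with the sketch below. It is a simplified stand-in: the threshold is chosen by maximizing MCC rather than by the impurity criterion a decision-tree library would use, and it assumes that longer loops indicate a CBH. The loop lengths are hypothetical.

```python
from math import sqrt

def mcc(y_true, y_pred, positive="CBH"):
    """Matthews correlation coefficient for a binary labelling."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def fit_stump(lengths, labels, positive="CBH", negative="EG"):
    """Find the threshold t that maximizes MCC for the rule 'length >= t -> positive'."""
    best_t, best_score = None, -2.0
    for t in sorted(set(lengths)):
        pred = [positive if x >= t else negative for x in lengths]
        s = mcc(labels, pred, positive)
        if s > best_score:
            best_t, best_score = t, s
    return best_t, best_score

# Hypothetical lengths of one loop in four CBHs and four EGs.
lengths = [9, 10, 9, 11, 3, 4, 3, 5]
labels = ["CBH"] * 4 + ["EG"] * 4
threshold, score = fit_stump(lengths, labels)
```

On this toy data the two classes separate cleanly, so the fitted rule reads "at least 9 residues implies CBH".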
Figure 6
Pearson’s correlation coefficient between the lengths of the eight active-site loops in 1748 GH7s. The matrix of correlation coefficients is clustered so that loops with a similar pattern of correlation are grouped together. There is a high degree of positive correlation (darker red) between the lengths of the A4, B2, B3, and B4 loops.
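As a sketch of how such a matrix is computed, the code below evaluates Pearson's correlation coefficient for every pair of loops. The loop lengths are hypothetical, and the clustering of the matrix described in the caption is not reproduced here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / sqrt(vx * vy)

# Hypothetical loop lengths for five sequences (not data from the paper).
loop_lengths = {
    "A1": [10, 9, 11, 10, 9],
    "A4": [9, 9, 3, 3, 9],
    "B2": [11, 11, 4, 4, 10],
}
names = sorted(loop_lengths)
matrix = {
    a: {b: pearson(loop_lengths[a], loop_lengths[b]) for b in names}
    for a in names
}
```

Here `matrix["A4"]["B2"]` is close to 1 because the two hypothetical loops lengthen and shorten together.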
Figure 7
Top-performing position-specific classification rules for discriminating GH7 CBHs and EGs. A, amino acid distribution of GH7 CBHs and EGs at position 40 (TreCel7A numbering). Position 40 is strongly conserved as Trp in GH7 CBHs but not in EGs. B, MCC scores of 1799 position-specific classification rules derived from the MSA. The top 90 rules have MCC scores of 0.73 or greater. C, histogram of the minimum distance between the cellononaose ligand in TreCel7A (PDB code: 4C4C) (23) and the positions from which the top 90 classification rules are derived. More than half of the top 90 rules are derived from positions within 5 Å of the substrate. D, alpha carbons of the 42 positions from which the top 90 classification rules are derived, shown on the structure of TreCel7A. Most of these positions are near the substrate sites toward the nonreducing end (NRE). E, posterior view of the crystal structure.
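A position-specific rule of the kind shown in panel A ("Trp at position 40 implies CBH") can be scored with the sketch below. It assumes that each rule has the form "residue X at this alignment column implies CBH" and is ranked by MCC; the exact rule format used in the paper may differ. The alignment column is hypothetical.

```python
from math import sqrt

def rule_mcc(column, labels, residue, positive="CBH"):
    """MCC of the rule 'sequence has `residue` in this column -> positive class'."""
    pairs = list(zip(column, labels))
    tp = sum(1 for r, y in pairs if r == residue and y == positive)
    fp = sum(1 for r, y in pairs if r == residue and y != positive)
    fn = sum(1 for r, y in pairs if r != residue and y == positive)
    tn = sum(1 for r, y in pairs if r != residue and y != positive)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_rule(column, labels):
    """Return (residue, mcc) for the best-scoring residue observed in the column."""
    scored = [(aa, rule_mcc(column, labels, aa)) for aa in sorted(set(column))]
    return max(scored, key=lambda item: item[1])

# Hypothetical alignment column: residues of five CBHs followed by five EGs.
column = ["W", "W", "W", "W", "Y", "A", "G", "S", "W", "T"]
labels = ["CBH"] * 5 + ["EG"] * 5
residue, score = best_rule(column, labels)
```

Repeating `best_rule` over every column of the alignment and sorting by score would give a ranking comparable in spirit to panel B.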
Figure 8
Conserved aromatic residues in the active site of TreCel7A (PDB code: 4C4C) within 6 Å of the cellononaose ligand. Residues in magenta are conserved (>66% frequency) in both GH7 CBHs and EGs and are found close to the catalytic center between the –1 and +1 glycosyl subsites. Residues in blue are conserved in GH7 CBHs but not in EGs and flank the catalytic center.
Figure 9
Top-performing features of the random forest classifier in predicting the presence of CBMs in GH7s. A, relative importance (Gini) of all 5933 features derived from one-hot encoding of the MSA. Most features provide little information to the model. B, relative importance (Gini) of the top 20 features in the random forest classifier retrained on only the top 20 features. Box and whisker plots indicate the distribution over 100 repetitions of 5-fold cross-validation (center line: median, box limits: upper/lower quartiles, whiskers: full data range). C, residues of the top 20 features (green sticks) shown on the structure of TreCel7A (tan cartoon) on cellulose (gray sticks). The structure is derived from a snapshot (t = 0.73 μs) of MD simulations conducted in a previous work (97).
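One-hot encoding of an MSA, the feature construction named in panel A, can be sketched as follows. The sketch assumes one binary feature per (column, symbol) pair actually observed in the alignment, with the gap character treated as a symbol; the paper's exact encoding is not given in the caption. The toy alignment is hypothetical.

```python
def one_hot_msa(sequences):
    """Return (feature_names, matrix) with one binary feature per observed (column, symbol)."""
    n_cols = len(sequences[0])
    feature_names = []
    for col in range(n_cols):
        # Only symbols seen at this column become features (an assumption).
        for sym in sorted({seq[col] for seq in sequences}):
            feature_names.append((col, sym))
    matrix = [
        [1 if seq[col] == sym else 0 for col, sym in feature_names]
        for seq in sequences
    ]
    return feature_names, matrix

# Hypothetical three-column alignment of three sequences.
msa = ["WC-", "WCA", "YCA"]
feature_names, matrix = one_hot_msa(msa)
```

The resulting matrix could be given to a random forest implementation (for example, scikit-learn's `RandomForestClassifier`, whose `feature_importances_` attribute reports impurity-based importances) to rank features as in panels A and B.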
Figure 10
Frequency of Cys at positions forming disulfide bonds in GH7 sequences. Cys positions (x-axis) are labeled using TreCel7A numbering, and the frequencies were determined from the structure-based MSA (1748 sequences). GH7 sequences may have up to ten disulfide bonds, nine of which are present in roughly 80% or more of the sequences. A rare disulfide bond, formed by C4 and C72 in TreCel7A, is present in less than 10% of GH7 sequences and is virtually absent in EGs. Overall, disulfide bonds are more prevalent in GH7 CBHs than in EGs.
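The frequencies plotted in Figure 10 amount to counting Cys at given alignment columns. A minimal sketch is shown below; the toy alignment and column indices are hypothetical, and Cys at both columns is treated as a necessary but not sufficient indicator of a disulfide bond.

```python
def residue_frequency(sequences, column, residue="C"):
    """Fraction of aligned sequences with `residue` at the given column."""
    return sum(1 for seq in sequences if seq[column] == residue) / len(sequences)

def paired_cys_frequency(sequences, col_a, col_b):
    """Fraction of sequences with Cys at both columns of a putative disulfide pair."""
    hits = sum(1 for seq in sequences if seq[col_a] == "C" and seq[col_b] == "C")
    return hits / len(sequences)

# Hypothetical four-column alignment; columns 0 and 3 stand in for a Cys pair.
msa = ["CAAC", "CA-C", "SAAC", "CAAS"]
freq_first = residue_frequency(msa, 0)
freq_pair = paired_cys_frequency(msa, 0, 3)
```

Computing these values separately for the CBH and EG subsets would allow the comparison made in the last sentence of the caption.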
