Proc Natl Acad Sci U S A. 2023 Sep 26;120(39):e2303590120.

doi: 10.1073/pnas.2303590120. Epub 2023 Sep 20.

Prediction and design of protease enzyme specificity using a structure-aware graph convolutional network

Changpeng Lu¹, Joseph H Lubin², Vidur V Sarma¹, Samuel Z Stentz³, Guanyang Wang⁴, Sijian Wang¹,⁴, Sagar D Khare¹,²

Affiliations

¹ Institute for Quantitative Biomedicine, Rutgers-The State University of New Jersey, Piscataway, NJ 08854.
² Department of Chemistry and Chemical Biology, Rutgers-The State University of New Jersey, Piscataway, NJ 08854.
³ Verily Life Sciences, Boulder, CO 80302.
⁴ Department of Statistics, Rutgers-The State University of New Jersey, Piscataway, NJ 08854.

PMID: 37729196
PMCID: PMC10523478
DOI: 10.1073/pnas.2303590120

Changpeng Lu et al. Proc Natl Acad Sci U S A. 2023.

Abstract

Site-specific proteolysis by the enzymatic cleavage of small linear sequence motifs is a key posttranslational modification involved in physiology and disease. The ability to robustly and rapidly predict protease-substrate specificity would also enable targeted proteolytic cleavage by designed proteases. Current methods for predicting protease specificity are limited to sequence pattern recognition in experimentally derived cleavage data obtained for libraries of potential substrates and generated separately for each protease variant. We reasoned that a more semantically rich and robust model of protease specificity could be developed by incorporating the energetics of molecular interactions between protease and substrates into machine learning workflows. We present Protein Graph Convolutional Network (PGCN), which develops a physically grounded, structure-based molecular interaction graph representation that describes molecular topology and interaction energetics to predict enzyme specificity. We show that PGCN accurately predicts the specificity landscapes of several variants of two model proteases. Node and edge ablation tests identified key graph elements for specificity prediction, some of which are consistent with known biochemical constraints for protease:substrate recognition. We used a pretrained PGCN model to guide the design of protease libraries for cleaving two noncanonical substrates, and found good agreement with experimental cleavage results. Importantly, the model can accurately assess designs featuring diversity at positions not present in the training data. The described methodology should enable the structure-based prediction of specificity landscapes of a wide variety of proteases and the construction of tailor-made protease editors for site-selectively and irreversibly modifying chosen target proteins.

Keywords: geometric machine learning; machine learning; protease specificity; protein design; yeast surface display.

Conflict of interest statement

The authors declare no competing interest.

Figures

**Fig. 1.**
Architecture of PGCN. (A) Peptide substrate (blue) in the binding pocket (yellow) of HCV protease (gray). The seven-residue substrate spans P6 to P1′, with cleavage between P1 and P1′. The logo plot indicates the substrate sequences in the training set, where P1 and P1′ were kept constant, and P6 to P2 were variable. (B) Molecular depiction of the nodes and edges as a graph. Each substrate (blue) and binding pocket (yellow) amino acid constitutes a node of the graph. Gray lines between pairs of residues denote edges between pairs of nodes. (C) PGCN model architecture. Nodes are represented as an $N \times F$ matrix of nodes and node features. Edges are represented as an $N \times N \times M$ tensor of node pairs and edge features, flattened by a weighted sum over the edge features. The PGCN model ultimately outputs the probabilities of the given substrate belonging to each class, cleaved and uncleaved.
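The tensor shapes described in panel C can be made concrete with a small sketch. The pure-Python example below is illustrative only and is not the authors' implementation: it collapses an $N \times N \times M$ edge tensor into an $N \times N$ weighted adjacency by a weighted sum over the $M$ edge features, applies one generic graph-convolution step to the $N \times F$ node matrix, mean-pools over nodes, and returns softmax probabilities for the two classes. The function names, the ReLU and mean-pooling choices, the toy sizes, and the random weights are all assumptions.

```python
import math
import random

def flatten_edges(E, w):
    """Collapse an N x N x M edge tensor to N x N via a weighted sum over the M edge features."""
    return [[sum(wk * e for wk, e in zip(w, pair)) for pair in row] for row in E]

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def graph_conv(A, X, W):
    """One generic graph-convolution step (assumed form): ReLU(A @ X @ W)."""
    H = matmul(matmul(A, X), W)
    return [[max(0.0, h) for h in row] for row in H]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def predict(X, E, edge_w, W_conv, W_out):
    """Return [p_cleaved, p_uncleaved] for one protease-substrate graph."""
    A = flatten_edges(E, edge_w)                 # N x N
    H = graph_conv(A, X, W_conv)                 # N x HIDDEN
    n = len(H)
    pooled = [sum(col) / n for col in zip(*H)]   # mean over nodes
    logits = matmul([pooled], W_out)[0]          # two class logits
    return softmax(logits)

# Toy sizes and random weights (hypothetical, for shape checking only)
rng = random.Random(0)
N, F, M, HIDDEN = 5, 4, 3, 6
X = [[rng.random() for _ in range(F)] for _ in range(N)]
E = [[[rng.random() for _ in range(M)] for _ in range(N)] for _ in range(N)]
edge_w = [rng.random() for _ in range(M)]
W_conv = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(F)]
W_out = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(HIDDEN)]
probs = predict(X, E, edge_w, W_conv, W_out)
```

In a trained model the edge weights and layer weights would be learned; here they only demonstrate how the node matrix and edge tensor combine into a two-class output.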

**Fig. 2.**
PGCN performance. We evaluate models on six datasets: four that each consist of a single HCV variant (WT, A171T, D183A, or Triple (R170K, A171T, D183A)) with various substrates, and two Combined datasets, one of which pools the other four and the other of which consists of 10 TEV variants. (A–C) The radar plots show polygon patterns of average test accuracies across three seeds of benchmarked ML models (labeled in different colors) on the five datasets. The highest accuracy is on the polygon periphery. (D) Accuracy barplot of model prediction performance on the TEV Combined protease specificity data. In this analysis, the substrates were clustered based on their sequences to ensure that the training, validation, and test sets have distinct sequence patterns.

**Fig. 3.**
Node/edge importance analysis for HCV protease. (A) Relative importance of major physical groups of nodes/edges. Node/edge importance is measured as the decrease in accuracy upon perturbation, normalized by the original test accuracy of the PGCN model trained on each of five HCV protease variants using sequence-only (S), energy-only (E), or both (S+E) features. All nodes and edges in the PGCN graph are grouped into five major physical groups, and each group’s total relative importance is aggregated as the sum of importance scores from nodes (or edges) of the specified type. (B–E) Structural representations of node/edge importance for HCV protease (PDB ID 3M5N; 3M5L), calculated by the energy-only PGCN model trained on (B) wild type; (C) A171T; (D) D183A; (E) Triple data. Only nodes/edges whose relative importance scores are within the first 25% quantile are displayed in the zoomed-in protease structure (gray), with the substrate (cyan) as the center and the catalytic triad (magenta) as the reference. Among the nodes that meet the criteria, the relative importance levels of protease nodes (orange) are shown by the thickness of the corresponding residue side chains, while those of peptide nodes (cyan) are reflected by the sizes of the corresponding residue spheres at CB. Among the important edges that meet the criteria, different groups are highlighted in different colors, including peptide edges (blue), protease edges (green), and intermolecular edges (yellow). All residues related to node/edge importance are labeled with residue identifiers in one-letter code (colored in red if mutation sites), including protease residues that are only related to edges (violet).
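The importance score described in panel A can be written as $(a_0 - a_p)/a_0$, where $a_0$ is the original test accuracy and $a_p$ is the accuracy after perturbing a node or edge. The sketch below, under that reading of the caption, computes the score and sums it per group. The element names, group labels, and accuracy values are hypothetical placeholders, not results from the paper.

```python
def relative_importance(original_acc, perturbed_acc):
    """Decrease in accuracy upon perturbation, normalized by the original accuracy."""
    return (original_acc - perturbed_acc) / original_acc

def group_importance(original_acc, perturbed_accs, groups):
    """Sum the relative importance of every node/edge within each group."""
    totals = {}
    for element, acc in perturbed_accs.items():
        group = groups[element]
        totals[group] = totals.get(group, 0.0) + relative_importance(original_acc, acc)
    return totals

# Hypothetical accuracies after perturbing each graph element in turn
original_acc = 0.90
perturbed_accs = {"node_a": 0.81, "node_b": 0.72, "edge_ab": 0.45}
groups = {"node_a": "peptide node", "node_b": "peptide node", "edge_ab": "intermolecular edge"}
totals = group_importance(original_acc, perturbed_accs, groups)
```

With these placeholder values, the two peptide nodes contribute about 0.1 and 0.2 to their group and the single intermolecular edge contributes about 0.5.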

**Fig. 4.**
Node and edge importance contribution for TEV. (A) Relative importance of major physical groups of nodes/edges. (B–E) Partial structural representations of node and edge importance in the context of TEV (PDB ID 1LVB; 1LVM) to show: (B) important protease nodes; (C) intermolecular edges that presumably form hydrogen bonds; (D) the subpocket surrounding P3, including P3-S170, P3-F172, P3-F217; (E) the subpocket surrounding P1, including intermolecular edges P1-D148, P1-H167, P1-S170. The same color setting is used as in Fig. 3. The catalytic triad in each structure serves only as a positional reference. All mutations are highlighted in red.

**Fig. 5.**
PGCN generalizability. (A) AUC for HCV cross-test among HCV WT, HCV A171T, HCV D183A, HCV Triple data using sequence+energy features. (B) AUC for leave-one-out tests of three TEV variants which have unique mutations among all TEV variants.

**Fig. 6.**
Pipeline for TEV protease design, including (A) computational design, (B) PGCN prediction, (C) yeast-based assay testing using FACS, and (D) flow cytometry–based analysis of individual colonies. (A) S2 and S6 pockets corresponding to altered substrate residues were inferred from the crystal structure of a TEV protease–substrate complex, and redesigned using Rosetta. (B) Rosetta-generated designs were evaluated using a pretrained PGCN model, and mutations enriched in high-scoring designs were identified and used to generate combinatorial protease libraries. (C) These libraries were screened using the YESS assay with P6 and P2 variant substrates, and pools of cells corresponding to cleaved and uncleaved protease:substrate variant pairs were isolated using FACS. (D) Individual colonies isolated from cleaved and uncleaved pools were tested using flow cytometry. 2-D scatter plots for positive P2-targeted (*Top-Left*) and P6-targeted (*Top-Right*) designs and negative designs (P2: *Bottom-Left*, P6: *Bottom-Right*) are shown. (E) Comparison table of PGCN prediction and experimental results on 19 clonally tested designs.

**Fig. 7.**
TEV input data variety. (A) TEV protease mutation sites (gray) are shown together in the TEV WT protease structure, located around the substrate (cyan). (B) Substrate sequence logos for TEV input data at positions P6 to P1′. The x axis splits the data into cleaved (above the axis) and uncleaved (below the axis) populations: the more frequently an amino acid appears in the cleaved population, the higher it is placed in the logo plot, and the more frequently it appears in the uncleaved population, the lower it is placed. (C) Sankey diagram of mutation sites for TEV variants, combined with a horizontal barplot showing the number of samples for different TEV variants (along the x axis). Variants are named based on their mutation sites except for the following four variants: Var1 (T146S_D148P_S153N_S170A_N177M), Var2 (E107D_D127A_S135F_R203Q_K215E), Var3 (T17S_N68D_E107D_D127A_F132L_S135F_F162S_K229E), and L2F. See Dataset S5 for a table of the complete mutation sites.

Update of

Prediction and Design of Protease Enzyme Specificity Using a Structure-Aware Graph Convolutional Network.
Lu C, Lubin JH, Sarma VV, Stentz SZ, Wang G, Wang S, Khare SD. bioRxiv [Preprint]. 2023 Feb 16:2023.02.16.528728. doi: 10.1101/2023.02.16.528728. Update in: Proc Natl Acad Sci U S A. 2023 Sep 26;120(39):e2303590120. doi: 10.1073/pnas.2303590120. PMID: 36824945. Free PMC article. Preprint.

Cited by

Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning.
Clark JD, Mi X, Mitchell DA, Shukla D. Digit Discov. 2024 Dec 2;4(2):343-354. doi: 10.1039/d4dd00170b. eCollection 2025 Feb 12. PMID: 39649639. Free PMC article.
Data-driven protease engineering by DNA-recording and epistasis-aware machine learning.
Huber L, Kucera T, Höllerer S, Borgwardt K, Panke S, Jeschek M. Nat Commun. 2025 Jul 1;16(1):5466. doi: 10.1038/s41467-025-60622-7. PMID: 40593579. Free PMC article.
Advances in ligand-specific biosensing for structurally similar molecules.
Xi C, Diao J, Moon TS. Cell Syst. 2023 Dec 20;14(12):1024-1043. doi: 10.1016/j.cels.2023.10.009. PMID: 38128482. Free PMC article. Review.
Protease engineering: Approaches, tools, and emerging trends.
Martinusen SG, Nelson SE, Slaton EW, Long LF, Pho R, Ajayebi S, Denard CA. Biotechnol Adv. 2025 Sep;82:108602. doi: 10.1016/j.biotechadv.2025.108602. Epub 2025 May 12. PMID: 40368116. Free PMC article. Review.
Substrate Prediction for RiPP Biosynthetic Enzymes via Masked Language Modeling and Transfer Learning.
Clark JD, Mi X, Mitchell DA, Shukla D. ArXiv [Preprint]. 2024 Feb 23:arXiv:2402.15181v1. Update in: Digit Discov. 2024 Dec 2;4(2):343-354. doi: 10.1039/d4dd00170b. PMID: 38463513. Free PMC article. Preprint.
