Genomics Proteomics Bioinformatics. 2022 Oct;20(5):882-898.
doi: 10.1016/j.gpb.2022.11.008. Epub 2022 Dec 6.

Machine Learning Modeling of Protein-intrinsic Features Predicts Tractability of Targeted Protein Degradation

Wubing Zhang et al. Genomics Proteomics Bioinformatics. 2022 Oct.

Abstract

Targeted protein degradation (TPD) has rapidly emerged as a therapeutic modality to eliminate previously undruggable proteins by repurposing the cell's endogenous protein degradation machinery. However, the susceptibility of proteins for targeting by TPD approaches, termed "degradability", is largely unknown. Here, we developed a machine learning model, model-free analysis of protein degradability (MAPD), to predict degradability from features intrinsic to protein targets. MAPD shows accurate performance in predicting kinases that are degradable by TPD compounds [with an area under the precision-recall curve (AUPRC) of 0.759 and an area under the receiver operating characteristic curve (AUROC) of 0.775] and is likely generalizable to independent non-kinase proteins. We found five features with statistical significance to achieve optimal prediction, with ubiquitination potential being the most predictive. By structural modeling, we found that E2-accessible ubiquitination sites, but not lysine residues in general, are particularly associated with kinase degradability. Finally, we extended MAPD predictions to the entire proteome to find 964 disease-causing proteins (including proteins encoded by 278 cancer genes) that may be tractable to TPD drug development.
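As a rough illustration of the two metrics quoted in the abstract, the sketch below computes AUROC and an AUPRC estimate from a list of binary labels and model scores. It uses average precision as the AUPRC estimator, which is an assumption; the abstract does not say which estimator the authors used. The labels and scores are made-up toy values, not MAPD output.

```python
def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """AUPRC estimate: mean of the precision measured at each positive's rank.

    Tied scores keep their input order, so results can depend on ordering
    when ties are present.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Toy example (hypothetical values): 1 = degradable, 0 = not degradable.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
toy_auprc = average_precision(toy_labels, toy_scores)
```

For imbalanced label sets, the AUPRC baseline equals the fraction of positives, whereas the AUROC baseline is 0.5, which is one reason to report both.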

Keywords: Degradability; Machine learning; Protein-intrinsic feature; Targeted protein degradation; Ubiquitination.

Conflict of interest statement

X. Shirley Liu is a cofounder, board member, and CEO of GV20 Therapeutics. Eric S. Fischer is a founder, member of the scientific advisory board (SAB), and equity holder of Civetta Therapeutics, Lighthorse Therapeutics, Proximity Therapeutics, and Neomorph Inc (board of directors), SAB member and equity holder in Avilar Therapeutics and Photys Therapeutics, and a consultant to Astellas, Sanofi, Novartis, Deerfield, and EcoR1 capital. The Fischer laboratory receives or has received research funding from Novartis, Deerfield, Ajax, Interline, and Astellas. Katherine A. Donovan is a consultant to Kronos Bio and Neomorph Inc. All the other authors declared no competing interests.

Figures

Figure 1
Study overview. The ubiquitin–proteasome system can be repurposed by a PROTAC or other small molecules to degrade a POI. However, it remains to be answered which proteins are amenable to this approach (left). Here, we associated kinase degradability with protein-intrinsic features spanning protein expression, PTMs, protein length, PPIs, protein stability, and protein half-life to identify predictive factors (middle). Based on the predictive features, we developed a machine learning model to predict protein degradability (right). PROTAC, proteolysis targeting chimera; POI, protein of interest; PTM, post-translational modification; PPI, protein–protein interaction; Ub, ubiquitination; Ac, acetylation; P, phosphorylation; Su, sumoylation; Me, methylation.
Figure 2
Kinase degradability is associated with features intrinsic to the target. A. Dot plot showing the frequency of degradation and maximal degradation of protein kinases induced by multi-kinase degraders from the study by Donovan and colleagues. Orange dots represent the kinases with high degradability, and light blue dots represent the kinases with low degradability. B. Pairwise Spearman’s correlation of 42 protein-intrinsic features spanning protein stability, PTM, PPI, protein length, protein half-life, protein expression, protein detectability, and others. C. Bar diagram showing the association between the degradability of kinases and their features. The x-axis shows the abbreviated names of protein-intrinsic features (see Table S2 for full details): Ubiquitination_1 and Ubiquitination_2 indicate the proportion of lysine residues with reported ubiquitination events in at least one (Ubiquitination_1) or two (Ubiquitination_2) references from the PhosphoSitePlus database, Zecha2018_HeLa_Halflife indicates the protein half-lives profiled in the HeLa cell line from the study by Zecha and colleagues, and MOLT4_RNA indicates the mRNA expression in the MOLT4 cell line. The y-axis shows the Wilcoxon Z-statistics indicating the association between protein degradability and each protein-intrinsic feature. *, false discovery rate < 0.05; ns, not significant. FC, fold change.
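The two quantities named in panel C can be sketched as follows: a ubiquitination-potential feature (the proportion of lysines with a ubiquitination event supported by at least a given number of references) and a Wilcoxon rank-sum Z-statistic comparing a feature between high- and low-degradability kinases. The function names and toy numbers are hypothetical, and the Z-statistic uses the plain normal approximation with no tie or continuity correction, which may differ from the authors' exact procedure.

```python
import math


def ubiquitination_potential(n_lysines, refs_per_ub_site, min_refs):
    """Proportion of lysines whose ubiquitination is backed by >= min_refs references.

    refs_per_ub_site holds one reference count per reported Ub site.
    """
    if n_lysines <= 0:
        return 0.0
    supported = sum(1 for refs in refs_per_ub_site if refs >= min_refs)
    return supported / n_lysines


def wilcoxon_z(high, low):
    """Rank-sum Z via the Mann-Whitney U statistic (normal approximation).

    Positive values mean the feature tends to be larger in the `high` group.
    """
    n1, n2 = len(high), len(low)
    u = 0.0
    for h in high:
        for l in low:
            if h > l:
                u += 1.0
            elif h == l:
                u += 0.5
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u


# Toy protein: 20 lysines, four reported Ub sites with 1, 2, 3 and 1 references.
ub_1 = ubiquitination_potential(20, [1, 2, 3, 1], min_refs=1)  # Ubiquitination_1 analogue
ub_2 = ubiquitination_potential(20, [1, 2, 3, 1], min_refs=2)  # Ubiquitination_2 analogue

# Toy feature values for high- versus low-degradability groups.
z = wilcoxon_z([5, 6, 7], [1, 2, 3])
```

Requiring two references rather than one trades coverage for confidence in each reported site, which is presumably why both variants appear as separate features.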
Figure 3
Development of MAPD. A. Precision–recall curves showing the performance of six machine learning models based on 20-fold cross-validation. B. Precision–recall curves showing the performance of MAPD and models trained on individual features or a combination of features. ‘PTMs’ indicates the model trained on the combination of ubiquitination potential (Ubiquitination_2), acetylation potential (Acetylation_1), and phosphorylation potential (Phosphorylation_2). ‘Ubiquitination_2’ indicates the model trained on ubiquitination potential. ‘HeLa_Halflife’ indicates the model trained on a single feature describing half-life in HeLa cells. ‘Length’ indicates the model trained on protein length. ‘Phosphorylation_2’ indicates the model trained on phosphorylation potential. MAPD, model-free analysis of protein degradability; RF, random forest; svmRadial, radial-kernel support vector machine; NB, naive Bayes; LR, logistic regression; svmLinear, linear-kernel support vector machine; KNN, k-nearest neighbor; AUPRC, area under the precision–recall curve.
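A minimal sketch of the k-fold cross-validation scheme mentioned in panel A: every sample is scored by a model that never saw it during training, and the pooled out-of-fold scores are what a precision–recall curve would then be drawn from. The toy classifier `fit_midpoint_scorer` is a hypothetical stand-in and is not one of the six models compared in the figure; the data are invented, and plain (unstratified) folds are an assumption.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_midpoint_scorer(X, y):
    """Hypothetical toy model: project onto the difference of class means,
    centred on their midpoint. Assumes both classes occur in the training split."""
    dim = len(X[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    mu_pos = mean([x for x, t in zip(X, y) if t == 1])
    mu_neg = mean([x for x, t in zip(X, y) if t == 0])
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    mid = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return lambda x: sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))


def cross_val_scores(X, y, k, fit, seed=0):
    """Return one out-of-fold score per sample, pooled across the k folds."""
    scores = [None] * len(y)
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            scores[i] = predict(X[i])
    return scores


# Toy data: one feature, four high-degradability (1) and four low (0) samples.
X = [[5.0], [6.0], [7.0], [8.0], [0.0], [0.1], [0.2], [0.3]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
oof = cross_val_scores(X, y, k=4, fit=fit_midpoint_scorer)
```

Passing `k=20` would mirror the fold count stated in the caption; with small or imbalanced training sets, stratifying the folds by label is a common refinement.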
Figure 4
MAPD shows good performance in predicting kinase degradability. A. Venn diagram showing the overlap between kinases degraded by multi-kinase degraders from the study by Donovan and colleagues, PROTAC targets reported in PROTAC databases [including PROTAC-DB and PROTACpedia (https://protacdb.weizmann.ac.il/ptcb/main)], and degradable kinases identified by MAPD. B. Scatter plot showing the Spearman’s correlation between MAPD scores and frequencies of degradation of all degradable kinases from the study by Donovan and colleagues. C. Venn diagram showing the overlap between degradable kinases identified by MAPD, PROTACtable kinases, and ligandable kinases. D. Box plot showing ubiquitination potential of MAPD-specific targets and PROTACtable-specific targets. ****, P < 0.0001. E. Lollipop diagram showing the reported Ub sites in MAP3K4 (PROTACtable-specific target) and AGK (MAPD-specific target). The number in each circle indicates the number of references for that Ub site in PhosphoSitePlus, and a blank circle indicates that only one reference is available. The blue text near each circle indicates the location of the Ub site. F. Heatmap showing annotations of the top 50 predicted degradable kinases, with MAPD scores shown at the top. ‘PROTAC-DB’ and ‘PROTACpedia’ indicate whether a kinase has a developed degrader reported in the respective databases. ‘Multi-kinase degrader’ indicates whether a protein is degraded by a multi-kinase degrader. ‘DrugBank’ indicates whether a protein has FDA-approved drugs recorded in the DrugBank database. ‘ChEMBL’ indicates whether a protein has ligands recorded in the ChEMBL database. ‘Electrophiles’ indicates whether a protein has ligandable cysteines from the SLCABPP. ‘OncoKB’ indicates whether a protein is considered to be encoded by a cancer gene in the OncoKB database. ‘ClinVar’ indicates whether the protein is associated with a disease in the ClinVar database. FDA, United States Food and Drug Administration; SLCABPP, streamlined cysteine activity-based protein profiling.
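Panel B relies on Spearman’s correlation, which is Pearson’s correlation computed on ranks. The sketch below shows that calculation with tied values sharing their average rank; the variable names and numbers are hypothetical and do not come from the study.

```python
def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the midranks."""
    return pearson(midranks(x), midranks(y))


# Toy example: predicted scores versus how often each kinase was degraded.
toy_mapd_scores = [0.2, 0.5, 0.4, 0.9]
toy_degradation_freq = [1, 2, 3, 4]
rho = spearman(toy_mapd_scores, toy_degradation_freq)
```

Because only ranks matter, the statistic is unchanged by any monotonic rescaling of the scores, which suits a comparison between model scores and count-like degradation frequencies.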
Figure 5
MAPD predicts proteome-wide degradability. A. Box plot showing the MAPD scores of non-kinase PROTAC targets from PROTAC databases [including PROTAC-DB and PROTACpedia (https://protacdb.weizmann.ac.il/ptcb/main)] and other non-kinase drug targets from DrugBank. ****, P < 0.0001. B. Scatter plot showing the correlation between MAPD scores and the frequencies of degradation of IMiD targets by CRBN-recruiting degraders from the study by Donovan and colleagues. C. Ranked dot plot showing the MAPD scores of human TFs. TFs with reported degraders are labeled in the figure. The histogram at right shows the distribution of MAPD scores of all human TFs, and the red dashed line shows the threshold for identifying degradable proteins by MAPD. D. Venn diagram showing the overlap of degradable non-kinase proteins between MAPD predictions and PROTACtable genome. E. Box plot showing the ubiquitination potential in MAPD-specific targets and PROTACtable-specific targets. ****, P < 0.0001. F. Heatmap showing annotations of the top 30 predicted degradable non-kinase proteins, with MAPD scores shown at the top. ‘PROTAC-DB’ and ‘PROTACpedia’ indicate whether a protein has a developed degrader reported in the respective databases. ‘DrugBank’ indicates whether a protein has FDA-approved drugs recorded in the DrugBank database. ‘ChEMBL’ indicates whether a protein has ligands recorded in the ChEMBL database. ‘Electrophiles’ indicates whether a protein has ligandable cysteines from the SLCABPP. ‘OncoKB’ indicates whether a protein is considered to be encoded by a cancer gene in the OncoKB database. ‘ClinVar’ indicates whether the protein is associated with a disease in the ClinVar database. TF, transcription factor; IMiD, immunomodulatory drug.
Figure 6
E2 accessibility of Ub sites is associated with protein degradability. A. Diagram showing how to estimate the accessibility of lysine/Ub sites to the E2 enzyme in the degrader-induced ternary complex. The model of CDK1 (PDB: 4Y72) docked to the CRBN-lenalidomide structure (PDB: 5FQD) is shown as an example. The E3 ubiquitin ligase complex consists of CRBN, DDB1, CUL4A, and CUL4B, shown in green, pink, light gray, and gray, respectively. CDK1 is the target protein, shown in yellow. The RBX1 fragment (shown in orange) is used to estimate the position of the E2 enzyme and the corresponding Ub zone in the target protein. Lysine/Ub sites in the Ub zone are estimated by drawing two planes with respect to the positions of CRBN and the target kinase. The sites lying in the quadrant facing the putative position of the E2 enzyme, estimated by the placement of RBX1, are considered accessible. The predicted E2-accessible and E2-inaccessible lysine residues are highlighted in blue and red, respectively. For each target protein, the 200 top-scoring feasible models are selected for evaluating the accessibility of lysine residues to the E2 enzyme. For each Ub site, the fraction of feasible models with the site in the Ub zone is estimated as its E2 accessibility. B. Box plots showing the associations of kinase degradability with the total number of Ub sites (left) and the number of E2-accessible Ub sites (right) in the kinases, respectively. The E2-accessible Ub sites are defined as the Ub sites lying in the Ub zone of more than 50% of feasible models. C. Density plot showing the null distribution of Wilcoxon Z-statistics generated by shuffling Ub sites among all lysine residues 10,000 times. The red dashed line indicates the observed Wilcoxon Z-statistic representing the association between protein degradability and the number of E2-accessible Ub sites. D. Dot plot showing the total number of resolved Ub sites and the number of E2-accessible Ub sites. E. Box plot showing the number of CRBN-recruiting degraders that degrade kinases with high (> 1) and low (≤ 1) levels of E2-accessible Ub sites. All kinases involved in this analysis have at least two reported Ub sites, which reduces the confounding effect derived from the difference in the total number of Ub sites.
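The accessibility definition in panels A to C can be sketched for a single protein as follows. Each lysine carries one boolean per feasible docking model saying whether it lies in the Ub zone; its E2 accessibility is the fraction of models where it does, and a Ub site counts as E2-accessible above a 50% cutoff. The shuffling step assumes that Ub labels are reassigned among the lysines of the same protein while keeping the number of Ub sites fixed, which is one reading of the caption. Site names and flags are invented, and the sketch yields a per-protein null of counts rather than the across-kinase Wilcoxon Z-statistics shown in the figure.

```python
import random


def e2_accessibility(in_zone_flags):
    """Fraction of feasible docking models in which the site falls in the Ub zone."""
    return sum(1 for flag in in_zone_flags if flag) / len(in_zone_flags)


def count_accessible_ub_sites(zone_flags, ub_sites, cutoff=0.5):
    """Number of Ub sites in the Ub zone in strictly more than `cutoff` of models."""
    return sum(1 for site in ub_sites if e2_accessibility(zone_flags[site]) > cutoff)


def shuffled_ub_sites(zone_flags, n_ub, rng):
    """One null replicate: pick n_ub lysines at random to act as Ub sites."""
    return rng.sample(sorted(zone_flags), n_ub)


# Hypothetical protein with three lysines scored over 200 feasible models.
zone_flags = {
    "K_a": [True] * 150 + [False] * 50,   # in the Ub zone in 75% of models
    "K_b": [True] * 100 + [False] * 100,  # exactly 50%, so not counted
    "K_c": [False] * 200,                 # never in the Ub zone
}
observed = count_accessible_ub_sites(zone_flags, ["K_a", "K_b"])

# Null distribution of the count when two Ub sites are placed at random.
rng = random.Random(0)
null_counts = [
    count_accessible_ub_sites(zone_flags, shuffled_ub_sites(zone_flags, 2, rng))
    for _ in range(10_000)
]
```

Comparing the observed count against `null_counts` asks whether reported Ub sites are more E2-accessible than lysines in general, which is the distinction the abstract draws.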

