Research (Wash D C). 2023 Sep 11:6:0219.
doi: 10.34133/research.0219. eCollection 2023.

Inferring the Effects of Protein Variants on Protein-Protein Interactions with Interpretable Transformer Representations

Zhe Liu et al. Research (Wash D C).

Abstract

Identifying pathogenic variants and inferring their impact on protein-protein interactions shed light on their functional consequences in disease. Limited by the availability of experimental data on the consequences of protein interaction, most existing methods focus on building models to predict changes in protein binding affinity. Here, we introduce MIPPI, an end-to-end, interpretable transformer-based deep learning model that learns features directly from sequences by leveraging the interaction data from IMEx. MIPPI was specifically trained to determine the type of variant impact (increasing, decreasing, disrupting, or no effect) on protein-protein interactions. We demonstrate the accuracy of MIPPI and interpret it through an analysis of the learned attention weights, which correlate with the amino acids interacting with the variant. Moreover, we show the practicality of MIPPI in prioritizing de novo mutations associated with complex neurodevelopmental disorders and its potential to identify pathogenic and driver mutations. Finally, we experimentally validated the functional impact of several variants identified in patients with such disorders. Overall, MIPPI emerges as a versatile, robust, and interpretable model, capable of effectively predicting mutation impacts on protein-protein interactions and facilitating the discovery of clinically actionable variants.

Figures

Fig. 1.
Overview of MIPPI. (A) MIPPI is designed to classify variant impacts on PPI into one of the following 4 categories: “increasing,” “decreasing,” “disrupting,” and “no effect.” (B) The unmutated (reference) and mutated sequences of the protein affected by the missense mutation were each cropped into a 51-residue fragment centered on the mutation site. The first 1,024 residues of the partner protein were also used as input. If a partner protein was shorter than 1,024 residues, 0s were appended after the natural sequence as padding until the length reached 1,024. (C) Each position in the reference, mutated, and partner sequences was encoded as a 64D sequential feature (SF) formed by concatenating a 20D PSSM profile and a 44D embedding vector. (D) Architecture of MIPPI. All the encoded protein sequences were fed into a deep learning framework. For a given mutation, MIPPI first assigns each protein sequence position a 64D position encoding (PE) feature and concatenates it with the SF, making each input feature a 128D vector. Next, all features are input into a deep neural network that includes cascaded transformer encoders, residual blocks, a 1D convolutional layer, a GAP layer, and a SoftMax layer. Finally, MIPPI returns the probabilities of the 4 mutation impact categories, and the category with the highest probability is taken as the prediction.
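The input construction described in panel (B) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the 0-based mutation index, and the `"0"` padding character are all assumptions made for the example. The caption does not say how a mutation close to a sequence end is handled, so the sketch pads both ends to keep the mutation at the center of the window.

```python
WINDOW = 51         # fragment length given in the caption
PARTNER_LEN = 1024  # partner length given in the caption
PAD = "0"           # hypothetical padding symbol standing in for the caption's 0s


def apply_missense(seq: str, pos: int, alt: str) -> str:
    """Return seq with the residue at 0-based index pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise ValueError("mutation position outside sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def crop_window(seq: str, pos: int, window: int = WINDOW, pad: str = PAD) -> str:
    """Return a fragment of length `window` centered on 0-based index pos."""
    if not 0 <= pos < len(seq):
        raise ValueError("mutation position outside sequence")
    half = window // 2
    left = max(0, pos - half)
    right = min(len(seq), pos + half + 1)
    # Assumption: pad whichever side runs off the sequence so that the
    # mutation site always sits at index `half` of the fragment.
    left_pad = pad * (half - (pos - left))
    right_pad = pad * (half - (right - 1 - pos))
    return left_pad + seq[left:right] + right_pad


def pad_partner(seq: str, length: int = PARTNER_LEN, pad: str = PAD) -> str:
    """Keep the first `length` residues and right-pad shorter sequences."""
    return seq[:length].ljust(length, pad)


def build_inputs(ref_seq: str, pos: int, alt: str, partner_seq: str):
    """Return (reference fragment, mutated fragment, padded partner)."""
    mut_seq = apply_missense(ref_seq, pos, alt)
    return crop_window(ref_seq, pos), crop_window(mut_seq, pos), pad_partner(partner_seq)
```

Calling `build_inputs(ref_seq, pos, alt, partner_seq)` yields two 51-character fragments that differ only at index 25 and one 1,024-character partner string. Note that variant notation such as p.Asn26His is 1-based, so the position would need to be shifted by one before use here.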
Fig. 2.
Ablation study and the evaluation of the data augmentation strategy. (A) Ablation study testing the contribution of different components of MIPPI. The He uniform initialization layer [25] was replaced by the Glorot uniform initialization layer [26], the GAP layer was replaced by a fully connected layer, and the PSSM input feature was removed, each in a separate model. Detailed substructure test performance is presented in Table S5. (B) The ablation study showed the accuracy contribution of the He uniform initialization layer, the GAP layer, and the input feature of PSSM. (C) The ablation study of the partner protein verified the necessity of inputting the real partner protein to MIPPI. The “MIPPI-no partner” model was trained with the PPI partner input removed. The “MIPPI-random real partner” model was trained by randomly selecting a protein in the original training set as the PPI partner input of MIPPI. The “MIPPI-random partner” model was trained by taking randomized sequences (not real protein sequences) as the PPI partner input of MIPPI. (D) In the same 2-class classification as (B), the ablation study showed the accuracy contribution of various input strategies for PPI partner information. (E) Proportions of ΔΔG entries predicted by MIPPI on the original and reversed entries of the SKEMPI v2 dataset described in Fig. S6, with data augmentation (DA) and without data augmentation (no DA). The upper panel shows the prediction results of the entries with ΔΔG > −1 kcal/mol (2,797 entries), and the lower panel shows the ones with ΔΔG ≤ −1 kcal/mol.
Fig. 3.
Mapping attention weights to protein sequences. (A) For each PPI partner protein in the training dataset, the top 5 residues with the highest attention weights in each attention head (Attention Head No.1 to No.4, 20,615 residues in total) were considered to explore the relationship between the attention weight of residues and their locations (in the PPI interface or not). N_interface represents the number of residues in each attention head located in the PPI interface, while N_not interface represents the number of residues not located in the PPI interface. The attention weights here were exported from the last multihead self-attention layer, and the weights of padding positions were discarded. The Mann–Whitney U test was performed to test the distribution difference of the attention weights of residues. (B) Human signaling heterodimer (PDB structure ID: 1E96) with 2 chains (protein RAC1 as chain 1E96_A and protein NCF2 as chain 1E96_B) used to illustrate the weights of the second attention head of the last multihead self-attention layer from MIPPI models. RAC1 is shown in light green with a missense mutation (p.Asn26His, shown in orange). NCF2 is shown in light gray with attention weights mapped on its surface (the darker the color, the higher the attention weight). (C) MIPPI model output of the attention weight distribution in 1E96_B at the residue level (total chain length: 185 AA). The green-shaded region represents the PPI interface.
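The selection step in panel (A) can be illustrated with a short Python sketch. It assumes one attention weight per partner position for a single head, a known count of real (non-padding) positions, and a set of 0-based interface indices. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def top_k_residues(weights, n_real, k=5):
    """Indices of the k highest weights among the first n_real positions.

    Positions at or beyond n_real are treated as padding and discarded.
    """
    ranked = sorted(range(n_real), key=lambda i: weights[i], reverse=True)
    return ranked[:k]


def split_by_interface(top_idx, interface):
    """Split selected residue indices into interface and non-interface lists."""
    inside = [i for i in top_idx if i in interface]
    outside = [i for i in top_idx if i not in interface]
    return inside, outside


# Toy example: 8 real residues followed by 2 padding positions.
weights = [0.01, 0.30, 0.05, 0.20, 0.02, 0.15, 0.10, 0.07, 0.90, 0.80]
interface = {1, 3, 6}  # hypothetical interface positions
top = top_k_residues(weights, n_real=8, k=5)
inside, outside = split_by_interface(top, interface)
n_interface, n_not_interface = len(inside), len(outside)
```

In this toy case the two padding positions carry the largest weights but are ignored, and three of the five selected residues fall in the interface. The attention weights of the two groups could then be passed to a two-sample test such as `scipy.stats.mannwhitneyu`, which is one way to run the Mann–Whitney U test named in the caption.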
Fig. 4.
Prediction of the special entries in the IMEx dataset and model evaluation against SIFT using missense variants from the PsyMuKB [22] database. (A) Prediction results of “causing” entries, compared with the prediction distribution of the whole training dataset. (B) Prediction distribution of conflicting entries, classified by whether they have been reported in the literature. “With agreement” means that MIPPI’s prediction is the same as that of at least one of the reported references. (C) Donut charts showing the distribution of MIPPI prediction results on the developmental delay (DD)-correlated entries and the control entries from PsyMuKB. (D) Distribution of the cosine similarity (ranging from 0 to 100%) of the 4 prediction scores from MIPPI between partner protein pairs of the same DD-correlated mutated protein from the PsyMuKB dataset; all the mutations were pathogenic with a CADD score larger than 30. There are 396,604 partner pairs from 3,051 mutated proteins with 13,237 partners. (E) Co-IP experiment results. SETD2 reference and 3 selected variants (S1624C, Y1666C, and L1815T) interact with SMAD3 and TP53, with MIPPI prediction class and score.
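The comparison in panel (D) rests on the cosine similarity between two 4-element score vectors, one per partner of the same mutated protein. The sketch below shows that calculation for every partner pair. The function names, the partner labels, and the score values are made up for illustration and are not results from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("score vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarity(scores_by_partner):
    """Cosine similarity for every unordered pair of partners."""
    return {
        (p, q): cosine_similarity(scores_by_partner[p], scores_by_partner[q])
        for p, q in combinations(sorted(scores_by_partner), 2)
    }


# Hypothetical 4-class scores for one mutation against three partners.
scores = {
    "partner_A": [0.70, 0.10, 0.10, 0.10],
    "partner_B": [0.65, 0.15, 0.10, 0.10],
    "partner_C": [0.05, 0.05, 0.80, 0.10],
}
similarities = pairwise_similarity(scores)
```

Because SoftMax outputs are non-negative, the similarity lies between 0 and 1, which matches the 0 to 100% range in the caption. A low value for a pair suggests that the predicted impact of the same mutation depends on which partner is considered.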
