Science. 2023 Jul 28;381(6656):eadh1720.
doi: 10.1126/science.adh1720. Epub 2023 Jul 28.

Deploying synthetic coevolution and machine learning to engineer protein-protein interactions


Aerin Yang et al. Science. 2023.

Abstract

Fine-tuning of protein-protein interactions occurs naturally through coevolution, but this process is difficult to recapitulate in the laboratory. We describe a platform for synthetic protein-protein coevolution that can isolate matched pairs of interacting muteins from complex libraries. This large dataset of coevolved complexes drove a systems-level analysis of molecular recognition between Z domain-affibody pairs spanning a wide range of structures, affinities, cross-reactivities, and orthogonalities, and captured a broad spectrum of coevolutionary networks. Furthermore, we harnessed pretrained protein language models to expand, in silico, the amino acid diversity of our coevolution screen, predicting remodeled interfaces beyond the reach of the experimental library. The integration of these approaches provides a means of simulating protein coevolution and generating protein complexes with diverse molecular recognition properties for biotechnology and synthetic biology.

Figures

Figure 1. Design and validation of protein-protein coevolution strategy.
(A) A schematic representation of the protein-protein coevolution workflow. The α-agglutinin yeast surface display system was used to display two proteins connected by a flexible linker. A 3C protease site within the linker enables cleavage, and the interacting proteins can be captured by a C-terminally bound anti-HA antibody (red). (B) Close-up view of key residues in the hydrophobic cavity of Z domain (green) and affibody ZpA963 (blue) (PDB: 2M5A). Encoded amino acids are used for two separate libraries, HL1 and HL2 (bottom). (C) On-yeast cleavage-capture assay of an interacting pair (Z+ZpA963) and a non-interacting pair (6xAla). Data are mean ± SD; n = 3 independent replicates. (D) Correlation between the on-yeast cleavage-capture assay signal and the binding affinity of Z domain-affibody dimer mutants measured by SPR. Note that the on-yeast cleavage-capture assay shows a strong semilog-linear relationship (R² = 0.8382) with binding affinity (pKD). (E) Histogram of the flow cytometric analysis. Note that HA-tag fluorescence in the library shows strong enrichment after MACS (PM) and FACS (PF) for the HL1 and HL2 libraries. (F) Sequence frequency logo of NGS data in the naïve library and after the final round of FACS. The original sequence (FLI+FIL) is derived from the Z domain (A) and ZpA963 (B) dimer. Note that the libraries converged back to the original sequences either exactly or with minimal variations. The color scheme represents hydrophobic (black), polar (green), basic (blue), acidic (red), and neutral (purple) amino acids. (G) On-yeast cleavage-capture assay of the six most frequent mutants from HL1 and HL2 NGS data. The sequences of the mutants are 1: FII+FIL, 2: FLI+FIL, 3: FII+FVL, 4: FLI+FVL, 5: FLI+FII, 6: FII+FII. Note that all six mutants show different steady-state levels of HA-tag fluorescence during 3C protease cleavage. Data are mean ± SD; n = 3 independent replicates.
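
The semilog-linear relationship in panel (D) can be illustrated with a minimal sketch: the dissociation constant is converted to pKD = -log10(KD), and the assay signal is fit against pKD by ordinary least squares. The KD and signal values below are hypothetical placeholders, not data from the paper, and the function names are illustrative only.

```python
import math

def pkd(kd_molar):
    """Convert a dissociation constant in mol/L to pKD = -log10(KD)."""
    return -math.log10(kd_molar)

def linear_fit_r2(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only (not measurements from the study).
kd_values = [1e-9, 1e-8, 1e-7, 1e-6]   # KD in mol/L
signal = [80.0, 60.0, 40.0, 20.0]      # assay readout, arbitrary units

pkds = [pkd(k) for k in kd_values]
slope, intercept, r2 = linear_fit_r2(pkds, signal)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.4f}")
```

Because the x-axis is logarithmic in KD, a straight line here means each tenfold change in KD shifts the signal by a constant amount.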
Figure 2. Engineering remodeled dimer interfaces by coevolution.
(A) Library positions on the interface (top) from the complex of Z domain (green, chain A) and ZSPA-1 (blue, chain B) (PDB: 1LP1). Encoded amino acids used for making two separate libraries, LL1 and LL2 (bottom). (B) Flow cytometry dot plots showing enrichment of HA-tag fluorescence (red squares) in the library after rounds 6 to 8 (left). Antibody-labeled yeast cells were cleaved with 3C protease for 30 min. Cells were pre-gated on c-Myc+. Histograms showing elevation of HA-tag fluorescence during selection, from round 6 (green) to round 7 (blue) and round 8 (red) (right). (C) Sequence frequency logo of NGS data in the naïve library and rounds 6, 7, and 8, revealing the appearance of consensus sequences as the selection proceeded in both the LL1 and LL2 libraries. The original sequence (QFLIK+LVIF) is derived from the Z domain (A) and ZSPA-1 (B) dimer. The color scheme represents hydrophobic (black), polar (green), basic (blue), acidic (red), and neutral (purple) amino acids. (D) On-yeast cleavage-capture assay of the mutants from the LL1 (left) and LL2 (right) libraries. The positions altered relative to the original amino acids are colored in red. Data are mean ± SD; n = 3 independent replicates.
Figure 3. Visualization and mapping of coevolutionary networks.
(A) The sequence logo of Z-B sequences paired with each Z-A sequence from the statistically enriched NGS data (p-value < 0.05) and actual binding specificity measured by on-yeast cleavage-capture assay, normalized to the highest affinity of each Z-A sequence (below). Filtered sequences accurately predicted binding specificity, matching the actual binding specificity of each Z-A sequence. (B) Sequence similarity networks (SSNs) of concatenated 8-amino-acid Z-A/Z-B library position sequences from all screening rounds (left) and round 7 (right) of the LL2 library. Notable Z-A sequences are colored and specified in the panel (right). The edit distance threshold for connecting nodes is 2 in the total library network and 1 in the round 7 network. The left SSN is colored by screening round and demonstrates connectivity among sequences from later screening rounds (rounds 5 to 7). The right SSN is colored by Z-A sequence and provides a detailed view of the enriched stage (round 7), showing cluster formation based on Z-A specificities. (C) Circos cross-reactivity plot of 100 sampled pairs from LL1 and LL2 round 7 sequence data. The Circos plots illustrate the pairwise relationships between the 100 sampled pairs of Z-A and Z-B proteins. Each pair is normalized to have equal area, providing a visual representation of the approximate cross-reactivity of each sequence. (D) A single mutational pathway of mutants from the LL2 library connecting the original sequence (QFLI/LVIF) with the prominent LL2 library mutants. Mutated positions are color-coded: red (one mutation), green (two mutations), and blue (three mutations). The number of mutations at each position is represented by a 4-digit number next to each Z-A and Z-B sequence. (E) A plot illustrating the changes in ΔΔG, ΔΔH, and −ΔTΔS for three mutants in the pathway (D) compared to the original pair (QFLI/LVIF). Mutations introduced in each step are highlighted in red. (F) A matrix showing binding specificity changes of the Z-A variants from the pathway. Binding affinities measured by on-yeast cleavage-capture assay were normalized to the highest affinity for each Z-A sequence. The single mutation introduced at each step is indicated in red. The highest-affinity pair in each column is boxed in green. The control is a mutant with all library positions mutated to alanine. Data are mean of n = 3 independent replicates.
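
The SSN construction in panel (B) can be sketched as follows: each concatenated Z-A/Z-B sequence is a node, and two nodes are joined when their edit distance is at or below a threshold. This is a minimal illustration, not the authors' pipeline. It assumes Levenshtein distance and an inclusive threshold, and it uses library-position sequences quoted in the captions of this record as example nodes.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def build_ssn(sequences, threshold):
    """Adjacency map joining sequences whose edit distance is <= threshold."""
    graph = {s: set() for s in sequences}
    for s, t in combinations(sequences, 2):
        if edit_distance(s, t) <= threshold:
            graph[s].add(t)
            graph[t].add(s)
    return graph

# Concatenated Z-A + Z-B library positions taken from the figure captions.
nodes = [
    "QFLILVIF",  # original pair QFLI/LVIF
    "LVLFFIIV",  # LL2.c3
    "VFLVIVVY",  # LL2.c17
    "LVLFFIVK",  # LL2.c7
    "IVFFFILV",  # LL2.c22
]

loose = build_ssn(nodes, threshold=2)   # threshold used for the total library network
strict = build_ssn(nodes, threshold=1)  # threshold used for the round 7 network
for node in nodes:
    print(node, sorted(loose[node]), sorted(strict[node]))
```

A lower threshold yields a sparser network, which is consistent with the tighter, specificity-based clusters described for the round 7 view.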
Figure 4. Coupling analysis and structural adaptation of coevolved variants.
(A) DCA matrix to predict inter-residue covariation of LL2 library sequences (rounds 6 and 7). The DCA scores are normalized between 0 and 1. The pairs with the highest DCA scores, 13A-9B and 17A-31B, are marked with red squares. The matrix rows represent residues from Z-A, columns represent residues from Z-B, and the elements represent the statistical dependencies between residues. Through the inverse covariance matrix analysis, the pairs 13A-9B and 17A-31B were identified as strongly interacting pairs, indicating their direct contact in the 3D structures. (B) Inter-residue contacts (left) and the relationship between DCA score and inter-residue distance measured from the original pair structure (right) (PDB: 1LP1). The dashed lines are color-coded (from purple to yellow) based on the DCA matrix in panel (A). The two highest-scoring DCA contacts (Leu 17A – Ile 31B, Phe 13A – Leu 9B) are colored in red. The overall relationship between inter-residue distance and DCA score was weak (R² = 0.0203). (C-E) Close-up views of library positions to show local side chain rearrangements. Pairs of residues at the center of the dimer interface were mutated in a compensatory manner between 13A and 9B (C) and between 17A and 31B (D). Side chain substitutions from 4 different interacting pairs are shown as sticks. (E) Library positions 9A and 32B are closely associated with proximal residues, Gln10A and Trp35B, maintaining the shape complementarity between the two proteins. In the bottom left, B chains of seven interacting pairs are aligned, with close-up views of the boxed region shown for each pair. Coupled side chains are shown as sticks with transparent spheres to indicate packing interactions.
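
As a rough illustration of an inter-chain covariation matrix like the one in panel (A), the sketch below scores every Z-A position against every Z-B position using mutual information. Note that this is a simpler, swapped-in measure and not DCA: DCA additionally inverts the covariance matrix to separate direct couplings from indirect ones. The paired sequences are hypothetical toy data, and all function names are illustrative.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two aligned residue columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi

def coupling_matrix(pairs):
    """Rows are Z-A positions, columns are Z-B positions, entries are MI."""
    len_a = len(pairs[0][0])
    len_b = len(pairs[0][1])
    cols_a = [[za[i] for za, _ in pairs] for i in range(len_a)]
    cols_b = [[zb[j] for _, zb in pairs] for j in range(len_b)]
    return [[mutual_information(ca, cb) for cb in cols_b] for ca in cols_a]

# Hypothetical paired sequences: Z-A position 0 covaries with Z-B position 0,
# while position 1 is constant in both chains.
toy_pairs = [("FA", "LA"), ("FA", "LA"), ("VA", "IA"), ("VA", "IA")]

matrix = coupling_matrix(toy_pairs)
for row in matrix:
    print([round(value, 3) for value in row])
```

In this toy set only the (0, 0) entry is nonzero, mirroring how a few strongly coupled position pairs stand out against a low background.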
Figure 5. Specificity determinants of orthogonal high-affinity mutants.
(A) The positions altered relative to the original amino acids are colored red, and positions that vary between mutants are highlighted with green boxes. (B) A table of affinities between Z-A and Z-B monomers measured by SPR. LL1.c1 and c2 are orthogonal to each other, and B-FIVF of LL1.c6 is cross-reactive with both Z-A mutants. (C) Comparison of LL1.c2 and LL1.c6 structures near position 32B shows how the single mutation M32BF induces large conformational changes by side chain rotation of Trp35B and increased hydrophobic interactions around it. Superposition of overall structures of LL1.c1, LL1.c2 and LL1.c6 (left). Close-up views of each mutant show Trp35-centered hydrophobic interactions with surrounding residues (right). Position 32 is highlighted with dashed circles. (D) A table showing amino acids in library positions of the three orthogonal LL2 mutants, LL2.c17 (VFLV/IVVY), LL2.c7 (LVLF/FIVK) and LL2.c22 (IVFF/FILV), that were selected to compare differences in their affinity and structures. (E) Binding affinities of each combination of Z-A and Z-B mutants of the three mutants. (F) Significant structural difference at the interface between LL2.c17 and the other two mutants. Superposition of overall structures (left). Close-up views of the interface (right). LL2.c17 has Phe13A as the core of a central hydrophobic patch surrounded by multiple hydrogen bonds. LL2.c7 and c22 have a Phe9B-centered hydrophobic patch composed of clustered pi-pi interactions and cation-pi interactions (F31A, K35A, F9B, and W35B).
Figure 6. Sequence space expansion using protein language model.
(A) A schematic representation of sequence space expansion through a protein language model. (B) The fraction of LL1-type sequences (the Z-A and Z-B sequences can be encoded with LL1 degenerate codon sets) in LL2 sequencing data and vice versa. Fractions of each screening round (from naïve to R8) are shown in a box plot with individual data points. A two-tailed Mann–Whitney test was used to analyze results. *** P < 0.001. (C) A schematic representation of our approach to predict dimer interactions with an expanded set of amino acids using an outer product-based convolutional neural network. (D) The classification efficiency of the LL1-trained model on the LL2 test set. (left) A violin plot representing the predicted binding score of negative (n = 2,771) and positive (n = 2,794) data. Two-tailed Mann–Whitney test. **** P < 0.0001. (middle) A ROC plot and (right) a PR plot. Note that the sequences in the test set were categorized into five groups based on the number of new amino acids compared to the LL1 sequence data, allowing an assessment of the impact of dissimilarity between the two libraries on predictions. The AUC (Area Under the ROC curve) and AP (Average Precision) values of total sequences and each subgroup are: all sequences (n = 5,565, AUC = 0.88, AP = 0.89), 0 AA (n = 508, AUC = 0.88, AP = 0.98), 1 AA (n = 1,332, AUC = 0.91, AP = 0.97), 2 AA (n = 1,509, AUC = 0.84, AP = 0.87), 3 AA (n = 1,091, AUC = 0.80, AP = 0.70), 4 and more AA (n = 1,125, AUC = 0.73, AP = 0.32). The diagonal dotted line in the ROC plot represents AUC = 0.5. (E) The predicted binding scores of LL2 sequencing data from each screening round, shown in a violin plot (n = 28–10,000). One-way ANOVA. ***P < 0.001, ****P < 0.0001. ns, not significant. (F) The correlation between the predicted binding score of LL2 sequencing data and actual %HA-tag MFI after protease cleavage. Normalized %HA-tag MFI and predicted binding score of each round were compared by Spearman’s correlation test (r = 0.9643, P = 0.0028). Data are mean ± SD; n = 3 independent replicates for HA-tag MFI measurements. (G) The correlation between predicted binding score and relative affinity of the pairs from the mutational pathway in Fig. 6A. Normalized % of max HA-tag MFI from the cleavage-capture assay and predicted binding score of each round were compared by Spearman’s correlation test (r = 0.5476, P = 0.0855). (H) Top 11 sequences by predicted binding score from LL2 NGS data. The binding of 6 out of the 11 sequences was verified by on-yeast cleavage-capture assay, and their relative binding affinities were normalized to the high-affinity LL2 pair, LL2.c3 (LVLF+FIIV). n.d. = not detectable affinity by the assay. (I) A cartoon representation depicting the expansion of sequence space from experimental LL1 data to the predicted LL2 sequence space using a protein language model and transfer learning.
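
The AUC and AP metrics reported in panel (D) can be computed from predicted binding scores and binary labels as in the minimal sketch below. It uses the rank interpretation of AUC (the probability that a randomly chosen positive outscores a randomly chosen negative, which is the quantity underlying the Mann–Whitney statistic) and a step-wise average precision. The scores and labels are hypothetical, contain no tied scores, and are not data from the paper.

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def average_precision(scores, labels):
    """Mean precision evaluated at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical predicted binding scores and true labels (1 = binder).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print(f"AUC={auc:.3f}, AP={ap:.3f}")
```

AUC is insensitive to class balance, whereas AP depends on the fraction of positives in each subgroup, so the two metrics can diverge between subgroups with different compositions.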
