Science. 2022 Oct 7;378(6615):49-56.
doi: 10.1126/science.add2187. Epub 2022 Sep 15.

Robust deep learning-based protein sequence design using ProteinMPNN

J Dauparas et al. Science.

Abstract

Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning-based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo-electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.

Conflict of interest statement

Competing interests: Authors declare that they have no competing interests.

Figures

Fig. 1. ProteinMPNN architecture.
(A) Distances between N, Cα, C, O, and virtual Cβ atoms are encoded and processed using a message passing neural network (Encoder) to obtain graph node and edge features. The encoded features, together with a partial sequence, are used to generate amino acids iteratively in a random decoding order. (B) A fixed left-to-right decoding cannot use sequence context (green) for preceding positions (yellow), whereas a model trained with random decoding orders can be used with an arbitrary decoding order at inference time. The decoding order can be chosen such that the fixed context is decoded first. (C) Residue positions within and between chains can be tied together, enabling symmetric, repeat protein, and multistate design. In this example, a homo-trimer is designed with coupling of positions in different chains. Predicted logits for tied positions are averaged to get a single probability distribution from which amino acids are sampled.
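The tying and decoding-order ideas in panels B and C can be illustrated with a small sketch. This is not the ProteinMPNN implementation: it assumes per-position logits are already available as plain lists of 20 floats, and the function names (`softmax`, `average_tied_logits`, `sample_tied`, `decoding_order`) and the `temperature` parameter are hypothetical choices made for illustration. It averages the logits of tied positions, turns the average into one probability distribution, samples a single amino acid for the whole tied group, and builds a random decoding order that visits the fixed context first.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_tied_logits(logits_by_position, tied_group):
    """Average the logit vectors of all positions in one tied group."""
    n = len(tied_group)
    return [
        sum(logits_by_position[pos][k] for pos in tied_group) / n
        for k in range(len(AMINO_ACIDS))
    ]


def sample_tied(logits_by_position, tied_group, temperature=1.0, rng=random):
    """Sample one amino acid for the tied group and assign it to every member."""
    probs = softmax(average_tied_logits(logits_by_position, tied_group), temperature)
    residue = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
    return {pos: residue for pos in tied_group}


def decoding_order(n_positions, fixed_positions, rng=random):
    """Random decoding order in which the fixed context is visited first."""
    fixed = [p for p in range(n_positions) if p in fixed_positions]
    free = [p for p in range(n_positions) if p not in fixed_positions]
    rng.shuffle(free)
    return fixed + free
```

For a homo-trimer, a tied group could be written as `[("A", 0), ("B", 0), ("C", 0)]`, that is, the first residue of each chain, with `logits_by_position` keyed by the same `(chain, index)` tuples (an assumed layout, not one taken from the source).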
Fig. 2. In silico evaluation of ProteinMPNN.
(A) ProteinMPNN has higher native sequence recovery than Rosetta. The average Cβ distance of the 8 closest neighbors (x axis) reports on burial, with the most buried positions on the left and more exposed positions on the right; ProteinMPNN outperforms Rosetta at all levels of burial. Average sequence recovery for ProteinMPNN was 52.4%, compared to 32.9% for Rosetta. (B) ProteinMPNN has similarly high sequence recovery for monomers, homo-oligomers, and hetero-oligomers; violin plots are for 690 monomers, 732 homomers, and 98 heteromers. (C) Sequence recovery (black) and relative AlphaFold success rates (blue) as a function of training noise level. For higher accuracy predictions (circles), smaller amounts of noise are optimal (1.0 corresponds to 1.8% success rate), while to maximize prediction success at a lower accuracy cutoff (squares), models trained with more noise are better (1.0 corresponds to 6.7% success rate). (D) Sequence recovery and diversity as a function of sampling temperature. (E) Redesign of native protein backbones with ProteinMPNN considerably increases AlphaFold prediction accuracy compared to the original native sequence when no multiple sequence information is used. Single sequences (designed or native) were input in both cases. (F) ProteinMPNN redesign of previous Rosetta designed NTF2 fold proteins (3,000 backbones in total) results in considerably improved AlphaFold single sequence prediction accuracy.
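The two quantities plotted in panel A can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes sequences are given as equal-length one-letter strings and C-beta coordinates as a list of (x, y, z) tuples in ångströms, and the function names `sequence_recovery` and `mean_neighbor_distance` are hypothetical.

```python
import math


def sequence_recovery(native, designed):
    """Fraction of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


def mean_neighbor_distance(cb_coords, index, k=8):
    """Average distance from one C-beta atom to its k closest C-beta neighbors.

    Smaller values indicate a more buried position.
    """
    center = cb_coords[index]
    distances = sorted(
        math.dist(center, other)
        for j, other in enumerate(cb_coords)
        if j != index
    )
    if len(distances) < k:
        raise ValueError("need at least k other residues")
    return sum(distances[:k]) / k
```

For example, `sequence_recovery("MKTAYIAK", "MKSAYLAK")` returns 0.75, because 6 of the 8 positions match.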
Fig. 3. Structural characterization of ProteinMPNN designs.
(A) Comparison of soluble protein expression over a set of AlphaFold hallucinated monomers and homo-oligomers (blue) and the same set of backbones with sequences designed using ProteinMPNN (orange), N=129. The total soluble protein yield following expression in E. coli, obtained from the integrated area under size exclusion traces of nickel-NTA purified proteins, increases considerably after ProteinMPNN rescue relative to the barely soluble original sequences (median yields per 1 L of culture equivalent: 9 mg and 247 mg, respectively). (B), (C), (D) In-depth characterization of a monomer hallucination and the corresponding ProteinMPNN rescue from the set in A. Like almost all of the designs in A, the design model has very low sequence and structural similarity to the PDB (E-value=2.8 against UniRef100 using HHblits, TM-score=0.56 against PDB). (B) The ProteinMPNN rescued design has high thermostability, with a virtually unchanged circular dichroism profile at 95 °C compared to 25 °C. (C) Size exclusion (SEC) profile of the failed original design overlaid with that of the ProteinMPNN sequence design, which has a clear monodisperse peak at the expected retention volume. (D) The crystal structure of the ProteinMPNN design (8CYK) is nearly identical to the design model (2.35 Å RMSD over 130 residues); see Figure S5 for additional information. The right panel shows model side chains in the electron density, with crystal side chains in green and AlphaFold side chains in blue. (E), (F) ProteinMPNN rescue of a Rosetta design made from a perfectly repeating structural and sequence unit (DHR82). Residues at corresponding positions in the repeat unit were tied during ProteinMPNN sequence inference. (E) Backbone design model and AlphaFold model of the MPNN redesigned sequence, with tied residues indicated by lines (~1.2 Å error over 232 residues). (F) SEC profile of the IMAC purified original Rosetta design and two ProteinMPNN redesigns.
(G), (H) Tying residues during ProteinMPNN sequence inference both within and between chains to enforce both repeat protein and cyclic symmetries. (G) Side view of the design model. A set of tied residues is shown in red. (H) Top-down view of the design model. (I) Negative stain electron micrograph of the purified design. (J) Class average of images from I closely matches the top-down view in H. (K) Rescue of the failed two-component Rosetta tetrahedral nanoparticle design T33–27 (13) by ProteinMPNN interface design. Following ProteinMPNN rescue, the nanoparticle assembled readily with high yield, and the crystal structure (grey) is very nearly identical to the design model (green/purple) (backbone RMSD of 1.2 Å over two complete asymmetric units forming the ProteinMPNN rescued interface).
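The RMSD values quoted in this caption measure the average deviation between matched atoms of two structures. A minimal sketch of the calculation is given below. It assumes the two coordinate sets are already superimposed and listed in the same atom order; the caption does not say how the superposition was done, so no fitting step is included, and the function name `rmsd` is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is performed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

With coordinates in ångströms the result is also in ångströms; two identical coordinate sets give 0.0.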
Fig. 4. Design of protein function with ProteinMPNN.
(A) Design scheme. First panel: structure (PDB 2W0Z) of the peptide APPPRPPKP bound to the human Grb2 C-term SH3 domain (peptide in green, target shown as a blue surface). Second panel: helical bundle scaffolds were docked to the exposed face of the peptide using RIFDOCK (19), and Rosetta remodel was used to build loops connecting the peptide to the scaffolds. Rosetta sequence design with layer design task operations was used to optimize the sequence of the fusion (cyan) for stability, rigidity of the peptide-helical bundle interface, and binding affinity for the Grb2 SH3 domain. Third panel: ProteinMPNN redesign (orange) of the designed binder sequence; hydrogen bonds involving asparagine side chains between the peptide and base scaffold are shown in green and in the inset. Fourth panel: mutation of the two asparagines to aspartates to disrupt the scaffolding of the target peptide. (B) Experimental characterization of binding using biolayer interferometry. Biotinylated C-term SH3 domain from human Grb2 was loaded onto Streptavidin (SA) Biosensors, which were then immersed in solutions containing varying concentrations of the target peptide (left) or the designs (right panels), and then transferred to buffer lacking added protein for dissociation measurements. The MPNN design (3rd panel from the left) has a much greater binding signal than the original Rosetta design (2nd panel from the left); this signal is greatly reduced by the asparagine to aspartate mutations (last panel).

References

    1. Ingraham J, Garg V, Barzilay R, & Jaakkola T (2019). Generative models for graph-based protein design. Advances in Neural Information Processing Systems, 32.
    2. Zhang Y, Chen Y, Wang C, Lo CC, Liu X, Wu W, & Zhang J (2020). ProDCoNN: Protein design using a convolutional neural network. Proteins: Structure, Function, and Bioinformatics, 88(7), 819–829.
    3. Qi Y, & Zhang JZ (2020). DenseCPD: improving the accuracy of neural-network-based computational protein sequence design with DenseNet. Journal of Chemical Information and Modeling, 60(3), 1245–1252.
    4. Jing B, Eismann S, Suriana P, Townshend RJL, & Dror R (2020, September). Learning from Protein Structure with Geometric Vector Perceptrons. In International Conference on Learning Representations.
    5. Strokach A, Becerra D, Corbi-Verge C, Perez-Riba A, & Kim PM (2020). Fast and flexible protein design using deep graph neural networks. Cell Systems, 11(4), 402–411.