Sci Rep. 2018 Nov 1;8(1):16189.
doi: 10.1038/s41598-018-34533-1.

Design of metalloproteins and novel protein folds using variational autoencoders

Joe G Greener et al. Sci Rep. 2018.

Abstract

The design of novel proteins has many applications but remains an attritional process with success in isolated cases. Meanwhile, deep learning technologies have exploded in popularity in recent years and are increasingly applicable to biology due to the rise in available data. We attempt to link protein design and deep learning by using variational autoencoders to generate protein sequences conditioned on desired properties. Potential copper and calcium binding sites are added to non-metal binding proteins without human intervention and compared to a hidden Markov model. In another use case, a grammar of protein structures is developed and used to produce sequences for a novel protein topology. One candidate structure is found to be stable by molecular dynamics simulation. The ability of our model to confine the vast search space of protein sequences and to scale easily has the potential to assist in a variety of protein design tasks.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
The Taylor grammar of protein structures shown for a reductase-related bacterial protein (PDB ID 2CU6). The orientation of the main secondary structural elements is examined in order to assign a topology string. See Taylor 2002 for a full description of the grammar.
Figure 2
The inference/generator (encoder/decoder) structure of the VAE. The data x and the conditioned attribute a are concatenated and passed through the inference model to produce the latent code z. The attribute is then concatenated to the sampled z (denoted by a dashed line), and the result is passed through the generator to produce the reconstructed sequence x̂.
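The data flow described in the Figure 2 caption can be sketched as a single forward pass through a conditional VAE. This is a minimal NumPy illustration, not the authors' implementation: the layer sizes, the single hidden layer, the tanh activation, the random untrained weights and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
SEQ_LEN, N_AA, N_ATTR, N_HIDDEN, N_LATENT = 20, 21, 3, 32, 4

def dense(n_in, n_out):
    """Randomly initialised (untrained) weights and bias for one linear layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W_enc, b_enc = dense(SEQ_LEN * N_AA + N_ATTR, N_HIDDEN)
W_mu, b_mu = dense(N_HIDDEN, N_LATENT)
W_logvar, b_logvar = dense(N_HIDDEN, N_LATENT)
W_dec, b_dec = dense(N_LATENT + N_ATTR, N_HIDDEN)
W_out, b_out = dense(N_HIDDEN, SEQ_LEN * N_AA)

def encode(x, a):
    """Inference model: concatenate x and a, return mean and log-variance of z."""
    h = np.tanh(np.concatenate([x.reshape(-1), a]) @ W_enc + b_enc)
    return h @ W_mu + b_mu, h @ W_logvar + b_logvar

def sample_z(mu, logvar):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def decode(z, a):
    """Generator: concatenate z and a, return per-position residue probabilities."""
    h = np.tanh(np.concatenate([z, a]) @ W_dec + b_dec)
    return softmax((h @ W_out + b_out).reshape(SEQ_LEN, N_AA))

# A random one-hot sequence x and a made-up attribute vector a.
x = np.eye(N_AA)[rng.integers(0, N_AA, size=SEQ_LEN)]
a = np.array([1.0, 0.0, 0.0])

mu, logvar = encode(x, a)
z = sample_z(mu, logvar)
x_hat = decode(z, a)  # shape (SEQ_LEN, N_AA), each row sums to 1
```

Because the attribute a is supplied to both the inference model and the generator, the same latent code z can be decoded under a different attribute vector, which is the mechanism that conditioning on a desired property relies on.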
Figure 3
Graphic showing how the latent space was explored through iterative sampling and analysis of protein sequences.
Figure 4
Example protein sequences generated by the model. (A) A sequence can be given as input, in which case similar sequences are returned by encoding and decoding the input sequence. (B) A topology string can be given as input, in which case sequences are generated that the VAE believes match the topology.
Figure 5
Automated addition of potential metal binding sites to two proteins. The sequence of the SWIB domain of human MDM2 (PDB ID 3LBL) and a sequence generated by the model that is predicted to have high copper-binding character are shown. Highlighted in purple are the residues that form the potential copper binding sites. The sites are shown on the structure of the generated sequence using 3LBL as a template, and compared to 3LBL. The same is shown for a potential site on the link module of human TSG-6 (PDB ID 1O7B) when calcium binding is requested.
Figure 6
Structures used to explore a novel fold. The structure and sequence of 2CU6 are shown. Remodelling the loops at the locations of the blue circles with ModLoop generates a modified structure with a novel topology. This structure is used as a backbone template to select structures from a pool of ab initio Rosetta structures generated from sequences output by the CVAE. The closest structure to the template is shown.
Figure 7
Analysis of MD runs of the generated structure shown in Fig. 6. Three runs of 200 ns are shown in blue, orange and green. (A) Backbone RMSD of trajectory structures to the energy-minimized starting structure. (B) Backbone radius of gyration (Rg) of trajectory structures.
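The two quantities plotted in Figure 7 can be computed from backbone coordinates as sketched below. This is an illustrative NumPy version, not the analysis pipeline used in the paper: it assumes each frame is superposed on the reference with the Kabsch algorithm before the RMSD is taken and that Rg is unweighted unless masses are supplied. The function names and the synthetic coordinates are hypothetical.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Root-mean-square distance of atoms from their (mass-weighted) centre."""
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))  # assumption: equal weights by default
    masses = np.asarray(masses, dtype=float)
    centre = np.average(coords, axis=0, weights=masses)
    sq_dist = ((coords - centre) ** 2).sum(axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def kabsch_rmsd(mobile, reference):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(mobile, dtype=float)
    Q = np.asarray(reference, dtype=float)
    P = P - P.mean(axis=0)  # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)  # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation matrix
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum(axis=1).mean()))

# Synthetic check: a rotated and translated copy should have RMSD close to 0.
rng = np.random.default_rng(1)
reference = rng.normal(size=(10, 3))
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
mobile = reference @ Rz.T + np.array([5.0, -2.0, 1.0])
rmsd_value = kabsch_rmsd(mobile, reference)

# Four points at unit distance from the origin have Rg equal to 1.
square = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
rg_value = radius_of_gyration(square)
```

Applied to every frame of a trajectory against the energy-minimized starting structure, these two functions would give time series of the kind shown in panels A and B.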
Figure 8
Separation by structural properties in the latent space when 2 latent dimensions are used in the model. The axes are the 2 latent dimensions and each point is the encoded representation in the 2 dimensions of one input sequence. Clusters generally correspond to the homologues collected for each sequence. (A) Each sequence is coloured by CATH class according to the colours shown. (B) Sequences for one CATH architecture, ‘mainly beta single sheet’ (CATH ID 2.20), are highlighted in red.
