PLoS Comput Biol. 2021 Feb 26;17(2):e1008736.

doi: 10.1371/journal.pcbi.1008736. eCollection 2021 Feb.

Generating functional protein variants with variational autoencoders

Alex Hawkins-Hooker¹, Florence Depardieu¹, Sebastien Baur¹, Guillaume Couairon¹, Arthur Chen¹, David Bikard¹

PMID: 33635868
PMCID: PMC7946179
DOI: 10.1371/journal.pcbi.1008736

Alex Hawkins-Hooker et al. PLoS Comput Biol. 2021.

Affiliation

¹ Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France.

Abstract

The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Schematic representation of the input representation and VAE models used in the study.**
Models take as input either raw or aligned sequences. In the latter case, the inputs correspond to the rows of an MSA of the luciferase family. Only columns of the MSA corresponding to positions (highlighted in red) present in the target protein (marked with *) are retained. In both cases, the sequences are one-hot encoded before being fed into the model. Different architectures were used depending on the type of sequence input. The model developed to work with aligned sequences (MSA VAE) used fully-connected feed-forward networks in both the encoder and the decoder. The model developed to work with raw sequences (AR-VAE) comprised a CNN encoder and a decoder which combined upsampling with autoregression. The decoder sequentially outputs predictions for the identity of the amino acid at each point in the sequence, conditioned on the upsampled latent representation together with the previous amino acids in either the input sequence (during training, blue arrow) or the generated sequence (when being used generatively, red arrow).
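The input preparation described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 21-symbol alphabet (20 amino acids plus a gap) and the toy alignment are all assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"
ALPHABET = AMINO_ACIDS + GAP  # assumed alphabet: 20 residues plus a gap symbol


def retain_target_columns(msa_rows, target_index):
    """Keep only the MSA columns where the target row has a residue, not a gap."""
    target = msa_rows[target_index]
    keep = [i for i, ch in enumerate(target) if ch != GAP]
    return ["".join(row[i] for i in keep) for row in msa_rows]


def one_hot(sequence, alphabet=ALPHABET):
    """Encode a sequence as a list of one-hot rows, one row per position."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = []
    for ch in sequence:
        row = [0] * len(alphabet)
        row[index[ch]] = 1
        encoded.append(row)
    return encoded


# Toy alignment with made-up sequences; row 0 plays the role of the target protein.
toy_msa = ["MK-LV", "MRALV", "-KAL-"]
trimmed = retain_target_columns(toy_msa, target_index=0)
encoded = one_hot(trimmed[2])
```

Dropping the columns that are gaps in the target gives every aligned input the same length as the target protein, which is what a fully-connected encoder and decoder need. Raw sequences would be one-hot encoded in the same way but without the column filtering.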

**Fig 2. Amino acid representations learnt by VAE models capture biochemical properties.**
Left: pairwise cosine similarities between amino-acid output embeddings from an AR-VAE model trained on unaligned sequences correlate with amino acid substitution scores in the BLOSUM 62 substitution matrix (Spearman ρ = 0.423, n = 190); right: projection of AR-VAE output embedding weights onto first two principal components groups embeddings corresponding to biochemically related amino acids.
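The comparison in the left panel can be sketched as below, assuming one embedding vector per amino acid. The helper names are hypothetical, and no real embeddings or BLOSUM 62 values are reproduced here. The reported n = 190 is consistent with the number of unordered pairs of the 20 standard amino acids.

```python
import math
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair of distinct amino acids."""
    return {
        (a, b): cosine_similarity(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


pair_count = len(list(combinations(AMINO_ACIDS, 2)))
```

Given a dictionary of embeddings and a matching dictionary of substitution scores keyed by the same pairs, `spearman` would be applied to the two lists of values taken in the same pair order.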

**Fig 3. Organization of latent space reflects functional groupings.**
Visualisation of the latent representation of validation set sequences for MSA VAE (left) and AR-VAE (right), projected onto the first two principal components and coloured by sub-family annotation derived from InterPro. Only sequences belonging to one of the 9 largest sub-families are shown.

**Fig 4. Statistics computed from alignments of generated sequences to natural sequences from the training set.**
Similarity of statistics between generated and natural sequences reflects the ability of models to capture important types of sequence variation. Single-site amino acid frequencies (left) capture patterns of residue conservation at each position in the alignment, while co-occurrence frequencies (centre) and covariances (right) between amino acid identities at different pairs of positions reflect patterns of evolutionary covariation which may indicate structural or functional constraints. Sequences were generated by sampling from the prior of the VAE models. For MSA VAE the resulting sequences were already aligned; for the raw sequences generated by AR-VAE, a new MSA was first constructed by running Clustal Omega on the set of sequences sampled from the model together with the natural sequences in the training set, using the bacterial luciferase family PFAM profile HMM as an External Profile Alignment, following which statistics for generated and natural sequences were computed from the corresponding subsets of the alignment. As a baseline we also report results for statistics generated by the profile HMM from PFAM. In this case the training set statistics were computed from the alignment of the training sequences to the profile HMM.
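The three statistics named in this caption can be sketched for a small alignment as follows. This assumes unweighted sequences and the usual covariance definition C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b). The function names and toy alignment are illustrative, and the paper may apply sequence reweighting or other details not stated here.

```python
from collections import Counter
from itertools import combinations


def single_site_frequencies(alignment):
    """f_i(a): fraction of sequences with symbol a at column i."""
    n = len(alignment)
    length = len(alignment[0])
    return [
        {a: c / n for a, c in Counter(seq[i] for seq in alignment).items()}
        for i in range(length)
    ]


def pair_frequencies(alignment):
    """f_ij(a, b): fraction of sequences with a at column i and b at column j."""
    n = len(alignment)
    length = len(alignment[0])
    return {
        (i, j): {
            ab: c / n
            for ab, c in Counter((seq[i], seq[j]) for seq in alignment).items()
        }
        for i, j in combinations(range(length), 2)
    }


def covariances(alignment):
    """C_ij(a, b) = f_ij(a, b) - f_i(a) * f_j(b), for observed symbol pairs only."""
    f1 = single_site_frequencies(alignment)
    f2 = pair_frequencies(alignment)
    return {
        (i, j): {(a, b): f - f1[i][a] * f1[j][b] for (a, b), f in table.items()}
        for (i, j), table in f2.items()
    }


# Toy alignment with made-up, perfectly covarying columns.
toy_alignment = ["AC", "AC", "GT", "GT"]
toy_freqs = single_site_frequencies(toy_alignment)
toy_cov = covariances(toy_alignment)
```

Computing the same statistics on generated and on natural sequences, then comparing them entry by entry, gives the kind of agreement the figure plots. Symbol pairs that never co-occur are omitted from the dictionaries for brevity, although their covariance is generally nonzero.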

**Fig 5. Comparison of inter-residue couplings inferred from generated sequences to contacts in the 3D structure of a *luxA* protein.**
Left: contact map of *luxA*, showing 1000 closest contacts separated by at least 4 sequence positions. Centre and right: top 1000 couplings inferred from sequences generated by MSA VAE and AR-VAE respectively, coloured by distance between residues in *luxA* 3D structure. Couplings were predicted using CCMPred on samples of 3000 sequences, and only couplings between residues separated by at least 4 sequence positions were shown. The patterns of inferred couplings reflect the dependencies captured by the models: while MSA VAE captures realistic dependencies between positions at a range of distances, the sequences generated by AR-VAE exhibit a bias towards local dependencies.
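The selection rule used in all three panels (rank residue pairs, keep only those separated by at least 4 sequence positions, take the top k) can be sketched as below. The toy distances and coupling scores are invented, and the overlap fraction is an illustrative summary rather than a statistic reported in the caption.

```python
def top_pairs(scores, k, min_separation=4, largest_first=True):
    """Return the k best-scoring (i, j) pairs with |i - j| >= min_separation."""
    eligible = [
        (pair, value)
        for pair, value in scores.items()
        if abs(pair[0] - pair[1]) >= min_separation
    ]
    eligible.sort(key=lambda item: item[1], reverse=largest_first)
    return [pair for pair, _ in eligible[:k]]


def overlap_fraction(predicted, reference):
    """Fraction of predicted pairs that also appear in the reference set."""
    reference_set = set(reference)
    return sum(1 for pair in predicted if pair in reference_set) / len(predicted)


# Made-up inter-residue distances and coupling scores for an 8-residue toy protein.
toy_distances = {(0, 4): 5.0, (0, 5): 12.0, (1, 6): 6.0, (2, 3): 3.8, (2, 7): 20.0}
toy_couplings = {(0, 4): 0.9, (0, 5): 0.1, (1, 6): 0.2, (2, 3): 0.95, (2, 7): 0.8}

contacts = top_pairs(toy_distances, k=2, largest_first=False)  # closest pairs
couplings = top_pairs(toy_couplings, k=2, largest_first=True)  # strongest couplings
shared = overlap_fraction(couplings, contacts)
```

In the toy data the pair (2, 3) has both the shortest distance and the strongest coupling, but it is excluded because its residues are only one position apart. This shows why the separation filter matters when judging whether a model captures long-range rather than local dependencies.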

**Fig 6. Luminescence measurements for synthesised protein sequences generated from latent vectors sampled from the neighbourhood of the encoding of the *P. luminescens* *luxA* sequence.**
Left: luminescence of sequences generated by VAE models trained on raw (AR-VAE) or aligned (MSA VAE) sequences from the family of luciferase-like proteins (mean across fifteen replicates, error bars represent standard deviation). Wild-type sequence luminescence is displayed as a dashed green line. The dashed grey line represents the detection threshold, conservatively set to twice the mean untransformed luminescence of a strain lacking *luxA*. Distance is computed as the number of substitutions and indels relative to wild type. The MSA VAE model was able to generate functional sequences with large numbers of differences to wild type, whereas the AR-VAE model seemed to introduce deleterious mutations more rapidly. Centre and right: measurements of both solubility and luminescence for sequences generated by VAE models conditioned on predicted solubility level show that conditional models can be used to engineer increased-solubility variants of a *luxA* sequence while preserving function. Solubility is reported as the ratio of the amount of protein present in the supernatant to the total amount in both supernatant and pellet of lysed *E. coli* cells over-expressing the protein, as measured by a dot blot assay (mean of four technical replicates, error bars represent standard deviation).
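The three quantities defined in this caption reduce to simple arithmetic, sketched below with invented numbers. The function names are hypothetical. The distance is computed here as a Levenshtein edit distance, which is one way to count substitutions and indels; the authors may instead have counted differences from an alignment.

```python
def detection_threshold(background_readings):
    """Twice the mean raw luminescence of the strain lacking luxA."""
    return 2 * sum(background_readings) / len(background_readings)


def solubility_ratio(supernatant, pellet):
    """Protein in the supernatant divided by total protein (supernatant + pellet)."""
    return supernatant / (supernatant + pellet)


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


# Made-up readings and sequences, for illustration only.
threshold = detection_threshold([100.0, 120.0, 80.0])
ratio = solubility_ratio(supernatant=3.0, pellet=1.0)
distance = edit_distance("MKLV", "MRLVA")
```

A variant would count as retaining measurable luminescence when its reading exceeds `threshold`. The toy distance of 2 comes from one substitution plus one insertion.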

**Fig 7. Computational analysis of variants generated by conditional VAE models conditioned on predicted solubility level.**
Left: distribution of predicted solubilities of sequences generated when conditioning on each of three solubility levels (median and upper and lower quartiles indicated with horizontal lines); centre: difference in amino acid composition percentages between generated variants at highest solubility level and original P19839 *luxA* sequence, including values for combined amino acid features used in protein-sol prediction algorithm; right: distribution of charge of *luxA* variants generated by conditioning on high (top) and medium (bottom) solubility levels. For comparison, the charge of the original P19839 *luxA* sequence is shown as a dashed line, the average charge for high solubility sequences in the training set is shown as a solid green line, and the average charge for medium solubility sequences in the training set is shown as a solid red line.

Cited by

Funneling modulatory peptide design with generative models: Discovery and characterization of disruptors of calcineurin protein-protein interactions.
Tubiana J, Adriana-Lifshits L, Nissan M, Gabay M, Sher I, Sova M, Wolfson HJ, Gal M. PLoS Comput Biol. 2023 Feb 2;19(2):e1010874. doi: 10.1371/journal.pcbi.1010874. eCollection 2023 Feb. PMID: 36730443. Free PMC article.
Therapeutic enzyme engineering using a generative neural network.
Giessel A, Dousis A, Ravichandran K, Smith K, Sur S, McFadyen I, Zheng W, Licht S. Sci Rep. 2022 Jan 27;12(1):1536. doi: 10.1038/s41598-022-05195-x. PMID: 35087131. Free PMC article.
Engineering Dehalogenase Enzymes Using Variational Autoencoder-Generated Latent Spaces and Microfluidics.
Kohout P, Vasina M, Majerova M, Novakova V, Damborsky J, Bednar D, Marek M, Prokop Z, Mazurenko S. JACS Au. 2025 Feb 13;5(2):838-850. doi: 10.1021/jacsau.4c01101. eCollection 2025 Feb 24. PMID: 40017771. Free PMC article.
Bayesian estimation of muscle mechanisms and therapeutic targets using variational autoencoders.
Tune T, Kooiker KB, Davis J, Daniel T, Moussavi-Harami F. Biophys J. 2025 Jan 7;124(1):179-191. doi: 10.1016/j.bpj.2024.11.3310. Epub 2024 Nov 26. PMID: 39604261. Free PMC article.
AMPGAN v2: Machine Learning-Guided Design of Antimicrobial Peptides.
Van Oort CM, Ferrell JB, Remington JM, Wshah S, Li J. J Chem Inf Model. 2021 May 24;61(5):2198-2207. doi: 10.1021/acs.jcim.0c01441. Epub 2021 Mar 31. PMID: 33787250. Free PMC article.
