Nat Methods. 2019 Dec;16(12):1315-1322. doi: 10.1038/s41592-019-0598-1. Epub 2019 Oct 21.

Unified rational protein engineering with sequence-based deep representation learning


Ethan C Alley et al. Nat Methods. 2019 Dec.

Abstract

Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.


Figures

Fig. 1 | Workflow to learn and apply deep protein representations.
a, The UniRep model was trained on 24 million UniRef50 primary amino-acid sequences. The model was trained to perform next-amino-acid prediction (minimizing cross-entropy loss) and, in so doing, was forced to learn how to internally represent proteins. b, During application, the trained model is used to generate a single fixed-length vector representation of the input sequence by globally averaging intermediate mLSTM numerical summaries (the hidden states). A top model (for example, a sparse linear regression or random forest) trained on top of this representation, which acts as a featurization of the input sequence, enables supervised learning on diverse protein informatics tasks.
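As a concrete illustration of the application step in b, the short Python sketch below averages per-residue hidden states into a fixed-length vector and fits a sparse linear top model. It is a minimal sketch only: get_hidden_states is a hypothetical callable standing in for a pretrained mLSTM that returns an (L, D) array of hidden states for an L-residue sequence; it is not the authors’ released API.

import numpy as np
from sklearn.linear_model import LassoCV

def unirep_featurize(sequence, get_hidden_states):
    # Fixed-length representation: global average of the per-residue
    # mLSTM hidden states (an (L, D) array) for the input sequence.
    hidden = np.asarray(get_hidden_states(sequence))  # shape (L, D)
    return hidden.mean(axis=0)                        # shape (D,)

def fit_top_model(sequences, labels, get_hidden_states):
    # Sparse (L1-regularized) linear top model trained on the representations;
    # a random forest regressor could be swapped in without changing the features.
    X = np.stack([unirep_featurize(s, get_hidden_states) for s in sequences])
    return LassoCV(cv=5).fit(X, np.asarray(labels))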
Fig. 2 | UniRep encodes amino-acid physicochemistry, organism-level information, secondary structure, evolutionary and functional information, and higher-order structural features.
a, PCA of amino-acid embeddings learned by UniRep (n = 20 amino acids). b, t-SNE of the proteome-average UniRep vector of model organisms (n = 53 organism proteomes, Supplementary Table 1). c, Low-dimensional t-SNE visualization of UniRep-represented sequences from SCOP, colored by ground-truth structural classes assigned after crystallization (n = 28,025 SCOP proteins). a/b, alphas and betas. d, Agglomerative distance-based clustering of UniRep, a Doc2Vec representation method from Yang et al., a deep structural method (RGN) from AlQuraishi, Levenshtein (global sequence alignment) distance and the best of a suite of machine learning baselines (Methods). Scores show how well each approach reconstitutes expert-labeled family groupings from OXBench and HOMSTRAD. All metrics vary between 0 and 1, with 0 being a random assignment and 1 being a perfect clustering (Methods). e, Activation pattern of the helix-sheet secondary structure neuron, colored on the structure of the Lac repressor LacI (PDB 2PE5, right). f, Average helix-sheet neuron activation (as visualized in e) as a function of relative position along a secondary structure unit (Methods).
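The clustering comparison in d can be outlined in a few lines. The sketch below is an approximation that uses scikit-learn’s agglomerative clustering and normalized mutual information as a stand-in for the exact 0-to-1 metrics defined in Methods, with rep_vectors standing for any one of the compared representations.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import normalized_mutual_info_score

def family_clustering_score(rep_vectors, family_labels):
    # Cluster representation vectors by distance and score how well the
    # clusters recover expert-labeled family groupings (OXBench/HOMSTRAD style):
    # values near 0 indicate an uninformative clustering, 1 a perfect one.
    n_families = len(set(family_labels))
    clustering = AgglomerativeClustering(n_clusters=n_families, linkage="average")
    predicted = clustering.fit_predict(np.asarray(rep_vectors))
    return normalized_mutual_info_score(family_labels, predicted)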
Fig. 3 | UniRep predicts structural and functional properties of proteins.
a, Spearman correlation with the true measured stability rankings of de novo designed mini proteins for UniRep Fusion-based model predictions and two alternative approaches: negative Rosetta total energy and buried nonpolar surface area (NPSA; Methods). UniRep Fusion outperforms both alternatives (P < 0.001, Welch’s two-tailed t-test on n = 30 bootstrap replicates). b, UniRep performance compared to a suite of baselines across 17 proteins in the DMS stability prediction task (Pearson’s r). UniRep Fusion achieved significantly higher Pearson’s r on all subsets (P < 0.006, Welch’s two-tailed t-test on n = 30 bootstrap replicates). c, Average magnitude of top-model regression coefficients for de novo designed and natural protein stability prediction shows significant co-activation (P < 0.01, permutation test). d, Activations of the helix-sheet neuron colored onto the de novo designed protein HHH_0142 (PDB 5UOI). e, UniRep Fusion achieves statistically lower mean squared error than a suite of baselines across a set of eight proteins with nine diverse functions in the DMS function prediction task (P < 0.009 on Pab1 and P < 0.0002 on all other tasks, Welch’s two-tailed t-test on n = 30 bootstrap replicates).
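The statistical comparisons in a, b and e follow a common recipe: score each method on bootstrap resamples of the test set, then compare the score distributions with Welch’s two-tailed t-test. The sketch below illustrates this for the Spearman case; the resampling over the test set is an assumption here, and the paper’s exact protocol is given in Methods.

import numpy as np
from scipy.stats import spearmanr, ttest_ind

def bootstrap_spearman(y_true, y_pred, n_boot=30, seed=0):
    # Spearman correlation computed on n_boot bootstrap resamples of the test set.
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        rho, _ = spearmanr(y_true[idx], y_pred[idx])
        scores.append(rho)
    return np.array(scores)

def compare_methods(y_true, y_pred_a, y_pred_b):
    # Welch's two-tailed t-test between the two bootstrap score distributions.
    scores_a = bootstrap_spearman(y_true, y_pred_a, seed=0)
    scores_b = bootstrap_spearman(y_true, y_pred_b, seed=1)
    return ttest_ind(scores_a, scores_b, equal_var=False)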
Fig. 4 | UniRep, fine-tuned to a local evolutionary context, facilitates protein engineering by enabling generalization to distant peaks in the sequence landscape.
a, UniRep trades off nuance for universality in a theoretical protein engineering task. By unsupervised training on a subspace of sequences related to the engineering target (‘evotuning’), UniRep representations are honed to the task at hand. b, Predicted brightness of 27 homologs and engineered variants of avGFP under various representations + sparse linear regression models trained only on local avGFP data. Box and whisker plots indicate the predicted distribution of dark negative controls (n = 32,400; center line, median; box limits, upper and lower quartiles; whiskers, full data range). The green region above the dotted line is predicted bright; the region below is predicted dark. On the left in gray is the training distribution from local mutants of avGFP. c, Predicted brightness versus mutation curves for each of the 27 avGFP homologs and engineered variants (the generalization set). Each gray line depicts the average predicted brightness of one of the 27 generalization set members as an increasing number of random mutations is introduced. The red line shows the average empirical brightness versus mutation curve for avGFP. The average Pearson correlation between predicted and empirical brightness curves is also shown. d, Recall versus sequence testing budget curves for each representation + sparse linear regression top model (bottom). Efficiency gain over random sampling (top) is depicted as the ratio of a method’s recall to the recall of the null model as a function of testing budget. e, Maximum brightness observed versus sequence testing budget curves for each representation + sparse linear regression top model (bottom). Efficiency gain over random sampling (top) is defined analogously to that for recall, but with normalized maximum brightness. Error bands depict ±1 standard deviation calculated over n = 100 bootstrap replicates.
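The recall and efficiency-gain curves in d can be computed along the following lines. This is a sketch under stated assumptions: pred_scores are top-model predictions for the full candidate pool, is_hit marks sequences whose measured brightness passes the hit threshold, and the null (random sampling) recall at budget k is taken as k divided by the pool size.

import numpy as np

def recall_and_efficiency(pred_scores, is_hit, budgets):
    # For each testing budget k: recall of true hits among the top-k
    # predicted candidates, and the ratio of that recall to the expected
    # recall of random sampling (k / pool size).
    pred_scores = np.asarray(pred_scores)
    is_hit = np.asarray(is_hit, dtype=bool)
    order = np.argsort(-pred_scores)  # best predicted first
    n_total, n_hits = len(is_hit), int(is_hit.sum())
    recalls, gains = [], []
    for k in budgets:
        recall = is_hit[order[:k]].sum() / n_hits
        recalls.append(recall)
        gains.append(recall / (k / n_total))
    return np.array(recalls), np.array(gains)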

