Cell. 2022 Oct 13;185(21):4008–4022.e14. doi: 10.1016/j.cell.2022.08.024. Epub 2022 Aug 31.

Deep mutational learning predicts ACE2 binding and antibody escape to combinatorial mutations in the SARS-CoV-2 receptor-binding domain


Joseph M Taft et al. Cell.

Abstract

The continual evolution of SARS-CoV-2 and the emergence of variants that show resistance to vaccines and neutralizing antibodies threaten to prolong the COVID-19 pandemic. Selection and emergence of SARS-CoV-2 variants are driven in part by mutations within the viral spike protein and in particular the ACE2 receptor-binding domain (RBD), a primary target site for neutralizing antibodies. Here, we develop deep mutational learning (DML), a machine-learning-guided protein engineering technology, which is used to investigate a massive sequence space of combinatorial mutations, representing billions of RBD variants, by accurately predicting their impact on ACE2 binding and antibody escape. A highly diverse landscape of possible SARS-CoV-2 variants is identified that could emerge from a multitude of evolutionary trajectories. DML may be used for predictive profiling on current and prospective variants, including highly mutated variants such as Omicron, thus guiding the development of therapeutic antibody treatments and vaccines for COVID-19.

Keywords: artificial intelligence; deep learning; deep sequencing; directed evolution; machine learning; protein engineering; viral escape; yeast display.


Conflict of interest statement

Declaration of interests: ETH Zurich has filed for patent protection on the technology described herein, and J.M.T., C.R.W., B.G., R.A.E., and S.T.R. are named as co-inventors. C.R.W. is an employee of Alloy Therapeutics (Switzerland) AG. C.R.W. and S.T.R. may hold shares of Alloy Therapeutics. S.T.R. is on the scientific advisory board of Alloy Therapeutics.

Figures

Graphical abstract

Figure 1
Overview of deep mutational learning of the RBD for prediction of ACE2 binding and antibody escape. The RBD of the SARS-CoV-2 spike protein is expressed on the surface of yeast, and mutagenesis libraries are designed on the RBM of the RBD (RBM-3, RBM-1, and RBM-2), the sites of interaction with ACE2 and neutralizing antibodies (e.g., therapeutic antibody drugs). RBD libraries are screened by FACS for binding to ACE2 and neutralizing antibodies, and both binding and non-binding (escape) populations are isolated and subjected to deep sequencing. Machine learning models are trained to predict binding status to ACE2 or antibodies from the RBD sequence and are then used to predict ACE2 binding and antibody escape for current and prospective variants and lineages.
Figure 2
Design of RBD mutagenesis libraries and screening by yeast surface display and deep sequencing (A) Shown is the amino acid usage in the combinatorial libraries (libraries 3C, 1C, and 2C). Degenerate codons are derived from DMS data for ACE2 binding (Starr et al., 2020). (B) Representative examples of degenerate codons tiled across RBM-2, which are pooled together to comprise library 2T. (C) Flow cytometry dot plots depict yeast display screening of combinatorial (1C, 2C, 2CE, and 3C) and tiling (1T, 2T, and 3T) RBD libraries and control RBD (Wu-Hu-1); gating schemes correspond to selection of ACE2-binding and non-binding variants. (D) Amino acid logo plots of the RBD are based on deep sequencing data from ACE2-binding and non-binding selections. (E) Flow cytometry dot plots depict yeast display screening of pooled RBD libraries (2C and 2CE) after selection for ACE2 binding; gating schemes correspond to selection of variants for binding and escape (non-binding) to monoclonal antibodies (mAbs). See also Figures S1 and S2 and Tables S1–S3.
Figure S1
Design and screening of RBD libraries, related to Figure 2 and Table S1 (A) Amino acid distribution of combinatorial libraries RBM-1 and RBM-3. (B) Yeast-displayed RBD libraries pre-selected for ACE2 binding were sorted by flow cytometry for binding and escape to four therapeutic monoclonal antibodies (mAbs): LY-CoV16, LY-CoV555, REGN10933, and REGN10987. (C) A further nine monoclonal antibodies were screened for binding and escape. Approximately 10⁷ yeast cells were screened for each antibody.
Figure S2
Combinatorial sequence space of RBD libraries following selection, related to Figure 2 and Table S2 Sequence logo plots of sorted populations for ACE2 binding and antibody escape. For each population, up to the 10,000 most abundant unique amino acid sequences after read count thresholding are shown.
Figure 3
Training and testing of machine and deep learning models for prediction of ACE2 binding and antibody escape based on RBD sequence (A) Deep sequencing data from ACE2 and monoclonal antibody (mAb) selections are one-hot encoded and used to train supervised machine learning (e.g., random forest [RF]) and deep learning models (e.g., recurrent neural network [RNN]). Models perform classification by predicting a probability (P) of ACE2 binding or non-binding and of mAb binding or escape (non-binding) based on the RBD sequence. (B and C) Performance of RF and RNN models trained on 2T, 2C, or Full ACE2 or LY-CoV16 binding data, shown by accuracy, F1, and receiver operating characteristic (ROC) curves. Models are evaluated by rounds of external cross-validation (n = 5), with mean performance displayed and standard deviation indicated by error bars. Low- and high-distance sequences are defined as those ≤ED5 and ≥ED6 from the Wu-Hu-1 RBD, respectively. (D and E) Accuracy, F1, and AUC of all 13 mAb models trained on RBM-2 and RBM-1 data, evaluated on both low- and high-distance test sequences. See also Figures S3 and S4 and Table S4.
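The classification step described in Figure 3A can be sketched as follows: a fixed-length RBD fragment is one-hot encoded and passed to a supervised classifier that outputs a probability of binding. This is a minimal illustration, not the paper's pipeline; the toy sequences, labels, and all parameter choices here are assumptions.

```python
# Minimal sketch of one-hot encoding + random forest classification of
# RBD fragments. The sequences and binding labels below are invented
# for illustration; the real training data come from deep sequencing.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str) -> np.ndarray:
    """Encode a fixed-length protein sequence as a flat one-hot vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

# Toy training set: fragments labeled 1 (ACE2 binding) or 0 (non-binding).
seqs = ["NYNYLYRLF", "NYNYRYRLF", "AANYLYRLF", "NYAYLYALF"]
labels = [1, 1, 0, 0]

X = np.stack([one_hot(s) for s in seqs])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Predict P(binding) for an unseen variant.
p = clf.predict_proba(one_hot("NYNYLYALF").reshape(1, -1))[0, 1]
print(round(float(p), 2))
```

An RNN would consume the per-position encoding directly rather than the flattened vector, but the binding/non-binding classification framing is the same.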
Figure S3
Performance metrics of machine learning models, related to Figure 3, Table S4, and Table S6 (A and B) K-nearest neighbors (KNN), logistic regression (Log Reg), naive Bayes (NB), random forest (RF), long short-term memory recurrent neural network (RNN), support vector machine with linear kernel (SVM Linear), and support vector machine with radial basis function kernel (SVM RBF) models were trained on the ACE2 deep sequencing data without hyperparameter optimization. Models were then challenged to perform classification by predicting a probability (P) of ACE2 binding on test data. Performance was evaluated by accuracy, F1, precision, and recall. All models except the RNN were trained using scikit-learn; the RNN was trained using Keras. (C and D) DMS-trained random forest (RF) and long short-term memory recurrent neural network (RNN) models were evaluated on the larger combinatorial ACE2 binding test data, shown by accuracy, F1 graphs, and ROC curves.
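The model benchmark in Figure S3 can be approximated with default-hyperparameter scikit-learn classifiers scored on accuracy, F1, precision, and recall. The synthetic data below stands in for the encoded deep-sequencing reads; everything else is a generic sketch, not the paper's code.

```python
# Sketch of benchmarking several classifiers with default hyperparameters
# and scoring them on accuracy, F1, precision, and recall. The synthetic
# dataset is an illustrative stand-in for encoded RBD sequences.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

X, y = make_classification(n_samples=500, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Log Reg": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(random_state=0),
    "SVM Linear": SVC(kernel="linear"),
    "SVM RBF": SVC(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "F1": f1_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
    }

for name, s in scores.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The RNN is the odd one out in the figure: it is trained in Keras on the per-position encoding rather than through this scikit-learn loop.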
Figure S4
Distribution of binding and non-binding across RBM regions, related to Figure 3 Count distributions of unique binding/non-binding sequences from the ACE2 and antibody selection library datasets after pre-processing. (A) RBM-1, (B) RBM-2, and (C) RBM-3.
Figure S5
Experimental evaluation of selected RBD variants for antibody escape, related to Figure 4 (A) The 46 selected synthetic variants were individually cloned and expressed for yeast display, and ACE2 binding was assessed by flow cytometry. 43 variants showed ACE2 binding or non-binding that matched machine learning predictions. The ACE2-binding status of two variants (38 and 42) could not be conclusively determined, while one variant (41) was incorrectly predicted by machine learning for ACE2 binding. (B) RBD sequences at chosen EDs (ED0, ED3, ED5, and ED7) from the Wu-Hu-1 RBD were predicted for ACE2 binding and escape from four therapeutic monoclonal antibodies (mAbs). Accuracies for antibody escape predictions are as follows: LY-CoV16 = 31/33 (93.94%), LY-CoV555 = 30/33 (90.91%), REGN10933 = 31/33 (93.94%), and REGN10987 = 32/33 (96.97%). (C and D) Two double mutants, and their constituent single mutations, which were predicted to display epistasis were assayed individually by yeast surface display. (E and F) Three synthetic RBD variants at ED3 from the Wu-Hu-1 RBD that were predicted by the consensus machine learning model to escape all four therapeutic antibodies were expressed as individual clones in yeast and evaluated by flow cytometry for binding to antibody or ACE2.
Figure 4
Prediction and experimental validation of synthetic lineages of RBD variants (A) Workflow to select and test synthetic variants at chosen edit distances (ED3, ED5, and ED7) from Wu-Hu-1 RBD. (B) Lineage plot of synthetic variants depicts machine learning predictions and experimental validation (Figure S5) for ACE2 binding and non-binding. (C) Dot plots of synthetic variants correspond to machine learning model (RF and RNN) predictions and experimental validation for antibody binding or escape. (D) Structural modeling by AlphaFold2 shows predicted structures of RBD variants that are ACE2 binding (green boxes) or non-binding (red boxes); control is Wu-Hu-1 RBD (black box). See also Figure S5.
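The edit distances used to stage these lineages (ED3, ED5, ED7) count how far a variant sits from the Wu-Hu-1 reference. Assuming fixed-length RBM sequences with substitutions only (no insertions or deletions), edit distance reduces to a simple positional mismatch count; the sequences below are illustrative, not real RBM fragments.

```python
# Hedged sketch of edit distance from a reference, assuming fixed-length
# sequences so that only substitutions are counted (a Hamming distance).
def edit_distance(variant: str, reference: str) -> int:
    """Number of positions where a fixed-length variant differs from the reference."""
    assert len(variant) == len(reference), "fixed-length comparison only"
    return sum(a != b for a, b in zip(variant, reference))

reference = "NYNYLYRLFRKSNLKPFERD"  # illustrative stand-in for a Wu-Hu-1 RBM segment
variant   = "NYKYLYRLFRKSNLEPFARD"  # three substitutions relative to the reference
print(edit_distance(variant, reference))  # → 3
```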
Figure 5
Predictive profiling of selected RBD variants for antibody escape across low mutational distances (A, D, and G) Heatmap depicts monoclonal antibody (mAb) binding as assessed by RF and RNN models of ED1 and ED2 variants of Alpha, Beta, and Kappa. (B, E, and H) The number of sequences escaping a combination of n (number) mAbs for ED1 and ED2 (agreement between models, threshold >0.5). (C, F, and I) Deep escape networks display possible evolutionary paths between variants and their escape from mAbs. See also Figure S6.
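A "deep escape network" of the kind shown in panels C, F, and I can be built by connecting variants that differ by a single mutation and annotating each node with how many mAbs it escapes. The variants and escape counts below are invented purely to show the construction; the paper's networks come from model predictions over real variant sets.

```python
# Rough sketch of building an escape network: nodes are variants, edges
# connect variants one substitution apart (possible evolutionary steps),
# and each node carries an (invented) count of mAbs escaped.
import itertools

variants = {  # variant sequence -> number of mAbs escaped (illustrative)
    "NYNYL": 0,
    "NYKYL": 1,
    "NYKYF": 2,
    "AYNYL": 1,
}

def hamming(a: str, b: str) -> int:
    """Positional mismatch count between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# Evolutionary paths: pairs of variants separated by a single mutation.
edges = [
    (a, b) for a, b in itertools.combinations(variants, 2) if hamming(a, b) == 1
]
for a, b in edges:
    print(f"{a} ({variants[a]} escaped) -- {b} ({variants[b]} escaped)")
```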
Figure S6
Predictive profiling of additional selected RBD variants for antibody escape across low mutational distances, related to Figure 5 (A, D, and G) Heatmap depicts monoclonal antibody (mAb) binding as assessed by RF and RNN models of ED1 and ED2 variants of Wu-Hu-1, Gamma, and B.1.523. (B, E, and H) The number of sequences escaping a combination of n (number) mAbs for ED1 and ED2 (agreement between models, threshold > 0.5). (C, F, and I) Deep escape networks display possible evolutionary paths between variants and their escape from mAbs.
Figure 6
Determining antibody robustness to synthetic RBD variants and mutational lineages (A) Omicron (BA.1) mutations covered by combinatorial library RBM-2. (B) Binding predictions for single and combinatorial mutations observed in Omicron. (C) Dynamic escape profile along the Omicron lineage, with the percentage of escape sequences across all mutations at distance 1–4 from Wu-Hu-1. (D) Antibody predictions for ACE2-binding RBDs for each antibody at edit distance 6–10 from Wu-Hu-1 (10,000 sequences simulated in triplicate; only confident predictions shown, i.e., P(ACE2 binding) > 0.5 and either P(antibody binding) > 0.75 or P(antibody escape) < 0.25 for both RNN and RF). (E) Total count of confident predictions across all distances (mean across triplicates).
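The "confident prediction" filter in panel D can be written out directly: a variant is kept only if it is predicted to bind ACE2 and both models agree confidently on the antibody outcome. This sketch assumes each model returns a single probability of antibody binding (so escape confidence is the same probability being low); the function name and toy inputs are illustrative.

```python
# Sketch of the confident-prediction filter from Figure 6D: require
# P(ACE2 binding) > 0.5, and require that BOTH the RF and RNN antibody
# models are confident (both > 0.75 for binding, or both < 0.25 for escape).
def is_confident(p_ace2: float, p_ab_rf: float, p_ab_rnn: float) -> bool:
    """Return True only for ACE2 binders with a confident consensus antibody call."""
    if p_ace2 <= 0.5:
        return False
    both_bind = p_ab_rf > 0.75 and p_ab_rnn > 0.75
    both_escape = p_ab_rf < 0.25 and p_ab_rnn < 0.25
    return both_bind or both_escape

print(is_confident(0.9, 0.80, 0.85))  # confident binder
print(is_confident(0.9, 0.10, 0.20))  # confident escape
print(is_confident(0.9, 0.60, 0.80))  # models disagree: filtered out
```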

References

    1. Akbar R., Robert P.A., Weber C.R., Widrich M., Frank R., Pavlović M., Scheffer L., Chernigovskaya M., Snapkov I., Slabodkin A., et al. In silico proof of principle of machine learning-based antibody design at unconstrained scale. Preprint at bioRxiv. 2021. doi: 10.1101/2021.07.08.451480.
    2. Antia R., Halloran M.E. Transition to endemicity: understanding COVID-19. Immunity. 2021;54:2172–2176. doi: 10.1016/j.immuni.2021.09.019.
    3. Barnes C.O., West A.P., Huey-Tubman K.E., Hoffmann M.A.G., Sharaf N.G., Hoffman P.R., Koranda N., Gristick H.B., Gaebler C., Muecksch F., et al. Structures of human antibodies bound to SARS-CoV-2 spike reveal common epitopes and recurrent features of antibodies. Cell. 2020;182:828–842.e16. doi: 10.1016/j.cell.2020.06.025.
    4. Baum A., Fulton B.O., Wloga E., Copin R., Pascal K.E., Russo V., Giordano S., Lanza K., Negron N., Ni M., et al. Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies. Science. 2020;369:1014–1018. doi: 10.1126/science.abd0831.
    5. Boder E.T., Wittrup K.D. Yeast surface display for screening combinatorial polypeptide libraries. Nat. Biotechnol. 1997;15:553–557. doi: 10.1038/nbt0697-553.
