Sci Data. 2023 Mar 14;10(1):134.
doi: 10.1038/s41597-023-02035-z.

SARS-CoV-2 receptor-binding domain deep mutational AlphaFold2 structures

Oz Kilim et al. Sci Data. 2023.

Abstract

Leveraging recent advances in computational modeling of proteins with AlphaFold2 (AF2), we provide a complete, curated data set of all single mutations of the spike protein receptor-binding domain (RBD) for each of the 7 main SARS-CoV-2 lineages, resulting in 3819 × 7 = 26,733 PDB structures. We visualize the generated structures and show that AF2 pLDDT values are correlated with state-of-the-art disorder approximations, implying that some internal protein dynamics are also captured by the model. Jointly increasing the mutational coverage of both structural and phenotype data, coupled with advances in machine learning, can be leveraged to accelerate virology research, specifically future variant prediction. We hope this data release can assist further understanding of the local and global mutational landscape of SARS-CoV-2 and provide insight into how 3D structure acts as a bridge between protein genotype and phenotype.
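The structure count in the abstract can be checked with a short sketch. It assumes an RBD construct of 201 residues, which is an inference from 3819 = 201 × 19 and is not stated in the abstract; the function name `single_mutants` and the toy fragment are hypothetical and are not taken from the data release.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(sequence):
    """Yield (label, mutant_sequence) for every single substitution."""
    for i, wild_type in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue == wild_type:
                continue  # skip the unmutated residue
            label = f"{wild_type}{i + 1}{residue}"  # e.g. "N1A", 1-based position
            yield label, sequence[:i] + residue + sequence[i + 1:]


# Hypothetical 6-residue fragment, used only to keep the example small.
toy = "NITNLC"
mutants = dict(single_mutants(toy))
print(len(mutants))  # 6 positions x 19 substitutions

# Assumed RBD length of 201 residues (inferred, not stated in the abstract).
per_lineage = 201 * 19
print(per_lineage, per_lineage * 7)
```

Each position contributes 19 substitutions, so the per-lineage count scales linearly with the sequence length.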

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Sketch of protein representations and their projections. Sequence space 𝔽, structure space 𝕊, adjacency matrix space 𝔸 and phenotype space ℙ are the spaces of all possible proteins for a given representation. 𝕊 contains 𝔽 and 𝔸; formally, 𝔽, 𝔸 ⊂ 𝕊. Each protein has a FASTA one-hot-encoded representation F ∈ 𝔽, a PDB file S ∈ 𝕊, an adjacency projection of the PDB file A ∈ 𝔸, and some measured phenotypic properties (function) P ∈ ℙ. We compare the projections 𝔽 and 𝔸 with respect to how a model f learns from these representations to make predictions about ℙ. (a) Structure prediction with AlphaFold2. (b) Learning to predict protein-protein binding affinities from FASTA sequences. In the limit of huge amounts of genomic and phenotype data, this may even build such a rich internal representation of protein interaction dynamics that explicit structure modeling (the top path of the loop) is not required. (c) Creation of adjacency matrices from PDB structures. Representations in 𝔸 carry no chemical information, so they can be used to analyze whether the AF2 projection to 𝕊 actually captured geometric signal that can be leveraged for phenotype prediction tasks. This representation has the added advantage of being rotation agnostic. (d) Learning to predict protein-protein binding affinities from adjacency matrices. (e) 𝕊 representations in PDB format contain both chemical and geometrical information. An end goal could be to use this representation to build models that predict ℙ in a similar fashion to previously proposed methods. However, this pathway is only worth pursuing if we validate that (d) is possible to some extent.
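The adjacency projection in panel (c) and its rotation-agnostic property can be illustrated with a minimal sketch. The caption does not say how the matrices were built, so the contact definition here (one point per residue, a distance cutoff of 8 Å) is an assumption, and the helper names `adjacency_matrix` and `rotate_z` are hypothetical.

```python
import math


def adjacency_matrix(coords, cutoff=8.0):
    """Binary contact map: 1 if two distinct points lie within `cutoff` (assumed 8 A)."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


def rotate_z(coords, angle):
    """Rotate every point about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


# Toy coordinates: four points spaced 3.8 A apart on a line (not real RBD data).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

original = adjacency_matrix(coords)
rotated = adjacency_matrix(rotate_z(coords, math.radians(37.0)))

# Pairwise distances are unchanged by rotation, so the contact map is too.
print(original == rotated)
```

Because the matrix depends only on pairwise distances, no structural alignment is needed before comparing two structures in this representation.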
Fig. 2
(a) The AF2-predicted Wuhan WT RBD, aligned and superimposed onto the experimentally determined structure 6M0J (RBD-ACE2 complex), shows excellent agreement in both local and global structure. The RMSD is 0.67 Å, which is due to slight deviations between the structures in “loop” areas such as positions 371 and 478. (b) Variant-defining mutations on the SARS-CoV-2 spike protein RBD. The Wuhan RBD is shown in cartoon representation, while the variant-defining mutations are drawn with the licorice method. The residue positions and the color codes are indicated.
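For readers unfamiliar with the RMSD figure quoted in panel (a), the sketch below computes it for two coordinate sets that are assumed to be already superimposed and matched atom by atom. The superposition step itself (for example a Kabsch alignment) is not shown, and the function name `rmsd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned, equal-length point sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy example: every point in `shifted` is displaced by 1 A along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
value = rmsd(reference, shifted)
print(value)
```

Since the deviation is averaged over all matched atoms, a few flexible loop positions can account for most of a small overall value.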
Fig. 3
(a) Visualization of the backbones of the entire cluster of single mutants from Wuhan WT. Variation is observable, but the global overlap is clear. (b) Visualization of the entire cluster of single mutants from Wuhan WT with side chains visible. Positional diversity is more prominent here than in the backbone-only view.
Fig. 4
(a) UMAP embedding of the one-hot encoding representation from FASTA files. Distinct clusters are seen for each variant, with homogeneous spacing in 𝔽. (b) UMAP embedding of all adjacency matrices. Similar clusters are conserved in the 3D structural data; however, there is more overlap between clusters in 𝔸 (the space of adjacency matrices, see Fig. 1c), indicating structural similarity between some variants. These embeddings also offer insight into the higher dimensionality and complexity of the generated structural information. We observe small sub-clusters (in 𝔸) of some lineages where mutations at the core of the structure cause larger structural distortion; this may also cause drastic phenotypic change.
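The one-hot representation embedded in panel (a) can be sketched as follows. The residue ordering and the helper names `one_hot` and `hamming` are assumptions made for illustration, and the UMAP step is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Flatten a sequence into a 0/1 vector with 20 entries per residue."""
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1
        vector.extend(row)
    return vector


def hamming(u, v):
    """Number of entries in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Toy sequences differing by a single substitution (not real RBD data).
parent = one_hot("ACD")
mutant = one_hot("ACE")
print(len(parent), hamming(parent, mutant))
```

In this encoding every single mutant sits at the same distance from its parent sequence, which may be one reason the clusters in 𝔽 appear evenly spaced.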
Fig. 5
(a) Amino acid-wise protein disorder analysis for the Wuhan single mutants. In the upper diagram, discrete “valleys” are observed that are common to all single mutants. This illustrates the physical consistency of AF2 predictions within a single mutational cluster. The thick pink line arises from the extensive overlap of the pLDDT and IUPred2 values of many variants. (b) Structural intuition into high IUPred2 values. The more disordered positions are at loops, shown in red and green. (c) Transformed axes with the Delta cluster as an example. A strong correlation is observed between the AF2 uncertainty measurement pLDDT and the IUPred2 disorder prediction. See Table 1.
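A per-residue comparison like the one in panel (c) can be sketched with a plain Pearson correlation. The caption does not specify the axis transformation or the statistic reported in Table 1, so this is only an illustration; the numbers below are invented toy values, not results from the data set, and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented per-residue values: pLDDT on a 0-100 scale, disorder on a 0-1 scale.
plddt = [95.0, 90.0, 70.0, 60.0]
disorder = [0.10, 0.25, 0.55, 0.80]

r = pearson(plddt, disorder)
print(r)  # strongly negative for this toy data
```

With these toy values, low-confidence residues coincide with high disorder scores, so the coefficient is negative; the sign and strength for the real clusters depend on the transformation used in the figure.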
