Bioinformatics. 2023 Mar 1;39(3):btad122.
doi: 10.1093/bioinformatics/btad122.

Accurate and efficient protein sequence design through learning concise local environment of residues

Bin Huang et al. Bioinformatics.

Abstract

Motivation: Computational protein sequence design is widely applied in rational protein engineering, and improving its accuracy and efficiency is highly desirable.

Results: Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of each residue's local environment and trains a transformer to learn the correlation between the local environment of a residue and its amino acid type. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type to each position based on its local environment within this structure, eventually obtaining a designed sequence in which all residues fit well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The predicted structures of the designed proteins closely resemble the target structures, with a state-of-the-art average TM-score exceeding 0.80. We further validated ProDESIGN-LE experimentally by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein.
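
The iterative assignment described above can be illustrated with a minimal sketch. It is not the ProDESIGN-LE implementation: `predict_distribution` is a hypothetical stand-in for the trained transformer, the stopping rule (every residue already equals its most likely type) is an assumed reading of "all residues fitting well with their local environments", and `toy_predictor` is a made-up model that ignores the neighbors entirely.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def design_sequence(length, predict_distribution, max_rounds=10_000, seed=0):
    """Iteratively mutate a random sequence toward the predictor's preferences.

    predict_distribution(position, sequence) is a hypothetical stand-in for the
    trained model: it returns a dict mapping residue type -> probability.
    """
    rng = random.Random(seed)
    sequence = [rng.choice(AMINO_ACIDS) for _ in range(length)]

    def best_residue(position):
        dist = predict_distribution(position, sequence)
        return max(dist, key=dist.get)

    for _ in range(max_rounds):
        # Assumed stopping rule: every residue already has its most likely type.
        if all(sequence[i] == best_residue(i) for i in range(length)):
            break
        # Pick a random position and assign the most likely residue type.
        position = rng.randrange(length)
        sequence[position] = best_residue(position)
    return "".join(sequence)


def toy_predictor(target):
    """Build a toy predictor that always prefers the residue in `target`."""
    def predict(position, sequence):
        dist = {aa: 0.01 for aa in AMINO_ACIDS}
        dist[target[position]] = 0.81  # 19 * 0.01 + 0.81 = 1.0
        return dist
    return predict


designed = design_sequence(8, toy_predictor("MKTAYIAK"))
print(designed)
```

In a real setting the predicted distribution at one position would change as its neighbors are mutated, which is why the loop re-evaluates every position before stopping.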

Availability and implementation: The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.

Figures

Figure 1
(Color online) Overview of the protein sequence design process using ProDESIGN-LE. (A) To design a sequence for a desired target backbone structure, ProDESIGN-LE starts from a random sequence and then iteratively assigns an appropriate residue type to a randomly selected position according to its local environment. For example, the local environment of position 114 is fed into the local environment encoder and residue type distribution calculator, yielding a distribution over 20 residue types. ProDESIGN-LE assigns the most likely amino acid, TYR, to position 114 and mutates the design sequence accordingly. ProDESIGN-LE repeats these steps until all residues fit well with their local environments, eventually obtaining a designed sequence. (B) An example of the full-atom local environment around a target residue (in red, at the center of the circle), which contains all atoms within a sphere of predefined radius centered at the residue. (C) The concise but informative representation of the local environment used by ProDESIGN-LE considers the relative positions of neighboring residues. For the residue TYR114, its three neighbors SER103, CYS110, and ALA117 are shown here. For each residue, we construct a local frame with x and y obtained by applying the Gram–Schmidt process to the vectors {Cα→C, Cα→N}, and z = x × y. We then calculate a 3 × 3 transform matrix R and a 3D translation vector t for each neighbor of TYR114. (D) Energy versus RMSD plot of the predicted structures for the intermediate sequences generated while redesigning protein CAT III. Here, RMSD measures the proximity of the predicted structure to the target structure. Structures with smaller RMSD usually have lower energy, especially the native-like proteins with RMSD < 5 Å (blue/dark dots). (E) The design process for protein CAT III. The predicted structure of the initial random sequence deviates greatly from the target structure (TM-score: 0.16). After 200 rounds of iteration, ProDESIGN-LE obtains a design whose predicted structure closely matches the target structure (TM-score: 0.64). (F) Superimposition of the predicted structure for the designed sequence (pink/light) on the target structure (blue/dark).
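
The local frame in panel (C) can be sketched in a few lines. The frame construction follows the caption (Gram–Schmidt on the Cα→C and Cα→N vectors, then a cross product). The caption does not define R and t precisely, so the convention used here is an assumption: R holds the neighbor's axes expressed in the central residue's frame, and t is the neighbor's Cα position in that frame. All coordinates are made up for illustration.

```python
import math


def sub(a, b):
    return [a[k] - b[k] for k in range(3)]


def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def local_frame(n_atom, ca_atom, c_atom):
    """Orthonormal axes (x, y, z) from Gram-Schmidt on Ca->C and Ca->N."""
    x = normalize(sub(c_atom, ca_atom))
    v = sub(n_atom, ca_atom)
    proj = dot(v, x)
    # Remove the component of Ca->N along x, then normalize.
    y = normalize([v[k] - proj * x[k] for k in range(3)])
    z = cross(x, y)
    return x, y, z


def relative_transform(frame_i, ca_i, frame_j, ca_j):
    """Assumed convention: neighbor j's axes and Ca position seen from frame i."""
    R = [[dot(frame_i[a], frame_j[b]) for b in range(3)] for a in range(3)]
    d = sub(ca_j, ca_i)
    t = [dot(frame_i[a], d) for a in range(3)]
    return R, t


# Made-up backbone coordinates (N, Ca, C) for a central residue and one neighbor.
center = ([-0.5, 1.4, 0.0], [0.0, 0.0, 0.0], [1.5, 0.0, 0.0])
neighbor = ([2.5, 5.4, 0.0], [3.0, 4.0, 0.0], [4.5, 4.0, 0.0])

frame_c = local_frame(*center)
frame_n = local_frame(*neighbor)
R, t = relative_transform(frame_c, center[1], frame_n, neighbor[1])
print(R, t)
```

Because both residues here have the same orientation, R comes out as the identity (up to rounding) and t is simply the Cα displacement; expressing neighbors this way makes the representation independent of the protein's global position and orientation.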
Figure 2
In silico assessment of the designed sequences for 68 naturally occurring (A–C) and 129 hallucinated proteins (D–F). We designed sequences for these proteins using FixBB, ProteinSolver, 3D-CNN, and ProDESIGN-LE and assessed the designed sequences using three metrics: (1) the sequence identity between the designed sequence and the native sequence of the target structure (C, F); (2) the structural similarity (measured using TM-score) between the target structure and the predicted structure of the designed sequence (A, D); (3) the energy of a threading structure, built by complementing the target backbone structure with the sidechains determined by the designed sequence, which measures the fitness between the designed sequence and the target backbone structure (B, E). We used AlphaFold2 (AF2) to predict structures for the naturally occurring proteins and ProFOLD-Single (PF) to predict structures for the hallucinated proteins.
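
As an illustration of the first metric, sequence identity for fixed-backbone design can be computed position by position, since the designed and native sequences have the same length. This is a minimal sketch, not the paper's evaluation code, and the two sequences are made up.

```python
def sequence_identity(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(designed, native))
    return matches / len(native)


# Made-up sequences: 6 of 8 positions agree.
identity = sequence_identity("MKTAYIAK", "MKSAYLAK")
print(identity)  # 0.75
```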
Figure 3
Accuracy of predicting a residue's type from its local environment. (A) ProDESIGN-LE predicts a distribution over 20 residue types for a target residue, assigning each residue type a confidence score. We calculate the top-K (K = 1, 2, 3, 4, and 5) accuracy over the predictions whose confidence score exceeds a cut-off (x axis). (B) The relationship between the prediction accuracy and the ground-truth residue type extracted from the native sequence of the protein.
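
The top-K accuracy in panel (A) can be sketched as follows. The caption does not spell out how the cut-off is applied, so this sketch assumes that a prediction's confidence is the probability of its most likely residue type and that predictions below the cut-off are excluded before counting. The distributions are made up and cover only a few residue types for brevity.

```python
def top_k_accuracy(predictions, true_residues, k, cutoff=0.0):
    """Top-K accuracy over predictions whose top confidence reaches `cutoff`.

    predictions: list of dicts mapping residue type -> confidence score.
    Returns None when no prediction passes the cut-off.
    """
    kept = 0
    hits = 0
    for dist, truth in zip(predictions, true_residues):
        ranked = sorted(dist, key=dist.get, reverse=True)
        if dist[ranked[0]] < cutoff:
            continue  # assumed filtering rule: drop low-confidence predictions
        kept += 1
        if truth in ranked[:k]:
            hits += 1
    return hits / kept if kept else None


# Made-up predictions and ground-truth residue types.
predictions = [
    {"Y": 0.70, "F": 0.20, "W": 0.10},
    {"A": 0.50, "G": 0.30, "S": 0.20},
    {"L": 0.40, "I": 0.35, "V": 0.25},
    {"K": 0.90, "R": 0.10},
]
truths = ["Y", "G", "V", "K"]

top1 = top_k_accuracy(predictions, truths, k=1)
top2 = top_k_accuracy(predictions, truths, k=2)
top1_confident = top_k_accuracy(predictions, truths, k=1, cutoff=0.6)
print(top1, top2, top1_confident)  # 0.5 0.75 1.0
```

Raising the cut-off keeps fewer predictions, so accuracy is reported over a smaller but more confident subset.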
Figure 4
Experimental characterization of the designed protein and the natural CAT III protein. (Column A) Thermostability analysis of the two proteins by nanoDSF measurement. For the designed protein CAT-h2, the onset denaturation temperature is 64.6°C and the folding Tm value is 72.5°C; for the natural protein CAT III, the onset denaturation temperature is 53.7°C and the folding Tm value is 74.8°C. Ratio: 350 nm/330 nm fluorescence intensity. (Column B) Circular dichroism spectra of the proteins from 185 to 260 nm at 25°C. The designed protein CAT-h2 (red) exhibited circular dichroism spectra consistent with those of the natural protein CAT III (blue).
