Proteins. 2020 Jul;88(7):819-829.
doi: 10.1002/prot.25868. Epub 2020 Jan 6.

ProDCoNN: Protein design using a convolutional neural network

Yuan Zhang et al. Proteins. 2020 Jul.

Abstract

Designing protein sequences that fold to a given three-dimensional (3D) structure has long been a challenging problem in computational structural biology with significant theoretical and practical implications. In this study, we first formulated this problem as predicting the residue type given the 3D structural environment around the Cα atom of a residue, which is repeated for each residue of a protein. We designed a nine-layer 3D deep convolutional neural network (CNN) that takes as input a gridded box with the atomic coordinates and types around a residue. Several CNN layers were designed to capture structure information at different scales, such as bond lengths, bond angles, torsion angles, and secondary structures. Trained on a very large number of protein structures, the method, called ProDCoNN (protein design with CNN), achieved state-of-the-art performance when tested on large numbers of test proteins and benchmark datasets.

Keywords: ProDCoNN; convolutional neural network; inverse folding problem; protein design; protein engineering.

Figures

FIGURE 1
Architecture of the neural network. A, Left: snapshot of the gridded box that captures the environment atoms used to predict the amino acid type (red). Right: visualization of the atoms (blue) captured by the box, which are used as input to the CNN for sequence prediction. The BBO model uses the backbone atoms (C, Cα, N, O, and the extra oxygen atom OXT on the terminal carboxyl group) and a pseudo-Cβ atom, which is added manually with a Cα–Cβ bond length of 1.521 Å, an N–C–Cα bond angle of 110.4°, and an N–C–Cα–Cβ dihedral angle of 122.55°. The BBS model uses the same atom set but labels the Cβ atoms (green) on nontarget residues differently according to the residue types in the protein sequence. B, Architecture of the deep neural network. The input consists of N (Nchannel) sets of 18 × 18 × 18 gridded boxes, and each channel represents the occupancy of one type of atom
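The gridded-box input described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the channel order in `ATOM_CHANNELS`, the 18 Å box edge (`box_size`), the binary occupancy value, and the axis-aligned box are all assumptions made for illustration. Only the 18 × 18 × 18 grid size and the one-channel-per-atom-type layout come from the caption.

```python
import numpy as np

# Hypothetical channel order for a BBO-style atom set; the caption does not give one.
ATOM_CHANNELS = {"C": 0, "CA": 1, "N": 2, "O": 3, "OXT": 4, "CB": 5}


def voxelize(coords, atom_names, center, n_bins=18, box_size=18.0):
    """Return an occupancy grid of shape (n_channels, n_bins, n_bins, n_bins).

    coords:     (n_atoms, 3) positions in angstroms
    atom_names: one atom name per row of coords
    center:     position of the target residue's C-alpha atom
    box_size:   cube edge length in angstroms (assumed, not from the source)
    """
    grid = np.zeros((len(ATOM_CHANNELS), n_bins, n_bins, n_bins), dtype=np.float32)
    voxel = box_size / n_bins
    # Shift coordinates so the box spans [0, box_size) with `center` in the middle.
    shifted = np.asarray(coords, dtype=float) - np.asarray(center, dtype=float) + box_size / 2.0
    indices = np.floor(shifted / voxel).astype(int)
    for (i, j, k), name in zip(indices, atom_names):
        channel = ATOM_CHANNELS.get(name)
        inside = 0 <= i < n_bins and 0 <= j < n_bins and 0 <= k < n_bins
        if channel is not None and inside:
            grid[channel, i, j, k] = 1.0  # mark the voxel as occupied
    return grid


# Toy example: three atoms, the last of which falls outside the box.
example_grid = voxelize(
    coords=[[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [20.0, 0.0, 0.0]],
    atom_names=["CA", "C", "N"],
    center=[0.0, 0.0, 0.0],
)
print(example_grid.shape)
```

A BBS-style variant would need additional channels so that Cβ atoms on nontarget residues can be labeled by residue type; how those channels are arranged is not stated in the caption.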
FIGURE 2
Count of each amino acid in ID90TR (top); precision, recall, and F1 score for each amino acid for the BBO (blue) and BBS (orange) networks trained on ID90TR
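For reference, per-amino-acid precision, recall, and F1 score can be computed from paired true and predicted labels as in the sketch below. It uses the standard definitions of these metrics; the function name `per_class_scores` and the toy sequences are illustrative and not taken from the paper.

```python
from collections import Counter


def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, predicted_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true label but was missed
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        n_predicted = tp[label] + fp[label]
        n_actual = tp[label] + fn[label]
        precision = tp[label] / n_predicted if n_predicted else 0.0
        recall = tp[label] / n_actual if n_actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy example with one-letter amino acid codes.
for label, (p, r, f) in per_class_scores("AAGL", "AGGL").items():
    print(label, round(p, 3), round(r, 3), round(f, 3))
```

Rare amino acids tend to have fewer training examples, which is one reason a per-class breakdown is shown alongside the residue counts.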
FIGURE 3
Top-K recovery rate of different models. Left: BBO_ID30 (black), BBO_ID90 (red), BBS_ID30 (green), and BBS_ID90 (blue) models tested on ID30TS or ID90TS. Right: the same models tested on TS50
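The caption does not define the Top-K recovery rate. A common reading, assumed here, is the fraction of residues whose native amino acid is among the K highest-probability predictions. The sketch below follows that assumption; `top_k_recovery` and the toy probabilities are hypothetical.

```python
import numpy as np


def top_k_recovery(probabilities, true_indices, k):
    """Fraction of rows whose true class is among the k most probable classes.

    probabilities: (n_residues, n_classes) predicted class probabilities
    true_indices:  (n_residues,) integer index of the native class per residue
    """
    probabilities = np.asarray(probabilities, dtype=float)
    true_indices = np.asarray(true_indices)
    # Column indices of the k largest probabilities in each row.
    top_k = np.argsort(-probabilities, axis=1)[:, :k]
    hits = (top_k == true_indices[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example with three residues and three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.35, 0.25]]
native = [0, 1, 2]
for k in (1, 2, 3):
    print(k, top_k_recovery(probs, native, k))
```

With K = 1 this reduces to the ordinary sequence recovery rate, and it can only increase as K grows.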
FIGURE 4
Sequence predictions by the BBO_ID30 (BBO30), BBO_ID90 (BBO90), BBS_ID30 (BBS30), and BBS_ID90 (BBS90) models on human hemoglobin protein 1a3nA compared with the true wild-type sequence (red). The amino acids shown below the true label row are alternative amino acids at the corresponding position, taken from sequences that FATCAT identifies as similar to 1a3nA (P-value = 0, twist = 0). An orange background indicates an exactly correct prediction, and a blue background indicates that the prediction matches one of the alternative amino acids
FIGURE 5
Confusion matrix of BBO_ID90. The y-axis shows true labels and the x-axis shows predicted labels. The number in each entry indicates the number of times each amino acid was predicted as one of the 20 amino acids, and the color shows the corresponding probability
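A confusion matrix of this kind, with counts in each cell and a probability used for coloring, can be sketched as follows. The alphabetical residue order in `AMINO_ACIDS` and the choice to normalize each row (so that probabilities are conditional on the true label) are assumptions; the caption does not say how the probabilities were obtained.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes; order is an assumption
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def confusion_counts(true_seq, predicted_seq):
    """Rows are true labels, columns are predicted labels."""
    counts = np.zeros((20, 20), dtype=int)
    for t, p in zip(true_seq, predicted_seq):
        counts[INDEX[t], INDEX[p]] += 1
    return counts


def row_normalize(counts):
    """Convert counts to per-row probabilities; rows with no data stay zero."""
    totals = counts.sum(axis=1, keepdims=True)
    out = np.zeros(counts.shape, dtype=float)
    return np.divide(counts, totals, out=out, where=totals > 0)


counts = confusion_counts("AAG", "AGG")
probabilities = row_normalize(counts)
print(counts[INDEX["A"], INDEX["G"]], probabilities[INDEX["A"], INDEX["G"]])
```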
FIGURE 6
Modified confusion matrix for model BBO_ID90 (right) compared with BLOSUM62 (left). The two matrices are very similar (P-value = 0 from a permutation test)
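The caption does not state which test statistic or permutation scheme was used, nor how the confusion matrix was modified. As one possible reading, the sketch below runs a Mantel-style permutation test: it correlates the entries of two square matrices, then relabels the rows and columns of one matrix at random to build a null distribution. The function name, the Pearson correlation statistic, and the number of permutations are all assumptions.

```python
import numpy as np


def matrix_permutation_test(a, b, n_permutations=1000, seed=0):
    """Return (observed correlation, permutation P-value) for square matrices a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.shape[0]
    rng = np.random.default_rng(seed)

    def correlation(x, y):
        return np.corrcoef(x.ravel(), y.ravel())[0, 1]

    observed = correlation(a, b)
    n_exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        # Apply the same relabeling to the rows and columns of b.
        shuffled = b[np.ix_(perm, perm)]
        if correlation(a, shuffled) >= observed:
            n_exceed += 1
    return observed, n_exceed / n_permutations


# Toy example: b is a linear transform of a, so the matrices are perfectly correlated.
toy = np.random.default_rng(1).random((20, 20))
print(matrix_permutation_test(toy, 2.0 * toy + 1.0))
```

Under this scheme a reported P-value of 0 means that no random relabeling matched or exceeded the observed similarity. Some authors instead report (count + 1) / (permutations + 1), which can never be exactly zero.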
FIGURE 7
Count of each amino acid in ID30TR and ID90TR (top); precision, recall, and F1 score for each amino acid for the networks BBO_ID30 (blue) and BBS_ID90 (orange)

