Predicting protein model correctness in Coot using machine learning

Paul S Bond¹, Keith S Wilson¹, Kevin D Cowtan¹

PMID: 32744253
PMCID: PMC7397494
DOI: 10.1107/S2059798320009080

Acta Crystallogr D Struct Biol. 2020 Aug 1;76(Pt 8):713-723. doi: 10.1107/S2059798320009080. Epub 2020 Jul 27.

Affiliation

¹ Department of Chemistry, University of York, York YO10 5DD, United Kingdom.

Abstract

Manually identifying and correcting errors in protein models can be a slow process, but improvements in validation tools and automated model-building software can contribute to reducing this burden. This article presents a new correctness score that is produced by combining multiple sources of information using a neural network. The residues in 639 automatically built models were marked as correct or incorrect by comparing them with the coordinates deposited in the PDB. A number of features were also calculated for each residue using Coot, including map-to-model correlation, density values, B factors, clashes, Ramachandran scores, rotamer scores and resolution. Two neural networks were created using these features as inputs: one to predict the correctness of main-chain atoms and the other for side chains. The 639 structures were split into 511 that were used to train the neural networks and 128 that were used to test performance. The predicted correctness scores could correctly categorize 92.3% of the main-chain atoms and 87.6% of the side chains. A Coot ML Correctness script was written to display the scores in a graphical user interface as well as for the automatic pruning of chains, residues and side chains with low scores. The automatic pruning function was added to the CCP4i2 Buccaneer automated model-building pipeline, leading to significant improvements, especially for high-resolution structures.
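The train/test division described above is made per structure, so every residue of a given model falls on one side of the split. A minimal sketch of such a split follows; the shuffling procedure, the seed and the `model_NNN` identifiers are assumptions for illustration, since the abstract states only the resulting counts (511 and 128).

```python
import random


def split_structures(structure_ids, n_test, seed=0):
    """Shuffle structure IDs and hold out n_test of them for testing.

    Splitting by structure (not by residue) keeps all residues of one
    model in the same subset. The random shuffle is an assumption.
    """
    ids = list(structure_ids)
    random.Random(seed).shuffle(ids)
    return ids[n_test:], ids[:n_test]


# Hypothetical placeholder IDs standing in for the 639 built models.
structures = [f"model_{i:03d}" for i in range(639)]
train_ids, test_ids = split_structures(structures, n_test=128)
print(len(train_ids), len(test_ids))  # 511 128
```

Holding out 128 of 639 structures corresponds to roughly a 20% test set.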

Keywords: Coot; machine learning; model building; software; structure solution; validation.

open access.

Figures

**Figure 1**
Diagram of the neural network. The input layer contains N scaled features (12 for the main-chain network and nine for the side-chain network), the hidden layer contains ten neurons and the output layer contains only one output with the correctness value. Each arrow has an associated coefficient and intercept that are modified during training.
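The architecture in this caption (N inputs, ten hidden neurons, one output) can be sketched as a plain forward pass. This is an illustration under stated assumptions, not the authors' implementation: the caption does not name the activation functions, so a ReLU hidden layer and a logistic output are assumed here, the weights are random placeholders where a trained network would have fitted values, and intercepts are stored one per neuron.

```python
import math
import random


def random_network(n_inputs, n_hidden=10, seed=0):
    """Build placeholder coefficients and intercepts (untrained)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = rng.uniform(-1, 1)
    return w_hidden, b_hidden, w_out, b_out


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Map scaled features to a correctness value between 0 and 1."""
    # Hidden layer: weighted sum plus intercept, then ReLU (assumed).
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Output layer: a single neuron with logistic activation (assumed).
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


# 12 inputs for the main-chain network, nine for the side-chain network.
main_chain_net = random_network(n_inputs=12)
side_chain_net = random_network(n_inputs=9)

main_score = forward([0.0] * 12, *main_chain_net)
side_score = forward([0.0] * 9, *side_chain_net)
```

With real data, the all-zero feature vectors would be replaced by the scaled per-residue features listed in the abstract.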

**Figure 2**
Confusion matrices for (a) the main-chain and (b) the side-chain network. Values shown are percentages of residues in the test set.
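A confusion matrix expressed as percentages, as in this figure, can be computed from predicted scores and target labels as sketched below. The 0.5 cut-off and the toy scores are assumptions for illustration; the caption does not state the threshold used.

```python
def confusion_percentages(scores, targets, threshold=0.5):
    """Return TP, FP, TN and FN as percentages of all residues.

    scores: predicted correctness values in [0, 1].
    targets: True where the residue was marked correct.
    threshold: assumed cut-off for predicting "correct".
    """
    if len(scores) != len(targets) or not scores:
        raise ValueError("scores and targets must be non-empty and equal in length")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, target in zip(scores, targets):
        predicted_correct = score >= threshold
        if predicted_correct and target:
            counts["TP"] += 1
        elif predicted_correct and not target:
            counts["FP"] += 1
        elif not predicted_correct and not target:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    total = len(scores)
    return {key: 100.0 * value / total for key, value in counts.items()}


# Toy example with five residues (made-up values).
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
toy_targets = [True, True, True, False, False]
matrix = confusion_percentages(toy_scores, toy_targets)
accuracy = matrix["TP"] + matrix["TN"]
```

The fraction of correctly categorized residues is the sum of the true-positive and true-negative percentages, which is how a figure such as 92.3% would be read off the matrix.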

**Figure 3**
A reversed amide bond where negative difference density at the next $C^\alpha$ suggests an error in the previous residue. The example is a peptide bond between asparagine and glycine in a 1.86 Å resolution structure built by *Buccaneer* that was not used in this study. The $2mF_o - DF_c$ map is shown in grey. The positive and negative contours of the $mF_o - DF_c$ map are shown in green and red, respectively.

**Figure 4**
Resolution and mean main-chain target correctness for 639 structures in the training and test sets. The mean value for ten resolution bins is shown as a line.

**Figure 5**
Change in completeness, $R_{\mathrm{work}}$ and $R_{\mathrm{free}}$ between the released pipeline and the chain-pruning pipeline. The 867 structures were divided into ten resolution bins and the mean and standard error of the change for each bin is shown.
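The binned statistic used in this figure can be sketched as follows. The caption does not say whether the ten bins have equal width or equal population, so equal-population bins are assumed here, and the input values are synthetic placeholders instead of results from the paper.

```python
import math
import random
import statistics


def binned_mean_and_sem(resolutions, changes, n_bins=10):
    """Sort by resolution, cut into equal-population bins (assumed),
    and return (mean resolution, mean change, standard error, count)
    for each bin."""
    pairs = sorted(zip(resolutions, changes))
    n = len(pairs)
    if n < 2 * n_bins:
        raise ValueError("need at least two values per bin")
    results = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        values = [change for _, change in chunk]
        mean_resolution = statistics.fmean(res for res, _ in chunk)
        mean_change = statistics.fmean(values)
        # Standard error of the mean: sample std. dev. / sqrt(count).
        sem = statistics.stdev(values) / math.sqrt(len(values))
        results.append((mean_resolution, mean_change, sem, len(values)))
    return results


# Synthetic stand-ins for 867 structures (not data from the paper).
rng = random.Random(0)
fake_resolutions = [rng.uniform(1.0, 3.5) for _ in range(867)]
fake_changes = [rng.gauss(0.0, 1.0) for _ in range(867)]
bins = binned_mean_and_sem(fake_resolutions, fake_changes)
```

The same routine would apply to completeness-based binning by passing completeness values in place of resolutions.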

**Figure 6**
Change in completeness, $R_{\mathrm{work}}$ and $R_{\mathrm{free}}$ between the chain-pruning pipeline and the full pruning pipeline. The 867 structures were divided into ten resolution bins and the mean and standard error of the change for each bin is shown.

**Figure 7**
Completeness of the models from the released pipeline and the full pruning pipeline for the 867 structures tested.

**Figure 8**
Change in completeness, $R_{\mathrm{work}}$ and $R_{\mathrm{free}}$ between the released pipeline and the full pruning pipeline against the completeness of the model from the released pipeline. The 867 structures were divided into ten completeness bins and the mean and standard error of the change for each bin is shown.

**Figure 9**
A section of PDB entry 4wn5 in (a) the model built by the released pipeline and (b) the model built by the full pruning pipeline. The $2mF_o - DF_c$ map is shown in blue. The positive and negative contours of the $mF_o - DF_c$ map are shown in green and red, respectively. The yellow shaded area shows that the peptide bond is twisted, *i.e.* the ω angle is between 30° and 150°.
