Acta Crystallogr D Struct Biol. 2019 Aug 1;75(Pt 8):696-717.

doi: 10.1107/S2059798319008933. Epub 2019 Jul 30.

Prediction of models for ordered solvent in macromolecular structures by a classifier based upon resolution-independent projections of local feature data

Laurel Jones¹, Michael Tynes², Paul Smith¹

Affiliations

¹ Department of Chemistry, Fordham University, Bronx, NY 10458, USA.
² Department of Computer and Information Science, Fordham University, Bronx, NY 10458, USA.

PMID: 31373570
PMCID: PMC6677017
DOI: 10.1107/S2059798319008933

Laurel Jones et al. Acta Crystallogr D Struct Biol. 2019.

Abstract

Current software tools for the automated building of models for macromolecular X-ray crystal structures are capable of assembling high-quality models for ordered macromolecule and small-molecule scattering components with minimal or no user supervision. Many of these tools also incorporate robust functionality for modelling the ordered water molecules that are found in nearly all macromolecular crystal structures. However, no current tools focus on differentiating these ubiquitous water molecules from other frequently occurring multi-atom solvent species, such as sulfate, or the automated building of models for such species. PeakProbe has been developed specifically to address the need for such a tool. PeakProbe predicts likely solvent models for a given point (termed a ‘peak’) in a structure based on analysis (‘probing’) of its local electron density and chemical environment. PeakProbe maps a total of 19 resolution-dependent features associated with electron density and two associated with the local chemical environment to a two-dimensional score space that is independent of resolution. Peaks are classified based on the relative frequencies with which four different classes of solvent (including water) are observed within a given region of this score space as determined by large-scale sampling of solvent models in the Protein Data Bank. Designed to classify peaks generated from difference density maxima, PeakProbe also incorporates functionality for identifying peaks associated with model errors or clusters of peaks likely to correspond to multi-atom solvent, and for the validation of existing solvent models using solvent-omit electron-density maps. When tasked with classifying peaks into one of four distinct solvent classes, PeakProbe achieves greater than 99% accuracy for both peaks derived directly from the atomic coordinates of existing solvent models and those based on difference density maxima. While the program is still under development, a fully functional version is publicly available. PeakProbe makes extensive use of cctbx libraries, and requires a PHENIX licence and an up-to-date phenix.python environment for execution.
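The classification step described above, in which a peak is assigned to whichever solvent class is observed most often in its region of the two-dimensional score space, can be illustrated with a minimal sketch. This is not PeakProbe's code: the function names, the square grid, the bin width and the toy training points are all hypothetical, and raw counts are used without any correction for the much greater abundance of water in the Protein Data Bank.

```python
from collections import Counter, defaultdict

# The four solvent classes named in the abstract and in Figure 2.
CLASSES = ("water", "sulfate", "heterogen", "metal")


def bin_index(ed_score, cc_score, bin_width=0.5):
    """Map a point in (ED, CC) score space to a grid cell (hypothetical bin width)."""
    return (int(ed_score // bin_width), int(cc_score // bin_width))


def fit_histograms(samples, bin_width=0.5):
    """Count labelled training peaks per grid cell.

    samples: iterable of (ed_score, cc_score, label) tuples.
    """
    counts = defaultdict(Counter)
    for ed_score, cc_score, label in samples:
        counts[bin_index(ed_score, cc_score, bin_width)][label] += 1
    return counts


def classify(counts, ed_score, cc_score, bin_width=0.5):
    """Return (most frequent class, relative frequencies) for the peak's cell.

    Returns (None, {}) when no training peak fell in that cell.
    """
    cell = counts.get(bin_index(ed_score, cc_score, bin_width))
    if not cell:
        return None, {}
    total = sum(cell.values())
    freqs = {name: cell[name] / total for name in CLASSES}
    return max(freqs, key=freqs.get), freqs


# Toy training data (invented values, for illustration only).
TRAIN = [
    (0.1, 0.2, "water"), (0.3, 0.1, "water"), (0.2, 0.4, "sulfate"),
    (1.2, 1.1, "sulfate"), (1.3, 1.4, "sulfate"), (-0.7, 2.1, "metal"),
]
COUNTS = fit_histograms(TRAIN)
```

Calling `classify(COUNTS, 0.25, 0.25)` looks up the cell that holds two water peaks and one sulfate peak and so returns `"water"` with a relative frequency of 2/3. An empty cell returns `None`, loosely analogous to the grey regions of Figure 2 that are unlikely to correspond to a true solvent model.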

Keywords: PeakProbe; data mining; decorrelation; electron-density analysis; resolution; solvent modelling; supervised learning.

open access.

Figures

**Figure 1**
An overview of the *PeakProbe* structure and workflow.

**Figure 2**
Solvent-class histograms over the score space spanned by ED and CC scores coloured according to which class of solvent is most likely given the corresponding CC and ED scores. Water, sulfate, heterogen and metal classes are shown in blue, red, green and yellow, respectively. The dark blue and red regions correspond to contours containing 50% of observed water and sulfate in training data. Grey regions correspond to regions that are highly unlikely to correspond to a true solvent model.

**Figure 3**
Data processing and feature scoring. (a) Procedures for mapping features to scores. For each group of features, the number of resolution-dependent parameters (red) or resolution-independent parameters (blue) required is shown. (b) Histograms of training data for a single feature in a single bin of resolution along with fitted probability density functions (PDFs) for water and sulfate data. Distributions like those shown are used to calculate ED and CC scores using the equations and definitions shown. (c) Example RPMS. Bin values for the observed mean (black dots) and standard deviation (cyan dots) for feature CC1 are plotted versus resolution. The spline fit to each series is shown as a solid line coloured similarly.
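One plausible use of the per-bin means and standard deviations in panel (c) is to standardise a raw feature value against what is typical at its resolution. The sketch below assumes this interpretation, which the caption does not state explicitly. It also swaps in piecewise-linear interpolation for the spline fit, and the bin statistics and function names are hypothetical.

```python
from bisect import bisect_right


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation, clamped at both ends.

    A simple stand-in for the spline fit mentioned in the caption.
    xs must be sorted in ascending order.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def standardize(value, resolution, bin_centres, bin_means, bin_sds):
    """Express a feature value in standard deviations from the
    mean expected at the given resolution."""
    mean = interpolate(bin_centres, bin_means, resolution)
    sd = interpolate(bin_centres, bin_sds, resolution)
    return (value - mean) / sd


# Hypothetical per-bin statistics for one feature (resolution in angstroms).
RESO = [1.0, 2.0, 3.0]
MEAN = [0.90, 0.80, 0.60]
SD = [0.05, 0.10, 0.20]

z = standardize(0.70, 2.0, RESO, MEAN, SD)  # one SD below the 2.0 A mean
```

Because the expected mean and spread are themselves functions of resolution, the standardised value can be compared across structures determined at different resolutions, which is the property the score space relies on.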

**Figure 4**
Results of data-processing methods on training data. In (a), (b) and (c), the values refer to Pearson’s product–moment correlation coefficient and r.m.s. refers to root-mean-square. (a) Inter-feature correlation for three example feature pairs. (b) Distributions of all 210 inter-feature correlation coefficients versus resolution. Coefficients are converted to r.m.s. values, plotted boxes correspond to values within the interquartile range, the median is shown as a horizontal bar and values outside this range are shown as dots. (c) Summary of inter-feature and versus-resolution correlation for ED and CC features at each stage of data processing. Inter-feature (inter-feat.) values refer to correlations between grouped features and versus-resolution (vs reso.) values to correlations between features and crystallographic resolution. Values are given for each stage in the data-processing workflow described in Fig. 2(a). (d) ED scores binned by resolution. Blue and red boxes represent the range of training-data ED scores for water and sulfate, respectively, that fall within one standard deviation of the mean (horizontal bar) of each resolution bin.
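The figure of 210 coefficients in panel (b) follows from the feature count given in the abstract: 19 electron-density features plus two chemical-environment features give 21 features, and 21 features form 21 × 20 / 2 = 210 distinct pairs. The sketch below checks that count and shows how Pearson's coefficient and an r.m.s. summary can be computed. The helper names are hypothetical, and the r.m.s. over a set of coefficients is only one possible reading of how the caption's summary values are formed.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rms(values):
    """Root-mean-square of a collection of values, e.g. correlation coefficients."""
    return sqrt(sum(v * v for v in values) / len(values))


# 19 electron-density features + 2 chemical-environment features.
N_FEATURES = 19 + 2
FEATURE_PAIRS = list(combinations(range(N_FEATURES), 2))
N_PAIRS = len(FEATURE_PAIRS)  # 210 distinct inter-feature pairs
```

An r.m.s. summary treats positive and negative correlations alike, so a value near zero indicates that features are close to uncorrelated in either direction.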

**Figure 5**
Peak-classification examples from six different structures. Peaks are shown as black spheres, macromolecular components are coloured by element (carbon in brown, oxygen in red, nitrogen in blue, hydrogen omitted) and posited models are shown in green. Electron density is shown as a mesh (F_o − F_c in red and 2F_o − F_c in grey contoured at 3.0σ and 1.0σ, respectively, unless noted otherwise). Top row: peaks from training data with nominally incorrect *PeakProbe* classifier predictions likely to be mislabelled in the PDB. Bottom row: peaks not associated with any existing solvent models but strongly predicted to belong to the class indicated by *PeakProbe*. Details are as follows. (a) PDB entry 4aqp (2.45 Å); the peak is water A2001 predicted to be a sulfate (modelled in green). Crystals were grown in the presence of the sulfate pseudo-analog 2-(N-morpholino)ethanesulfonic acid (MES). (b) PDB entry 2xrz (2.20 Å); the central peak is water B2012. Of the six peaks shown, four were strongly predicted to be heterogen. Crystals were grown in the presence of polyethylene glycol, a two-conformer model for which is shown as a point of reference in green. (c) PDB entry 2p3i (1.75 Å); the peak is the central S atom of sulfate A3000 (shown in cyan) and was strongly predicted to be water. (d) PDB entry 1mh3 (2.10 Å); the peak is adjacent to the terminal N atom of lysine A500 and was strongly predicted to be a sulfate (modelled in green). (e) PDB entry 2wjj (2.41 Å); four peaks are shown bracketed between the side chains of glutamate A95 and lysine A132, all strongly predicted to be heterogen. Crystallization conditions give no indications of likely models. (f) PDB entry 3zm4 (2.37 Å), F_o − F_c density contoured at 5.0σ, 2F_o − F_c density at 1.8σ; the peak is at a special position adjacent to aspartate A65 and is strongly predicted to be a metal. Crystals were grown in the presence of 0.2 M Ca²⁺ and the crystal lattice appears to be held together by electrostatic attraction between the acidic side chains shown and an unmodelled cation.
