J Comput Aided Mol Des. 2021 Jul;35(7):819-830.
doi: 10.1007/s10822-021-00400-x. Epub 2021 Jun 28.

Predicting partition coefficients for the SAMPL7 physical property challenge using the ClassicalGSG method

Nazanin Donyapour et al. J Comput Aided Mol Des. 2021 Jul.

Abstract

The prediction of log P values is one part of the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) blind challenges. Here, we use a molecular graph representation method called Geometric Scattering for Graphs (GSG) to transform atomic attributes into molecular features. The atomic attributes used here are parameters from classical molecular force fields, including partial charges and Lennard-Jones interaction parameters. The molecular features from GSG are used as inputs to neural networks that are trained using a "master" dataset comprising over 41,000 unique log P values. The specific molecular targets in the SAMPL7 log P prediction challenge were unique in that they all contained a sulfonyl moiety. This motivated a set of ClassicalGSG submissions where predictors were trained on different subsets of the master dataset that are filtered according to chemical types and/or the presence of the sulfonyl moiety. We find that our ranked prediction obtained 5th place with an RMSE of 0.77 log P units and an MAE of 0.62, while one of our non-ranked predictions achieved first place among all submissions with an RMSE of 0.55 and an MAE of 0.44. After the conclusion of the challenge we also examined the performance of open-source force field parameters that allow for an end-to-end log P predictor model: General AMBER Force Field (GAFF), Universal Force Field (UFF), Merck Molecular Force Field 94 (MMFF94) and Ghemical. We find that ClassicalGSG models trained with atomic attributes from MMFF94 can yield more accurate predictions compared to those trained with CGenFF atomic attributes.

Keywords: Chemical features; Geometric scattering for graphs; Log P; Machine learning; Molecular representations; Neural networks; Partition coefficient; SAMPL7 challenge.
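
The abstract reports accuracy as RMSE and MAE in log P units. As a reminder of how these two metrics are defined, here is a minimal sketch in Python. The function names and the three example values are hypothetical illustrations and are not data from the paper or the SAMPL7 challenge.

```python
import math

def rmse(predicted, experimental):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, experimental):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(p - e) for p, e in zip(predicted, experimental)]
    return sum(absolute) / len(absolute)

# Hypothetical log P values, chosen only to make the arithmetic easy to follow.
experimental_logp = [1.0, 2.0, 3.0]
predicted_logp = [1.5, 2.0, 2.0]

print(f"RMSE = {rmse(predicted_logp, experimental_logp):.3f}")
print(f"MAE  = {mae(predicted_logp, experimental_logp):.3f}")
```

RMSE weights large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions, which is consistent with the pairs of values reported above.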

Figures

Fig. 1
The SAMPL7 log P challenge molecules. The SAMPL7 target molecules are shown as 2D structures in their neutral microstate (micro000). The 2D structures are generated and drawn from SMILES by RDKit [64].
Fig. 2
Cumulative distribution of MAE of molecules in the S7_TEST set. The solid blue line shows the cumulative distributions for each set of predictions. The dashed red line represents the probability of 90%. Panels A through D show MAEs using models trained on DB1 through DB4, respectively.
Fig. 3
Prediction intervals of log P predictions for the SAMPL7 molecules. The experimental log P values are shown in red circles as a scatter plot. The predictions are shown in a red line, and the orange wide range shows the prediction intervals (PIs). Panels A through D show predictions from models trained on DB1 through DB4, respectively. In all cases, data is sorted according to the predicted log P values.
Fig. 4
The best fit lines for prediction sets. Experimental versus predicted values are shown as red circles in a scatter plot. The actual fit line is shown in orange. The dashed blue curve shows the best fit line. A) predictions using DB1, B) predictions using DB2, C) predictions using DB3, and D) predictions using the DB4 training set.
Fig. 5
The log P predictions from our submissions to the SAMPL7 challenge. The orange line shows the experimental values. The ClassicalGSG predictions are shown as circles (DB1: blue, DB2: orange, DB3: green, DB4: red). The thick orange area shows the MAE interval of 0.44, which is the lowest MAE of our submitted predictions (ClassicalGSG-DB2). Molecules are labeled with their molecule ID from SAMPL7 [27].
Fig. 6
Results of ClassicalGSG models trained using open-source force field parameters. Error bars are computed over five independently trained models. These models are trained using the 2D structure information and using all the scattering moments with the maximum wavelet scale (J) of 4. For each set of ClassicalGSG models trained using these force field parameters we show A) the average RMSE, and B) the average r².
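
To illustrate what "scattering moments" with a maximum wavelet scale J can look like, the following is a simplified sketch of geometric scattering on a single graph with one attribute per atom. It assumes the common formulation based on a lazy random walk matrix and dyadic diffusion wavelets, and it computes only zero-order and first-order moments. It is not the ClassicalGSG implementation: the function name, the choice of moment orders `qs`, and the toy graph are all assumptions made for this example.

```python
import numpy as np

def scattering_moments(adj, x, J=4, qs=(1, 2, 3, 4)):
    """Zero- and first-order scattering moments of node signal x on a graph.

    adj : (n, n) symmetric adjacency matrix (e.g. bonded atoms)
    x   : (n,) one attribute per node (e.g. a partial charge)
    """
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(x, dtype=float)
    n = adj.shape[0]

    deg = adj.sum(axis=0)
    deg[deg == 0] = 1.0                    # guard against isolated nodes
    # Lazy random walk matrix P = (I + A D^-1) / 2; dividing by deg scales columns.
    P = 0.5 * (np.eye(n) + adj / deg)

    # Dyadic powers: powers[j] holds P^(2^j) for j = 0..J.
    powers = [P]
    for _ in range(J):
        powers.append(powers[-1] @ powers[-1])

    # Wavelets Psi_j = P^(2^(j-1)) - P^(2^j) for j = 1..J.
    wavelets = [powers[j - 1] - powers[j] for j in range(1, J + 1)]

    # Zero-order moments: sums of powers of the raw signal.
    features = [np.sum(x ** q) for q in qs]
    # First-order moments: sums of powers of |Psi_j x| at each scale.
    for psi in wavelets:
        response = np.abs(psi @ x)
        features.extend(np.sum(response ** q) for q in qs)
    return np.array(features)

# Toy 3-node path graph with hypothetical per-atom values.
toy_adj = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]])
toy_x = np.array([-0.4, 0.1, 0.3])
toy_features = scattering_moments(toy_adj, toy_x)
print(toy_features.shape)
```

Because every feature is a sum over nodes, the resulting vector has a fixed length regardless of molecule size and does not depend on atom ordering, which is what makes this kind of representation usable as neural network input.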

References

    1. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ, Advanced Drug Delivery Reviews 23(1–3), 3 (1997)
    2. Noble A, Journal of Chromatography A 642(1–2), 3 (1993)
    3. Paschke A, Neitzel PL, Walther W, Schüürmann G, Journal of Chemical & Engineering Data 49(6), 1639 (2004)
    4. Sicbaldi F, Del Re AA, in Reviews of Environmental Contamination and Toxicology (Springer, 1993), pp. 59–93
    5. Kajiya K, Ichiba M, Kuwabara M, Kumazawa S, Nakayama T, Bioscience, Biotechnology, and Biochemistry 65(5), 1227 (2001)
