Analyzing Learned Molecular Representations for Property Prediction

Kevin Yang et al. J Chem Inf Model. 2019 Aug 26;59(8):3370-3388. doi: 10.1021/acs.jcim.9b00237. Epub 2019 Aug 13.

Erratum in

  • Correction to Analyzing Learned Molecular Representations for Property Prediction.
    Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R. J Chem Inf Model. 2019 Dec 23;59(12):5304-5305. doi: 10.1021/acs.jcim.9b01076. Epub 2019 Dec 9. PMID: 31814400.

Abstract

Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industrial research settings in comparison to the models already deployed there. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors, as well as previous graph neural architectures, on both public and proprietary data sets. Our empirical findings indicate that while approaches based on learned representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Illustration of bond-level message passing in our proposed D-MPNN. (a) Messages from the orange directed bonds are used to inform the update to the hidden state of the red directed bond. By contrast, in a traditional MPNN, messages are passed between atoms (for example, from atoms 1, 3, and 4 to atom 2) rather than between directed bonds. (b) Similarly, a message from the green bond informs the update to the hidden state of the purple directed bond. (c) Illustration of the update function for the hidden representation of the red directed bond from diagram (a).
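
To make the bond-level update concrete, below is a minimal PyTorch sketch of one D-MPNN message-passing step, following the update rule the figure illustrates: sum the hidden states of the directed bonds feeding into a bond, excluding its reverse bond, apply a learned linear map, and add a skip connection to the initial bond state. The class name and the incoming/reverse adjacency structures are illustrative assumptions, not Chemprop's actual implementation.

    import torch
    import torch.nn as nn

    class DirectedBondMessagePassing(nn.Module):
        """One message-passing step over directed bonds (sketch, not Chemprop's API)."""

        def __init__(self, hidden_size: int):
            super().__init__()
            self.W_m = nn.Linear(hidden_size, hidden_size, bias=False)

        def step(self, h0, h, incoming, reverse):
            # h0, h: (num_directed_bonds, hidden) initial and current bond states.
            # incoming[b]: indices of directed bonds k->v that feed bond b = v->w.
            # reverse[b]: index of the reverse bond w->v, which is excluded so a
            # message never flows straight back along the bond it arrived on.
            messages = torch.zeros_like(h)
            for b in range(h.shape[0]):
                neighbors = [k for k in incoming[b] if k != reverse[b]]
                if neighbors:
                    messages[b] = h[neighbors].sum(dim=0)
            # Skip connection to the initial bond state, then a nonlinearity.
            return torch.relu(h0 + self.W_m(messages))
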
Figure 2
Four example distributions fit to a random sample of 100,000 compounds used for biological screening at Novartis. Note that some distributions for discrete-valued descriptors, such as fr_pyridine, are not fit especially well; this remains an active area for improvement.
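
The caption does not say how the fitted distributions are used downstream; one common use is to normalize a descriptor by mapping raw values through the fitted CDF. The SciPy sketch below illustrates that idea under the assumption of a normal fit; the distribution families actually fit in Figure 2, and their use in the paper, may differ.

    import numpy as np
    from scipy import stats

    # Stand-in for 100,000 sampled values of one descriptor (e.g., MolWt).
    descriptor_values = np.random.lognormal(mean=5.0, sigma=0.3, size=100_000)

    # Fit a normal distribution, then map raw values through its CDF to [0, 1].
    loc, scale = stats.norm.fit(descriptor_values)
    normalized = stats.norm.cdf(descriptor_values, loc=loc, scale=scale)
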
Figure 3
Comparison of our D-MPNN with RDKit features to the best models from Wu et al.
Figure 4
Comparison of our best single model (i.e., optimized hyperparameters and RDKit features) to the model from Mayr et al.
Figure 5
Comparison of our unoptimized D-MPNN against several baseline models. The random forest baseline is omitted on PCBA, MUV, ToxCast, and ChEMBL due to its large computational cost, and on ClinTox due to numerical instability. The D-MPNN significantly outperforms each baseline on at least 8 data sets.
Figure 6
Comparison of our D-MPNN against baseline models on Amgen internal data sets on a chronological data split. The D-MPNN outperforms all of the baselines. Note that for the Amgen data sets only, the ensembles contained 3 models rather than 5, and that RF on Morgan fingerprints and the Mayr et al. FFN were each run only once on RLM.
Figure 7
Comparison of our D-MPNN against baseline models on BASF internal regression data sets on a scaffold data split (higher = better). Our D-MPNN outperforms all baselines.
Figure 8
Comparison of our D-MPNN against baseline models on the Novartis internal regression data set on a chronological data split (lower = better). Our D-MPNN outperforms all baseline models.
Figure 9
Comparison of Amgen’s internal model and our D-MPNN (evaluated using a single run on a chronological split) to experimental error (higher = better). Note that the experimental error is not evaluated on exactly the same time split as the two models, since it can only be measured on molecules that were tested more than once; even so, the difference in performance is striking.
Figure 10
Overlap of molecular scaffolds between the train and test sets for a random or chronological split of four Amgen regression data sets. Overlap is defined as the percent of molecules in the test set that share a scaffold with a molecule in the train set.
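
This overlap metric can be computed with RDKit’s Murcko scaffolds; a minimal sketch follows (the exact scaffold settings used in the paper, e.g., chirality handling, are not stated here).

    from rdkit.Chem.Scaffolds import MurckoScaffold

    def scaffold_overlap(train_smiles, test_smiles):
        """Percent of test molecules whose Murcko scaffold also occurs in the train set."""
        train_scaffolds = {
            MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in train_smiles
        }
        shared = sum(
            MurckoScaffold.MurckoScaffoldSmiles(smiles=s) in train_scaffolds
            for s in test_smiles
        )
        return 100.0 * shared / len(test_smiles)
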
Figure 11
Performance of D-MPNN on four Amgen regression data sets according to three methods of splitting the data (lower = better). The chronological split is significantly harder than both random and scaffold on Sol and hPXR, while the scaffold split is significantly harder than the random split on Sol only.
Figure 12
Performance of D-MPNN on the Novartis regression data set according to three methods of splitting the data (lower = better). The chronological split is significantly harder than the random split while the scaffold split is not.
Figure 13
Performance of D-MPNN on the full (F), core (C), and refined (R) subsets of the PDBbind data set according to three methods of splitting the data (lower = better). The chronological and scaffold splits are significantly harder than the random split in all cases except for the PDBbind-C scaffold split.
Figure 14
Performance of D-MPNN on random and scaffold splits for several public data sets. Only the results on PDBbind-C, HIV, ClinTox, and ChEMBL are not statistically significant.
Figure 15
Comparison of performance of different message passing paradigms.
Figure 16
Effect of adding molecule-level features generated with RDKit to our model.
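
As one way to reproduce this kind of feature augmentation, the sketch below computes RDKit’s built-in 2D descriptors for a molecule; the resulting vector would then be concatenated with the learned representation before the feed-forward layers. The exact descriptor subset and normalization used in the paper may differ.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def rdkit_descriptor_vector(smiles: str) -> np.ndarray:
        # Descriptors.descList is RDKit's list of (name, function) pairs
        # covering its ~200 built-in 2D descriptors.
        mol = Chem.MolFromSmiles(smiles)
        return np.array([fn(mol) for _, fn in Descriptors.descList])

    features = rdkit_descriptor_vector("c1ccncc1")  # pyridine
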
Figure 17
Effect of performing Bayesian hyperparameter optimization on the depth, hidden size, number of fully connected layers, and dropout of the D-MPNN.
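
A minimal sketch of such an optimization with Hyperopt’s TPE algorithm over the four hyperparameters named in the caption; the search ranges and the train_and_validate helper are illustrative placeholders, not the paper’s exact configuration.

    from hyperopt import fmin, hp, tpe

    space = {
        "depth": hp.quniform("depth", 2, 6, 1),
        "hidden_size": hp.quniform("hidden_size", 300, 2400, 100),
        "ffn_num_layers": hp.quniform("ffn_num_layers", 1, 3, 1),
        "dropout": hp.quniform("dropout", 0.0, 0.4, 0.05),
    }

    def objective(params):
        # train_and_validate is a hypothetical helper that trains a D-MPNN
        # with these hyperparameters and returns the validation error.
        return train_and_validate(
            depth=int(params["depth"]),
            hidden_size=int(params["hidden_size"]),
            ffn_num_layers=int(params["ffn_num_layers"]),
            dropout=float(params["dropout"]),
        )

    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
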
Figure 18
An illustration of ensembling models. On the left is a single model, which takes input and makes a prediction. On the right is an ensemble of 3 models. Each model takes the same input and makes a prediction independently, and then the predictions are averaged to generate the ensemble’s prediction.
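
In code, ensembling of this kind reduces to averaging independent predictions. A minimal sketch, assuming model objects that expose a predict method:

    import numpy as np

    def ensemble_predict(models, inputs):
        # Each model predicts independently; the ensemble output is the mean.
        return np.mean([model.predict(inputs) for model in models], axis=0)
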
Figure 19
Effect of using an ensemble of five models instead of a single model.
Figure 20
Effect of data size on the performance of the model from Mayr et al. and of our D-MPNN model (higher = better). All comparisons besides the first are statistically significant.

References

    1. Duvenaud D. K.; Maclaurin D.; Iparraguirre J.; Bombarell R.; Hirzel T.; Aspuru-Guzik A.; Adams R. P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. Advances in Neural Information Processing Systems 2015, 2224–2232.
    2. Wu Z.; Ramsundar B.; Feinberg E.; Gomes J.; Geniesse C.; Pappu A. S.; Leswing K.; Pande V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci. 2018, 9, 513–530. 10.1039/C7SC02664A.
    3. Kearnes S.; McCloskey K.; Berndl M.; Pande V.; Riley P. Molecular Graph Convolutions: Moving Beyond Fingerprints. J. Comput.-Aided Mol. Des. 2016, 30, 595–608. 10.1007/s10822-016-9938-8.
    4. Gilmer J.; Schoenholz S. S.; Riley P. F.; Vinyals O.; Dahl G. E. Neural Message Passing for Quantum Chemistry. Proceedings of the 34th International Conference on Machine Learning 2017, 70, 1263–1272.
    5. Li Y.; Tarlow D.; Brockschmidt M.; Zemel R. Gated Graph Sequence Neural Networks. 2015, arXiv preprint arXiv:1511.05493. https://arxiv.org/abs/1511.05493 (accessed Aug 6, 2019).
