2014 Sep 1;4(5):468-481.
doi: 10.1002/wcms.1183.

Machine learning methods in chemoinformatics

Free PMC article

John B O Mitchell. Wiley Interdiscip Rev Comput Mol Sci.

Abstract

Machine learning algorithms are generally developed in computer science or adjacent disciplines and find their way into chemical modeling by a process of diffusion. Though particular machine learning methods are popular in chemoinformatics and quantitative structure-activity relationships (QSAR), many others exist in the technical literature. This discussion is methods-based and focused on some algorithms that chemoinformatics researchers frequently use. It makes no claim to be exhaustive. We concentrate on methods for supervised learning, predicting the unknown property values of a test set of instances, usually molecules, based on the known values for a training set. Particularly relevant approaches include Artificial Neural Networks, Random Forest, Support Vector Machine, k-Nearest Neighbors and naïve Bayes classifiers.

Figures

Figure 1
We can conceive of chemoinformatics as a two-part problem: encoding chemical structure as features, and mapping the features to the output property. The second of these is most often the province of machine learning.
Figure 2
Five illustrative decision trees forming a (very small) Random Forest for classification. The terminal leaf nodes are shown as squares and colored red or green according to class. The path taken through each tree by a query instance is shown in orange. Trees A, B, C, and E predict that the instance belongs to the red class, tree D dissenting, so that the Random Forest will assign it to the red class by a 4–1 majority vote.
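The majority vote described in this caption can be sketched in a few lines of Python. This is only an illustration of the voting step, not of how the trees are grown; the function name `majority_vote` is hypothetical, and tree D's dissenting vote is assumed to be for the green class, since the caption mentions only two classes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the winning class label and the per-class vote tally."""
    tally = Counter(tree_predictions)
    winner, _ = tally.most_common(1)[0]
    return winner, dict(tally)

# Votes of trees A-E as described in the caption
# (assumption: tree D's dissenting vote is "green").
votes = {"A": "red", "B": "red", "C": "red", "D": "green", "E": "red"}

label, tally = majority_vote(votes.values())
print(label, tally)  # red {'red': 4, 'green': 1}
```

With an odd number of trees and two classes, a tie cannot occur; a practical implementation would need a tie-breaking rule for other cases.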
Figure 3
Illustration of a kNN classification model. For k = 1, the model will classify the blue query instance as a member of the red class; for k = 3, it will again be assigned to the red class, this time by a 2–1 vote; however, since the fourth and fifth nearest neighbors are both green, a k = 5 model would classify it as part of the green class by a 3–2 majority.
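The dependence of the prediction on k can be reproduced with a minimal kNN sketch. The coordinates below are hypothetical, chosen only so that the neighbor ordering matches the caption (nearest neighbor red, second and third split between the classes, fourth and fifth green); Euclidean distance is likewise an assumption, as the caption does not name a metric.

```python
import math
from collections import Counter

def knn_classify(query, training, k):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training set: (coordinates, class label).
training = [
    ((0.5, 0.0), "red"),    # nearest neighbor
    ((1.0, 0.0), "green"),  # second nearest
    ((0.0, 1.5), "red"),    # third nearest
    ((2.0, 0.0), "green"),  # fourth nearest
    ((0.0, 2.5), "green"),  # fifth nearest
    ((3.0, 3.0), "red"),    # farther away, never consulted for k <= 5
]
query = (0.0, 0.0)

for k in (1, 3, 5):
    print(k, knn_classify(query, training, k))
# 1 red
# 3 red
# 5 green
```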
Figure 4
Design of a cross-validation exercise, here shown for eight-fold cross-validation. The identities of the six training, one test, and one internal validation folds are cyclically permuted.
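The cyclic permutation of fold roles can be sketched as follows. The caption does not say which fold serves as the internal validation fold in each round, so this sketch assumes it is the fold immediately after the test fold; the function name `cyclic_cv_splits` is hypothetical.

```python
def cyclic_cv_splits(n_folds=8):
    """Yield (training_folds, validation_fold, test_fold) for each rotation.

    Assumption: the internal validation fold is the one that follows
    the test fold, wrapping around at the end.
    """
    for i in range(n_folds):
        test = i
        validation = (i + 1) % n_folds
        train = [f for f in range(n_folds) if f not in (test, validation)]
        yield train, validation, test

splits = list(cyclic_cv_splits(8))
for train, validation, test in splits:
    print(f"train={train} validation={validation} test={test}")
```

Over the eight rotations each fold serves exactly once as the test fold and once as the internal validation fold, and is used for training in the remaining six.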
