Chem Sci. 2021 Sep 29;12(42):14174-14181.
doi: 10.1039/d1sc01839f. eCollection 2021 Nov 3.

Img2Mol - accurate SMILES recognition from molecular graphical depictions

Djork-Arné Clevert et al. Chem Sci.

Abstract

The automatic recognition of the molecular content of a molecule's graphical depiction is an extremely challenging problem that remains largely unsolved despite decades of research. Recent advances in neural machine translation enable the auto-encoding of molecular structures in a continuous vector space of fixed size (latent representation) with low reconstruction errors. In this paper, we present a fast and accurate model combining deep convolutional neural network learning from molecule depictions and a pre-trained decoder that translates the latent representation into the SMILES representation of the molecules. This combination allows us to precisely infer a molecular structure from an image. Our rigorous evaluation shows that Img2Mol is able to correctly translate up to 88% of the molecular depictions into their SMILES representation. A pretrained version of Img2Mol is made publicly available on GitHub for non-commercial users.
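
The two-stage design described in the abstract (an image encoder producing a fixed-size latent vector, followed by a pretrained decoder producing SMILES) can be sketched as a simple composition. This is only an illustrative sketch, not the published implementation: `image_to_smiles`, `toy_encoder` and `toy_decoder` are hypothetical names, the pixel-grid image type is an assumption, and the toy functions merely stand in for the trained convolutional network and the CDDD decoder. The only value taken from the paper is the 512-dimensional embedding size.

```python
from typing import Callable, List, Sequence

EMBEDDING_DIM = 512  # CDDD embedding size stated in the Fig. 2 caption

# Hypothetical image type: rows of grayscale pixel values in [0, 1].
Image = Sequence[Sequence[float]]
Embedding = List[float]


def image_to_smiles(
    image: Image,
    encoder: Callable[[Image], Embedding],
    decoder: Callable[[Embedding], str],
) -> str:
    """Map a depiction to SMILES via a fixed-size latent vector."""
    embedding = encoder(image)
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    return decoder(embedding)


# Toy stand-ins so the sketch runs; they are NOT the trained networks.
def toy_encoder(image: Image) -> Embedding:
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)] * EMBEDDING_DIM


def toy_decoder(embedding: Embedding) -> str:
    return "CC" if embedding[0] >= 0.5 else "C"


smiles = image_to_smiles([[0.0, 1.0], [1.0, 1.0]], toy_encoder, toy_decoder)
print(smiles)
```

Keeping the decoder behind a plain callable mirrors the point made in the abstract: only the image encoder needs to be trained, while the decoder is reused as-is.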

Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. Example molecular depictions of structural isomerism, that is, the configuration of a molecule. Bonds that (i) lie in the image plane are drawn as regular lines, (ii) angle down beneath the plane are drawn as dashed lines and (iii) angle up out of the plane are drawn as solid wedges.
Fig. 2. Conceptual view of the CDDD autoencoder. An input SMILES string representing a molecule is encoded into a 512-dimensional feature vector (the CDDD embedding). The decoder was trained to produce canonical SMILES from the CDDD embedding.
Fig. 3. Overview of the Img2Mol workflow for molecular optical recognition. The left part is what is newly trained in this work, while the pretrained CDDD decoder is used to obtain canonical SMILES.
Fig. 4. Example depictions of the same structure, all generated from the SMILES CS(O)(O)c1cccc(c2ccc(C3C(CCC(O)c4ccc(F)cc4)C(O)N3c3ccc(F)cc3)cc2)c1. Such variations are randomly generated during training.
Fig. 5. Results for the molecular optical recognition task at varying input resolutions. The left and right panels show the accuracy and the Tanimoto similarity, respectively, as a function of the image resolution. Note that Img2Mol(no aug.) was trained without augmenting the image resolution, and therefore its performance decreases with increasing resolution.
Fig. 6. Panels (A) and (C) show the accuracy and the Tanimoto similarity, respectively, for 512 px resolution images as a function of the number of atoms in the molecule. Panel (B) plots the expected upper bound for the reconstruction accuracy of the Img2Mol network as a function of molecular size. Panel (D) shows the computational wall-clock time for processing 5000 images as a function of image resolution: 255, 274, 450 and 1179 s.
Fig. 7. Images (a and b), (c and d) and (e and f) were taken from ref. , self-drawn and adapted from ChemPix, respectively. Img2Mol is in principle able to recognise simple hand-drawn molecules (d–f) without errors, but introduces errors for more complex, larger molecules (a–c). The dashed red line indicates the incorrectly predicted region of the molecule.
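
The captions of Fig. 5 and Fig. 6 report two metrics, accuracy and Tanimoto similarity. A minimal sketch of how such metrics are commonly computed is given below, under stated assumptions: fingerprints are represented as sets of on-bit indices (in practice they would come from a cheminformatics toolkit, which is not shown), accuracy is taken as exact string match between predicted and reference SMILES, and the function names and example values are hypothetical. The paper's exact evaluation protocol is not described in this excerpt.

```python
from typing import AbstractSet, Sequence


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as on-bit sets."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(union)


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of predictions whose SMILES string equals the reference string."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("no molecules to evaluate")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Made-up illustrative values, not data from the paper.
similarity = tanimoto({1, 4, 7, 9}, {1, 4, 8})  # 2 shared bits out of 5 in the union
accuracy = exact_match_accuracy(
    ["CCO", "c1ccccc1", "CC(=O)O"],
    ["CCO", "c1ccccc1", "CC(O)=O"],
)
print(similarity, accuracy)
```

The third pair in the example encodes the same molecule (acetic acid) written two ways, yet counts as a miss. String comparison is therefore only meaningful when both sides are canonicalised by the same procedure, which is why the decoder described in Fig. 2 is trained to emit canonical SMILES.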
