Review
Digit Discov. 2023 Jul 1;2(4):897-908.
doi: 10.1039/d3dd00044c. eCollection 2023 Aug 8.

Recent advances in the self-referencing embedded strings (SELFIES) library

Alston Lo et al. Digit Discov.

Abstract

String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencing embedded strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation called selfies. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints, and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of selfies, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of selfies (version 2.1.1) in this manuscript. Our library, selfies, is available at GitHub (https://github.com/aspuru-guzik-group/selfies).
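As a minimal sketch of the translation the abstract describes, the following converts a SMILES string to SELFIES and back using the library's `encoder` and `decoder` functions. It assumes the `selfies` package is installed; the guarded import and the `roundtrip` helper are illustrative additions, not part of the library.

```python
# Assumption: the `selfies` package is installed (e.g. via `pip install selfies`).
# The import is guarded only so that this sketch stays runnable without it.
try:
    import selfies as sf
except ImportError:
    sf = None


def roundtrip(smiles):
    """Translate SMILES -> SELFIES -> SMILES.

    Hypothetical helper for illustration. Returns a (selfies, smiles) pair,
    or None when the selfies package is not available.
    """
    if sf is None:
        return None
    selfies_string = sf.encoder(smiles)       # SMILES -> SELFIES
    smiles_again = sf.decoder(selfies_string)  # SELFIES -> SMILES
    return selfies_string, smiles_again


result = roundtrip("C1=CC=CC=C1")  # benzene
if result is not None:
    print("SELFIES:", result[0])
    print("SMILES: ", result[1])
```

The decoded SMILES need not be character-for-character identical to the input, since one molecule can be written as several equivalent SMILES strings.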

Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. For a fixed alphabet, 1000 SELFIES strings were generated by uniformly sampling L symbols from that alphabet. We then plot the size distribution of the resulting molecules for varying symbol lengths L. (a) The alphabet is taken to be the 69 symbols returned by get_semantic_robust_alphabet() under the default semantic constraints. (b) We filter the alphabet in (a) to 19 symbols by removing all atom symbols [βα] where β ∈ {=, #} or ν(type(α)) = 1, and removing all branch and ring symbols except for [Branch1] and [Ring1]. This decreases the chance that the SELFIES derivation process terminates early, so the derived molecules are larger. (c) The time taken to translate each batch of random SELFIES strings to SMILES using decoder(), averaged over 20 replicate trials.
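The sampling procedure in the caption can be sketched with the standard library alone. The toy alphabet, the helper names, and the seed below are assumptions made for illustration; the alphabets used in the figure (69 and 19 symbols) come from the selfies library and are not reproduced here.

```python
import random

# Hypothetical toy alphabet, for illustration only. The figure instead uses
# the symbols returned by get_semantic_robust_alphabet() in selfies.
TOY_ALPHABET = ["[C]", "[N]", "[O]", "[Branch1]", "[Ring1]"]


def sample_selfies(alphabet, length, rng):
    """Concatenate `length` symbols drawn uniformly, with replacement."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def sample_batch(alphabet, length, batch_size=1000, seed=0):
    """Build a batch of random symbol strings, reproducible via `seed`."""
    rng = random.Random(seed)
    return [sample_selfies(alphabet, length, rng) for _ in range(batch_size)]


batch = sample_batch(TOY_ALPHABET, length=10)
print(len(batch), "strings; first:", batch[0])
```

With selfies installed, `sorted(sf.get_semantic_robust_alphabet())` could stand in for the toy alphabet, and each sampled string could be passed to `sf.decoder` to obtain the molecule whose size is measured.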
Fig. 2. The roundtrip translation time of 1000 randomly-sampled SMILES strings from the DTP open compound collection as a function of size, measured in number of atoms.
