Review
Patterns (N Y). 2022 Oct 14;3(10):100588.
doi: 10.1016/j.patter.2022.100588.

SELFIES and the future of molecular string representations

Mario Krenn et al. Patterns (N Y).

Abstract

Artificial intelligence (AI) and machine learning (ML) are increasingly applied to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, and the design of new molecules. For each of these tasks, the machine needs to read and write fluently in a chemical language. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings. Most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded strings (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
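The claim that most symbol combinations are invalid can be illustrated with a deliberately permissive toy check. The sketch below is not a SMILES parser: it only verifies that branch parentheses are balanced and that ring-closure digits come in pairs, and it ignores valence, aromaticity, and bond placement. The function names, the eight-character alphabet, and the string length are assumptions chosen for illustration.

```python
import random


def looks_syntactically_valid(s: str) -> bool:
    """Toy check only: balanced branches and paired ring-closure digits."""
    depth = 0            # current nesting level of branch parentheses
    open_rings = set()   # ring-closure digits that are opened but not yet closed
    prev = ""
    for ch in s:
        if ch == "(":
            if prev in ("", "("):      # a branch must follow some symbol
                return False
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0 or prev == "(":   # unmatched close or empty branch
                return False
        elif ch.isdigit():
            if prev in ("", "("):      # a ring digit must follow some symbol
                return False
            if ch in open_rings:
                open_rings.remove(ch)  # second occurrence closes the ring
            else:
                open_rings.add(ch)     # first occurrence opens the ring
        prev = ch
    return depth == 0 and not open_rings


def fraction_valid(n_samples=10_000, length=8, alphabet="CNO=()12", seed=0):
    """Fraction of uniformly random strings that pass the toy check."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        hits += looks_syntactically_valid(candidate)
    return hits / n_samples
```

Even under this loose test, most random strings over the assumed alphabet fail, because unpaired ring digits and unbalanced parentheses are common. A real SMILES parser applies further rules and would reject more.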

Figures

Figure 1
Special-purpose typewriter for chemistry. (A) Typical tape obtained with the Army Chemical Typewriter (ACT) built by members of the Walter Reed Army Institute of Research. (B) The ACT, a mechanical typewriter for the encoding of chemical structures. (C) Typed characters from the ACT. Image from Feldman et al.
Figure 2
Molecular string representations. (A–C) Derivation of established string representations, (A) SMILES, (B) DeepSMILES, and (C) InChI, from molecular structures using 3,4-methylenedioxymethamphetamine (MDMA) as an example. Branches and ring closures are represented by specific syntax based on the main path (orange). (D) Derivation of a SELFIES string from the molecular structure, building on the corresponding derivation rules.
Figure 3
Decoding points from the internal representation (latent space) of a variational autoencoder (VAE). Green stands for valid and blue for invalid molecules. The left image shows a VAE trained on SMILES strings; most of its latent space represents invalid molecular strings. The right image shows the latent space of a VAE trained with SELFIES, where every point stands for a physically meaningful molecule. Figure from Krenn et al.
Figure 4
Graphical representation of the mapping from strings to their corresponding structures. SMILES maps to general structures that include molecules but also non-molecular graphs or invalid (non-graph) structures. InChI maps to the same space, although in a unique, bijective way. DeepSMILES maps strings to general graphs, not all of which stand for molecular graphs. Finally, SELFIES is the only representation that maps in a surjective way only to molecular graphs.
Figure 5
Schematic of BigSMILES representations from Lin et al. Polymers are represented as monomers (repeating units) enclosed within curly brackets; the curly brackets indicate that the molecule is a stochastic object. The monomers are represented as SMILES strings, with additional information expressing the connectivity between monomeric units.
Figure 6
Nets for representing crystals. (A) Crystal structure of graphene (2D honeycomb lattice). (B) 2D carbon structure of an orthorhombic lattice. The structures are two different faithful 2D embeddings of the same underlying net. This shows that a net, unlike its real-space realization, does not bear spatial information (e.g., bond lengths, coordinates). Inspired by the success of SELFIES in representing finite molecular graphs, in the section "Net and quotient graph," we discuss how SELFIES can be extended to represent crystal nets.
Figure 7
Construction of the labeled quotient graph (LQG) for the underlying net of graphene. (A) Embed the net corresponding to graphene. (B) Define a coordinate system with two basis vectors (solid arrows) and an origin in the (0, 0) cell encompassed by solid lines. Index cells by their positions relative to the cell containing the origin. (C) Group bonds into three bond classes (black, blue, green) by translational invariance. (D) The result is the LQG. The label of (0, 0) bonds is dropped by convention.
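The LQG construction described in this caption can be sketched as a small data structure. The encoding below is an assumption for illustration: the two atoms of the graphene unit cell are named "A" and "B", and the two non-trivial bond classes carry the cell offsets (1, 0) and (0, 1). The actual labels depend on the basis vectors and origin chosen in the figure.

```python
# Labeled quotient graph of the honeycomb net (assumed labeling):
# each entry is (source vertex, target vertex, cell offset of the target).
LQG = [
    ("A", "B", (0, 0)),  # bond inside the reference cell
    ("A", "B", (1, 0)),  # bond into the neighboring cell along the first basis vector
    ("A", "B", (0, 1)),  # bond into the neighboring cell along the second basis vector
]


def neighbors(vertex, cell, lqg=LQG):
    """Return the (vertex, cell) pairs bonded to `vertex` sitting in `cell`."""
    i, j = cell
    found = []
    for u, v, (di, dj) in lqg:
        if u == vertex:
            # Follow the labeled edge forward: add the offset.
            found.append((v, (i + di, j + dj)))
        if v == vertex:
            # Follow the labeled edge backward: subtract the offset.
            found.append((u, (i - di, j - dj)))
    return found
```

Calling `neighbors("A", (0, 0))` or `neighbors("B", (0, 0))` returns three bonded sites in each case, which matches the threefold coordination of the honeycomb net. The three labeled edges carry connectivity only; no bond lengths or coordinates are stored.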
Figure 8
Examples of molecules with complicated bonds. (A) Different structural representations for diborane (B2H6), where 1 properly accounts for the symmetrical B2H6 “diamond core” but gives an incorrect valence electron (VE) count; 2 uses zero-order bonds, indicated as dashed lines, to preserve the VE count but features a molecular symmetry that is too low; 3 attempts to capture the actual three-center two-electron (3c-2e) bonding by use of arced “banana bonds” but cannot be used in molecular graph approaches, which only allow for each edge to connect two nodes (atoms); and 4 shows the full delocalization of an electron pair over the B–H–B unit. (B) Lewis structures of ferrocene (C10H10Fe), where 5 is unfortunately used by PubChem but is wrong, as the compound is not ionic. 6 and 7 cannot account for the 1H and 13C NMR spectra, both of which feature only one singlet, indicative of ten chemically equivalent CH units. Only 8 is fully in line with crystallographic and spectroscopic data but at the expense of making electron counting impossible.
Figure 9
More examples of molecules with complicated bonds. (A and B) Examples of (A) helical and (B) axial chirality in organic compounds. (C) Diastereomeric coordination compounds: cisplatin is an approved anticancer drug, while its isomer transplatin is inactive. (D) Helical chirality in metal complexes.
Figure 10
Current possibilities to represent molecules with complicated bonds (here ferrocene). Top left: bond-agnostic edges neglect some physical constraints and can be written as SMILES or a graph. Top right: separation of σ- and π-electron systems. Bottom left: Dietz representation. Bottom right: zero-order bonds.
Figure 11
An example of a molecular transformer, which uses SMILES to represent and transform reactant and agent molecules into the product of the reaction, as used by Schwaller et al. The tokenization of the SMILES is shown by the bold characters separated with spaces.
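The space-separated tokenization mentioned in this caption can be sketched with a regular expression. The pattern below is a simplified assumption for illustration, not necessarily the exact tokenizer used by Schwaller et al.: it recognizes bracket atoms, the two-letter halogens Cl and Br, single-letter atoms, two-digit ring closures, ring-closure digits, and common bond and branch symbols.

```python
import re

# Simplified token pattern (an assumption): longer alternatives come first so
# that "Cl" and "Br" are not split into "C" + "l" or "B" + "r".
TOKEN_RE = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH3+]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|[BCNOSPFI]"           # single-letter aliphatic atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring closures such as %10
    r"|\d"                   # single-digit ring closures
    r"|[=#\-+\\/:~@.()>*$]"  # bonds, branches, and other single symbols
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is left over."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens
```

For example, `" ".join(tokenize("CC(=O)Cl"))` gives `C C ( = O ) Cl`, the kind of space-separated sequence a sequence-to-sequence model consumes.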
Figure 12
In most cases, the changes happening during the reaction affect only a small fraction of the molecule, and everything else is left unchanged. However, current representations, like reaction SMILES, do not capture that, and major parts of the molecules are actually repeated. In contrast, condensed graphs of reaction (CGRs) represent the bond changes in the reactions. To generate a CGR from a reaction SMILES, the atom mapping has to be determined first. Agents and conditions are not shown in the figure.
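The idea of a CGR can be sketched in a few lines once an atom mapping is available. The data layout below is an assumption for illustration: each side of the reaction is a dictionary from a pair of atom-map numbers to a bond order, and a missing bond is treated as order 0. The function names are hypothetical.

```python
def condensed_graph(reactant_bonds, product_bonds):
    """Merge two atom-mapped bond tables into one table of (order_before, order_after)."""
    cgr = {}
    for bond in set(reactant_bonds) | set(product_bonds):
        before = reactant_bonds.get(bond, 0)  # 0 means no bond on the reactant side
        after = product_bonds.get(bond, 0)    # 0 means no bond on the product side
        cgr[bond] = (before, after)
    return cgr


def changed_bonds(cgr):
    """Keep only the edges whose bond order differs between reactants and products."""
    return {bond: orders for bond, orders in cgr.items() if orders[0] != orders[1]}


# Hypothetical mapped example: bromoethane + hydroxide -> ethanol + bromide.
# Atom-map numbers: 1 = CH3 carbon, 2 = CH2 carbon, 3 = Br, 4 = O.
reactants = {(1, 2): 1, (2, 3): 1}
products = {(1, 2): 1, (2, 4): 1}
cgr = condensed_graph(reactants, products)
```

Here `changed_bonds(cgr)` contains only the broken C–Br bond and the formed C–O bond, while the unchanged C–C bond appears once in `cgr` as `(1, 1)` instead of being repeated on both sides of the reaction.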
Figure 13
Graphs can be represented in numerous ways, for example using images, adjacency matrices, or strings. All of them are graph representations. By relating string-based representations to programming languages, we show that they are in general the most expressive representations. For SELFIES, B1 and R1 are abbreviations for Branch1 and Ring1, respectively.
Figure 14
Pasithea, the DeepDreaming generative model. While the model continuously decreases the loss, the molecule changes in discrete steps. The target property was logP of the molecule. The network is able to increase or decrease the molecular property almost steadily, which indicates a certain “understanding” of the representation. Image from Shen et al.
Figure 15
DECIMER and STOUT. A framework for translating images or strings to SMILES. Experiments show that the application of SELFIES as an intermediate representation improves the results, which indicates that ML models find it easier to read and write SELFIES compared with SMILES. These indications are surprising because it is not clear how the model exploits the robustness of SELFIES to improve results. Image from Rajan et al.
