Nat Commun. 2020 Jul 17;11(1):3601.
doi: 10.1038/s41467-020-17266-6.

Automated extraction of chemical synthesis actions from experimental procedures

Alain C Vaucher et al. Nat Commun.

Abstract

Experimental procedures for chemical synthesis are commonly reported in prose in patents or in the scientific literature. The extraction of the details necessary to reproduce and validate a synthesis in a chemical laboratory is often a tedious task requiring extensive human intervention. We present a method to convert unstructured experimental procedures written in English to structured synthetic steps (action sequences) reflecting all the operations needed to successfully conduct the corresponding chemical reactions. To achieve this, we design a set of synthesis actions with predefined properties and a deep-learning sequence-to-sequence model based on the transformer architecture to convert experimental procedures to action sequences. The model is pretrained on vast amounts of data generated automatically with a custom rule-based natural language processing approach and refined on manually annotated samples. Predictions on our test set result in a perfect (100%) match of the action sequence for 60.8% of sentences, a 90% match for 71.3% of sentences, and a 75% match for 82.4% of sentences.
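
The threshold-style figures at the end of the abstract (share of sentences reaching a 100%, 90%, or 75% match) can be illustrated with a minimal sketch. The abstract does not define how the match score is computed, so the `similarity` function below is an assumption: it uses Python's `difflib.SequenceMatcher` as a stand-in string similarity. The function names and the example action strings are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def similarity(predicted: str, truth: str) -> float:
    """Stand-in similarity in [0, 1].

    Assumption: the abstract does not say how a "90% match" is scored,
    so difflib's ratio is used here purely for illustration.
    """
    return SequenceMatcher(None, predicted, truth).ratio()


def fraction_at_least(pairs, threshold: float) -> float:
    """Fraction of (predicted, truth) pairs whose similarity is >= threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, truth in pairs if similarity(predicted, truth) >= threshold)
    return hits / len(pairs)


# Hypothetical action sequences written as plain strings.
example_pairs = [
    ("ADD water; STIR for 1 h", "ADD water; STIR for 1 h"),   # identical
    ("ADD water; STIR for 1 h", "ADD water; STIR for 2 h"),   # one character differs
    ("FILTER", "CONCENTRATE"),                                # largely different
]

for threshold in (1.0, 0.9, 0.75):
    share = fraction_at_least(example_pairs, threshold)
    print(f"match >= {threshold:.0%}: {share:.1%} of sentences")
```

Reporting several thresholds side by side, as the abstract does, shows how many predictions are nearly right rather than only exactly right.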

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Visualization of the correctness of predicted action types.
The action types predicted by the transformer model (labels on the x-axis) are compared to the actual action types of the ground truth (labels on the y-axis). This figure is generated by first counting all the correctly predicted action types (values on the diagonal); these values correspond to the column "Type match" of Table 5. Then, the off-diagonal elements are determined from the remaining (incorrectly predicted) actions. Thereby, the last row and column gather actions that are present only in the predicted set or ground truth, respectively. For clarity, the color scale stops at 10, although many elements (especially on the diagonal) exceed this value.
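
The counting scheme described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the caption does not say how the remaining actions are paired, so the sketch matches action types per sentence as multisets and then pairs the leftovers in sorted order. The `NONE` label, the function name, and the example action types are hypothetical.

```python
from collections import Counter
from itertools import zip_longest

NONE = "(none)"  # hypothetical label for the extra last row and column


def confusion_counts(truth_types, predicted_types):
    """Count (truth type, predicted type) pairs for one sentence.

    Assumption: types are first matched as multisets (diagonal), and the
    leftovers are paired in sorted order; the source does not specify this.
    """
    truth = Counter(truth_types)
    predicted = Counter(predicted_types)

    # Diagonal: action types present in both the ground truth and the prediction.
    matched = truth & predicted
    counts = Counter({(t, t): n for t, n in matched.items()})

    # Off-diagonal: pair whatever is left on each side.
    truth_rest = sorted((truth - matched).elements())
    predicted_rest = sorted((predicted - matched).elements())
    for t, p in zip_longest(truth_rest, predicted_rest, fillvalue=NONE):
        # (t, NONE): only in the ground truth; (NONE, p): only in the prediction.
        counts[(t, p)] += 1
    return counts


# Hypothetical sentence: ground truth has three actions, the prediction two.
example = confusion_counts(["Add", "Stir", "Wash"], ["Add", "Filter"])
```

For this example the sketch records one Add on the diagonal, one Stir confused with Filter, and one Wash that appears only in the ground truth. Summing such counters over all sentences would give the entries of a matrix with ground-truth types as rows and predicted types as columns.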
Fig. 2. Statistics of the Pistachio and annotation datasets.
a Distribution of the number of characters for sentences from Pistachio and from the annotation dataset. b Distribution of the number of actions per sentence. For the Pistachio dataset, this number is computed from the actions extracted by the rule-based model. For sentences from the annotation dataset, this number is determined from the ground truth (hand annotations). c Distribution of action types extracted by the rule-based model on the Pistachio dataset and on the annotated dataset. The action types are ordered by decreasing frequency for the Pistachio dataset. d Distribution of action types determined from hand annotations for the full annotation dataset and its test split. The action types are ordered by decreasing frequency for the full annotation dataset.
Fig. 3. Distribution of action types of the annotation test set.
The action types are ordered by decreasing frequency for the hand annotations.
Fig. 4. Screenshots for adding and editing actions with the annotation framework.
The sentence to annotate is displayed on the left-hand side, with the corresponding pre-annotations on the right-hand side. A Wash action is missing and can be added by clicking on the corresponding button at the top. Also, when clicking on the appropriate button, a new page opens to edit the selected action.