J Cheminform. 2011 May 16;3(1):17.
doi: 10.1186/1758-2946-3-17.

ChemicalTagger: A tool for semantic text-mining in chemistry

Lezan Hawizy et al. J Cheminform.

Abstract

Background: The primary method for scientific communication is in the form of published scientific articles and theses, which use natural language combined with domain-specific terminology. As such, they contain free-flowing unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches.

Results: We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regex and English taggers to identify parts-of-speech. The ANTLR grammar is used to structure this into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names).
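The layered tagging described above can be illustrated with a minimal Python sketch. It is not ChemicalTagger's implementation: the lexicon stands in for OSCAR, the regex rules and the fallback dictionary stand in for the domain-specific and English taggers, the tag names and precedence order are assumptions, and the phrase grouper is a much simpler substitute for an ANTLR grammar.

```python
import re

# Hypothetical stand-in for a chemical entity recogniser such as OSCAR.
CHEMICAL_LEXICON = {"ethanol", "toluene", "water", "thf"}

# Hypothetical domain-specific regex rules, tried in order: (pattern, tag).
REGEX_RULES = [
    (re.compile(r"^\d+(\.\d+)?$"), "CD"),                       # number
    (re.compile(r"^(mg|g|kg|ml|l|mmol|mol)$", re.I), "NN-UNIT"),  # unit
    (re.compile(r"^(dissolved|stirred|heated|added|filtered)$", re.I),
     "VB-ACTION"),                                              # action verb
    (re.compile(r"^[^\w\s]$"), "PUNCT"),                        # punctuation
]

# Hypothetical minimal fallback in place of a general English POS tagger.
ENGLISH_FALLBACK = {"the": "DT", "a": "DT", "in": "IN", "of": "IN",
                    "was": "VBD", "and": "CC"}


def tokenise(text):
    """Split text into numbers, alphabetic words and single punctuation marks."""
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]", text)


def tag(tokens):
    """Tag each token: chemical lexicon first, then regex rules, then fallback."""
    tagged = []
    for tok in tokens:
        if tok.lower() in CHEMICAL_LEXICON:
            label = "CM"
        else:
            label = next((t for p, t in REGEX_RULES if p.match(tok)), None)
            if label is None:
                # Unknown words default to a generic noun tag.
                label = ENGLISH_FALLBACK.get(tok.lower(), "NN")
        tagged.append((tok, label))
    return tagged


def action_phrases(tagged):
    """Group tokens into phrases that start at an action verb.

    A phrase ends at the next action verb or at punctuation.
    """
    phrases, current = [], None
    for tok, label in tagged:
        if label == "VB-ACTION":
            if current:
                phrases.append(current)
            current = [(tok, label)]
        elif label == "PUNCT":
            if current:
                phrases.append(current)
            current = None
        elif current is not None:
            current.append((tok, label))
    if current:
        phrases.append(current)
    return phrases
```

Calling `action_phrases(tag(tokenise("The solid was dissolved in 10 ml ethanol.")))` yields a single phrase that begins with the action verb and ends with the chemical mention. Giving the chemical lexicon precedence over the generic tagger is the design choice that keeps chemical names from being mislabelled as ordinary nouns.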

Conclusions: It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed for over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.
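To make "identifying solvents from their linguistic context" and the precision figure concrete, here is a small Python sketch. The rule (a chemical mention following "in" or "with" after a dissolving-type verb), the verb list and the tag names `CM` and `VB-ACTION` are all illustrative assumptions, not the rules ChemicalTagger uses; precision is the standard ratio of correct predictions to all predictions.

```python
# Hypothetical verbs whose "in"/"with" complement is treated as a solvent.
SOLVENT_VERBS = {"dissolved", "suspended", "diluted"}


def find_solvents(tagged):
    """Return chemical tokens that follow '<solvent verb> ... in/with'.

    `tagged` is a list of (token, tag) pairs; "CM" marks a chemical
    mention and "VB-ACTION" an action verb (assumed tag names).
    """
    solvents = []
    verb_seen = False   # the most recent action verb takes a solvent
    expecting = False   # "in"/"with" has been seen after that verb
    for tok, label in tagged:
        low = tok.lower()
        if label == "VB-ACTION":
            verb_seen = low in SOLVENT_VERBS
            expecting = False
        elif verb_seen and low in {"in", "with"}:
            expecting = True
        elif expecting and label == "CM":
            solvents.append(tok)
            verb_seen = expecting = False
    return solvents


def precision(predicted, gold):
    """Fraction of distinct predicted items that appear in the gold set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(gold)) / len(predicted)


example = [
    ("The", "DT"), ("amine", "CM"), ("was", "VBD"),
    ("dissolved", "VB-ACTION"), ("in", "IN"), ("10", "CD"),
    ("ml", "NN-UNIT"), ("ethanol", "CM"), ("and", "CC"),
    ("stirred", "VB-ACTION"), ("with", "IN"), ("bromide", "CM"),
]
found = find_solvents(example)
```

In this example only "ethanol" is returned: "amine" precedes the verb, and "bromide" follows a verb that is not in the solvent-verb list. Restricting the role assignment to narrow contexts like this trades recall for precision.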

Figures

Figure 1. Tokenisation.
Figure 2. OSCAR Tagging.
Figure 3. Regex Tagging.
Figure 4. English POS Tagging.
Figure 5. Basic English Syntax Tree. http://en.wikipedia.org/wiki/File:Basic_english_syntax_tree.svg
Figure 6. AST Output of ANTLR Parse.
Figure 7. Action Phrase Markup.
Figure 8. Graph of Reaction Paths.
