Annotated chemical patent corpus: a gold standard for text mining

Affiliations

¹ Department of Medical Informatics, Erasmus University Medical Centre, Rotterdam, The Netherlands.
² Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Fraunhofer-Gesellschaft, Sankt Augustin, Germany.
³ RIA Medicinal Chemistry, AstraZeneca R&D Mölndal, Mölndal, Sweden.
⁴ GVK Biosciences Private Limited, Hyderabad, India.
⁵ NextMove Software Ltd, Cambridge, England.
⁶ Chemistry Innovation Centre, AstraZeneca R&D Mölndal, Mölndal, Sweden.

PMID: 25268232
PMCID: PMC4182036
DOI: 10.1371/journal.pone.0107477

Saber A Akhondi et al. PLoS One. 2014 Sep 30;9(9):e107477. doi: 10.1371/journal.pone.0107477. eCollection 2014.

Abstract

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
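
The abstract does not say how the agreement scores or the harmonized set were computed. As a rough illustration only, the sketch below assumes annotations are stored in brat standoff format, scores agreement between two groups as an F-score over exactly matching (type, start, end) spans, and harmonizes by keeping spans marked by at least two groups. The entity labels, offsets, the `min_votes` threshold, and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def parse_brat(ann_text):
    """Return the set of (type, start, end) triples from brat standoff text."""
    spans = set()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # only text-bound annotations carry entity spans
        _id, header, _text = line.split("\t", 2)
        label, offsets = header.split(" ", 1)
        # Discontinuous spans look like "0 5;16 23"; keep the outer bounds.
        numbers = [int(n) for part in offsets.split(";") for n in part.split()]
        spans.add((label, min(numbers), max(numbers)))
    return spans


def f_score(a, b):
    """Pairwise agreement: F-score over exactly matching spans (assumed metric)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def harmonize(groups, min_votes=2):
    """Keep spans annotated by at least min_votes groups (assumed rule)."""
    votes = Counter(span for group in groups for span in group)
    return {span for span, count in votes.items() if count >= min_votes}


# Toy annotations from three hypothetical annotator groups for one document.
group_a = "T1\tChemical 0 7\taspirin\nT2\tDisease 17 25\theadache\n"
group_b = "T1\tChemical 0 7\taspirin\nT2\tTarget 30 35\tCOX-1\n"
group_c = (
    "T1\tChemical 0 7\taspirin\n"
    "T2\tDisease 17 25\theadache\n"
    "T3\tTarget 30 35\tCOX-1\n"
)

a, b, c = (parse_brat(text) for text in (group_a, group_b, group_c))
harmonized = harmonize([a, b, c], min_votes=2)
print("agreement A vs B:", f_score(a, b))
print("harmonized spans:", sorted(harmonized))
```

Exact span matching is the strictest choice; a relaxed variant that counts overlapping spans as matches would give higher scores on the same data.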

Conflict of interest statement

Competing Interests: AstraZeneca is a global biopharmaceutical company specialising in the discovery, development, manufacturing and marketing of prescription medicines. At the time of the study SM and CT were affiliated with AstraZeneca. NextMove Software Ltd. develops and sells commercial software for text mining. RS and DL were affiliated with NextMove Software Ltd. GVK Biosciences Private Limited (GVK BIO) is a discovery research and development organization that provides services across the R&D and manufacturing value chain. SARPJ, AKM and KB were affiliated with GVK BIO. The authors confirm that the above statement does not alter their adherence to all PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

**Figure 1**
Example patent text with pre-annotations as shown by the Brat annotation tool.
