PLoS One. 2014 Sep 30;9(9):e107477.
doi: 10.1371/journal.pone.0107477. eCollection 2014.

Annotated chemical patent corpus: a gold standard for text mining

Saber A Akhondi et al. PLoS One. 2014.

Abstract

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
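The abstract does not say how the harmonized annotations or the inter-annotator agreement scores were computed. As an illustration only, the sketch below assumes one common approach: a majority vote over exact-match annotations for harmonization, and a pairwise exact-match F-score for agreement. The function names, the `(start, end, type)` representation, and all offsets are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations

# Hypothetical representation: (start_offset, end_offset, entity_type).
Annotation = tuple[int, int, str]


def harmonize(groups: list[set[Annotation]], min_votes: int = 2) -> set[Annotation]:
    """Keep annotations that at least `min_votes` groups marked identically.

    Assumption: majority voting on exact span and type matches.
    """
    votes = Counter(ann for group in groups for ann in group)
    return {ann for ann, n in votes.items() if n >= min_votes}


def f_score(a: set[Annotation], b: set[Annotation]) -> float:
    """Exact-match F-score between two annotation sets (symmetric in a and b)."""
    if not a and not b:
        return 1.0  # two empty sets agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def pairwise_agreement(groups: list[set[Annotation]]) -> dict[tuple[int, int], float]:
    """F-score for every unordered pair of annotator groups, keyed by index pair."""
    return {
        (i, j): f_score(groups[i], groups[j])
        for i, j in combinations(range(len(groups)), 2)
    }


# Made-up annotations from three hypothetical annotator groups on one patent.
group_a = {(0, 7, "Chemical"), (20, 28, "Disease")}
group_b = {(0, 7, "Chemical"), (20, 28, "Disease"), (40, 45, "Target")}
group_c = {(0, 7, "Chemical"), (40, 45, "Target"), (50, 55, "Chemical")}

harmonized = harmonize([group_a, group_b, group_c])
agreement = pairwise_agreement([group_a, group_b, group_c])
```

In this toy data the annotation `(50, 55, "Chemical")` is marked by only one group, so it is left out of the harmonized set. A real evaluation might also count partial span overlaps or ignore entity types, which this sketch does not attempt.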

Conflict of interest statement

Competing Interests: AstraZeneca is a global biopharmaceutical company specialising in the discovery, development, manufacturing and marketing of prescription medicines. At the time of the study SM and CT were affiliated to AstraZeneca. NextMove Software Ltd. develops and sells commercial software for text mining. RS and DL were affiliated to NextMove Software Ltd. GVK Biosciences Private Limited (GVK BIO) is a discovery research and development organization that provides services across the R&D and manufacturing value chain. SARPJ, AKM and KB were affiliated to GVK BIO. The authors confirm that the above statement does not alter their adherence to all PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

Figure 1. Example patent text with pre-annotations as shown by the Brat annotation tool.
