Database (Oxford). 2014 Jun 10:2014:bau050.

doi: 10.1093/database/bau050. Print 2014.

Web services-based text-mining demonstrates broad impacts for interoperability and process simplification

Thomas C Wiegers¹, Allan Peter Davis², Carolyn J Mattingly²

Affiliations

¹ Department of Biological Sciences, North Carolina State University, 139 David Clark Lab, Campus Box 7617, Raleigh, NC 27695-7617, USA; tcwieger@ncsu.edu, twiegers@mdibl.org.
² Department of Biological Sciences, North Carolina State University, 139 David Clark Lab, Campus Box 7617, Raleigh, NC 27695-7617, USA.

PMID: 24919658
PMCID: PMC4207221
DOI: 10.1093/database/bau050

Thomas C Wiegers et al. Database (Oxford). 2014.

Abstract

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop comprised five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details: rather than integrating NER tools directly, CTD's asynchronous, batch-oriented text-mining pipeline could make HTTP-based calls to remote NER Web services for recognition of specific biological terms, using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer (REST)/BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions of a second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation.

Database URL: http://ctdbase.org/

Figures

**Figure 1.**
Web service-based NER logical design. Under a Web service-based conceptual design, (1) a list of potentially relevant PubMed IDs (PMIDs) is secured via a search of PubMed, typically for a target chemical. (2) The list is processed asynchronously by batch-oriented processes. Rather than performing NER using locally installed NER tools, (3) HTTP calls containing text passages are made to remote Web services; the results of NER are used as a key component in document ranking algorithms. (4) PMIDs are then assigned a document relevancy score (DRS) by the document ranking algorithms.
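
Steps (2) and (3) of this design can be illustrated with a minimal Python sketch, offered as one possible reading rather than CTD's actual pipeline code. It sends a batch of articles to an NER callable concurrently and records a per-article response time. Everything named here is an assumption for illustration: `run_batch`, `timed_call`, the thread pool, and `fake_chemical_ner`, which is a local stand-in for what would really be an HTTP call to a remote Web service. The PMID keys and article texts are made-up placeholders.

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(call_ner, pmid, text):
    """Run one NER call and record its wall-clock response time in seconds."""
    start = time.perf_counter()
    terms = call_ner(text)
    return pmid, terms, time.perf_counter() - start


def run_batch(call_ner, articles, max_workers=4):
    """Send every article to the NER callable concurrently.

    articles maps PMID -> text; the result maps PMID -> (terms, seconds).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(timed_call, call_ner, pmid, text)
            for pmid, text in articles.items()
        ]
        results = {}
        for future in futures:
            pmid, terms, seconds = future.result()
            results[pmid] = (terms, seconds)
    return results


def fake_chemical_ner(text):
    """Hypothetical stand-in for a remote service: whole-word lexicon lookup."""
    lexicon = ("dexfenfluramine", "fenfluramine")
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [term for term in lexicon if term in words]


# Placeholder PMIDs and texts, invented for this example only.
articles = {
    "PMID-A": "Fenfluramine was studied in this placeholder title.",
    "PMID-B": "This placeholder abstract mentions no chemicals.",
}
results = run_batch(fake_chemical_ner, articles)
```

Because the NER step is passed in as a callable, swapping the stand-in for a function that posts text to a remote URL would not change the batch logic, which is the kind of decoupling the design aims for.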

**Figure 2.**
BioC-based high-level inter-process communications. A sample request in BioC format is sent by Web service from the text-mining (TM) pipeline to the NER tool (green arrow). The PubMed ID, title, abstract and designated key file describing the semantics of the data are included within the XML request (left, green box). A chemical-specific response is returned from the NER tool to the TM pipeline (blue arrow). The NER Web service reads the BioC XML and attempts to identify chemicals in the title and abstract. Here, two chemical entities (fenfluramine and dexfenfluramine) are identified as BioC annotation objects for the NER chemical category in the response (right, blue boxes).
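
The request/response exchange described in this caption can be sketched in Python using only the standard library. The element names follow the general BioC layout (`collection`, `document`, `passage`, `infon`, `annotation`), but the key file name `ctd_ner.key`, the `source` value, the helper names and the `"type"` infon convention used here are assumptions for illustration, not details confirmed by the source.

```python
import xml.etree.ElementTree as ET


def build_request(pmid, title, abstract, key_file="ctd_ner.key"):
    """Build a BioC-style XML request for one article.

    key_file is a hypothetical name for the file describing data semantics.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date").text = ""
    ET.SubElement(collection, "key").text = key_file
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = str(pmid)
    offset = 0
    for passage_type, text in (("title", title), ("abstract", abstract)):
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", key="type").text = passage_type
        ET.SubElement(passage, "offset").text = str(offset)
        ET.SubElement(passage, "text").text = text
        # Assumed convention: the next passage starts one character later.
        offset += len(text) + 1
    return ET.tostring(collection, encoding="unicode")


def extract_annotations(response_xml):
    """Return (entity_type, surface_text) pairs from a BioC-style response."""
    root = ET.fromstring(response_xml)
    found = []
    for annotation in root.iter("annotation"):
        entity_type = None
        for infon in annotation.findall("infon"):
            if infon.get("key") == "type":
                entity_type = infon.text
        found.append((entity_type, annotation.findtext("text")))
    return found
```

A pipeline would send the string returned by `build_request` as the body of an HTTP request and pass the service's reply to `extract_annotations`. For a response like the one in the figure, that would yield one pair each for fenfluramine and dexfenfluramine.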

**Figure 3.**
BioCreative IV Track 3 NER Testing Facility. Participants were provided with the *BioCreative IV Track 3 NER Testing Facility* developed by CTD. This testing facility provided a front-end to a CTD Web service that on execution called the participant's Web service using BioC XML associated with a specified PubMed ID for inter-process communications (top left screenshot). CTD's Web service would in turn receive text-mined annotations from the participant's Web service (using BioC XML). CTD's Web service then processed the annotations and computed the results against the curated data set, providing the user with recall, precision, response time and a detailed list of curated terms, text-mined terms and text-mined term hits (bottom right screenshot).
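
The per-article metrics reported by the testing facility can be illustrated with a short Python sketch. The formulas for precision, recall and balanced F-score are standard. The matching rule (case-insensitive exact match on term strings) and the names `score` and `NerScore` are assumptions, since the caption does not say how a text-mined term was judged to be a hit.

```python
from typing import Iterable, NamedTuple


class NerScore(NamedTuple):
    hits: int
    precision: float
    recall: float
    f_score: float


def score(curated: Iterable[str], text_mined: Iterable[str]) -> NerScore:
    """Compare text-mined terms with curated terms for one article.

    Assumption: a hit is a case-insensitive exact match between terms.
    """
    curated_set = {term.lower() for term in curated}
    mined_set = {term.lower() for term in text_mined}
    hits = len(curated_set & mined_set)
    precision = hits / len(mined_set) if mined_set else 0.0
    recall = hits / len(curated_set) if curated_set else 0.0
    total = precision + recall
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return NerScore(hits, precision, recall, f_score)


# Made-up terms for illustration: one of two mined terms is correct,
# and one of three curated terms is recovered.
example = score(
    curated=["fenfluramine", "dexfenfluramine", "serotonin"],
    text_mined=["Fenfluramine", "dopamine"],
)
```

In this made-up example precision is 1/2 and recall is 1/3, which gives a balanced F-score of 0.4.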

**Figure 4.**
Gene/protein named-entity recognition. Gene recall (blue), precision (red) and balanced F-score (green) results are shown for each participating group (anonymously identified by group number on x-axis). Average scores for each metric (dotted lines) are also provided.

**Figure 5.**
Balanced F-scores by group. Balanced F-score results for each NER category, as well as a combined average, are provided for each participating group (anonymously identified by group number on x-axis). Average scores for each metric (dotted lines) are also provided.

**Figure 6.**
Response times. Response time results for each NER category, as well as a combined average, are provided for each participating group (anonymously identified by group number on x-axis). *Note*: the response time in seconds (y-axis) uses a logarithmic scale.

**Figure 7.**
Chemical/drug named-entity recognition. Chemical recall (blue), precision (red) and balanced F-score (green) results are shown for each participating group (anonymously identified by group number on x-axis). Average scores for each metric (dotted lines) are also provided.

**Figure 8.**
Disease named-entity recognition. Disease recall (blue), precision (red) and balanced F-score (green) results are shown for each participating group (anonymously identified by group number on x-axis). Average scores for each metric (dotted lines) are also provided.

**Figure 9.**
Action term named-entity recognition. Chemical/gene action term recall (blue), precision (red) and balanced F-score (green) results are shown for each participating group (anonymously identified by group number on x-axis). Average scores for each metric (dotted lines) are also provided.

**Figure 10.**
Recall and precision. Combined average recall (x-axis) and precision (y-axis) results are shown for each participating group (color-coded by group number) within major NER category. For some groups there appeared to be a clear trade-off between recall and precision (e.g. 203), whereas for other groups trade-offs were less apparent (e.g. 184 and 199).

**Figure 11.**
Balanced F-score and response time. Combined average balanced F-score (x-axis) and response time (y-axis) results are shown for each participating group (color-coded by group number) within major NER category. There was no clear relationship between response time and F-score. *Note*: the response time in seconds (y-axis) uses a logarithmic scale.
