BioC: a minimalist approach to interoperability for biomedical text processing

Donald C Comeau¹, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, Alfonso Valencia, Karin Verspoor, Thomas C Wiegers, Cathy H Wu, W John Wilbur

Affiliations

¹ National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA; Department of Neurology, Massachusetts General Hospital, Boston, MA 02114; Harvard Medical School, Harvard University, Boston, MA 02115 USA; Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO 80045, USA; Structural and Computational Biology Group, Spanish National Cancer Research Centre, Madrid E-28029, Spain; Center for Bioinformatics and Computational Biology, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA; Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland; National ICT Australia (NICTA), Victoria Research Laboratory, The University of Melbourne, Parkville VIC 3010, Australia; and Department of Biology, North Carolina State University, Raleigh, NC 27695, USA.

PMID: 24048470
PMCID: PMC3889917
DOI: 10.1093/database/bat064

Database (Oxford). 2013 Sep 18:2013:bat064. doi: 10.1093/database/bat064. Print 2013.

Abstract

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create human-annotated text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple Extensible Markup Language (XML) format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented, including sentences, tokens, parts of speech, named entities such as genes or diseases, and relationships between named entities. In addition, we provide simple code to hold these data, read them from and write them back to XML files, and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
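
To make the idea of sharing text and stand-off annotations in one XML file concrete, the following is a minimal sketch using only Python's standard library. It is not the code distributed at the project site. The abstract does not list the schema, so the element names used here (`collection`, `document`, `passage`, `offset`, `text`, `annotation`, `infon`, `location`) and all values (`doc-1`, `A1`, `example.key`, the sample sentence) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_collection() -> ET.Element:
    """Build a tiny collection: one document, one passage, one annotation.

    Element names are assumed for illustration; values are hypothetical.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"   # hypothetical value
    ET.SubElement(collection, "key").text = "example.key"  # hypothetical value

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = "doc-1"

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    text = "BRCA1 mutations are linked to breast cancer."
    ET.SubElement(passage, "text").text = text

    # Stand-off annotation: the span is given by offset and length,
    # so the passage text itself is never modified.
    mention = "breast cancer"
    annotation = ET.SubElement(passage, "annotation", id="A1")
    ET.SubElement(annotation, "infon", key="type").text = "disease"
    ET.SubElement(
        annotation,
        "location",
        offset=str(text.index(mention)),
        length=str(len(mention)),
    )
    ET.SubElement(annotation, "text").text = mention
    return collection


def read_annotations(xml_string: str) -> list:
    """Return (document id, annotation type, covered text) tuples."""
    root = ET.fromstring(xml_string)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            passage_offset = int(passage.findtext("offset"))
            passage_text = passage.findtext("text")
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                # Convert the document-level offset to a passage-local index.
                begin = int(location.get("offset")) - passage_offset
                end = begin + int(location.get("length"))
                infon = annotation.find("infon[@key='type']")
                results.append((doc_id, infon.text, passage_text[begin:end]))
    return results


# Round trip: write the collection to an XML string, then read it back.
xml_string = ET.tostring(build_collection(), encoding="unicode")
annotations = read_annotations(xml_string)
print(annotations)
```

In this sketch the covered text is recovered from the passage by offset and length rather than from the annotation's own `text` element, which shows why the same passage can carry many independent annotation layers (sentences, tokens, entities) without conflict.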

Figures

**Figure 3.**
The exampleCollection.xml.

**Figure 4.**
The exampleCollection.key file describing the elements of the exampleCollection.xml file.

**Figure 6.**
The exampleAnnotation.xml.
