BioC: a minimalist approach to interoperability for biomedical text processing
- PMID: 24048470
- PMCID: PMC3889917
- DOI: 10.1093/database/bat064
Abstract
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
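The kind of document the abstract describes, text plus stand-off annotations and a relation between them, can be illustrated with a minimal sketch. The element and attribute names used here (`collection`, `document`, `passage`, `annotation`, `infon`, `location`, `relation`, `node`) are an assumption about the format's layout and should be checked against the DTD distributed at http://bioc.sourceforge.net/. The sample sentence, identifiers, and helper functions are hypothetical, and the code uses Python's standard `xml.etree.ElementTree` rather than the project's own libraries.

```python
import xml.etree.ElementTree as ET


def add_annotation(passage, ann_id, ann_type, passage_text, mention):
    """Attach one stand-off annotation for the first occurrence of `mention`.

    Offsets are relative to the document start; the passage offset is 0 here,
    so the index within the passage text is also the document offset.
    """
    offset = passage_text.index(mention)
    annotation = ET.SubElement(passage, "annotation", id=ann_id)
    ET.SubElement(annotation, "infon", key="type").text = ann_type
    ET.SubElement(annotation, "location",
                  offset=str(offset), length=str(len(mention)))
    ET.SubElement(annotation, "text").text = mention
    return annotation


def build_collection():
    """Build a one-document collection (element names are assumed, see above)."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "hypothetical example"
    ET.SubElement(collection, "date").text = "2013-01-01"
    ET.SubElement(collection, "key").text = "example.key"

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = "doc1"

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    text = "BRCA1 is linked to breast cancer."
    ET.SubElement(passage, "text").text = text

    # Named entities: a gene and a disease.
    add_annotation(passage, "T1", "gene", text, "BRCA1")
    add_annotation(passage, "T2", "disease", text, "breast cancer")

    # A relationship between the two named entities, referenced by id.
    relation = ET.SubElement(passage, "relation", id="R1")
    ET.SubElement(relation, "infon", key="type").text = "gene-disease"
    ET.SubElement(relation, "node", refid="T1", role="gene")
    ET.SubElement(relation, "node", refid="T2", role="disease")
    return collection


def read_annotations(xml_string):
    """Parse the XML back and return (id, type, offset, length, text) tuples."""
    root = ET.fromstring(xml_string)
    rows = []
    for annotation in root.iter("annotation"):
        location = annotation.find("location")
        rows.append((
            annotation.get("id"),
            annotation.find("infon[@key='type']").text,
            int(location.get("offset")),
            int(location.get("length")),
            annotation.find("text").text,
        ))
    return rows


if __name__ == "__main__":
    xml_string = ET.tostring(build_collection(), encoding="unicode")
    for row in read_annotations(xml_string):
        print(row)
```

Keeping annotations as offset and length pairs alongside the unmodified passage text, instead of inserting inline tags, is what lets independent tools add sentences, tokens, entities, and relations to the same document without disturbing one another's output.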
Similar articles
- BioC interoperability track overview. Database (Oxford). 2014 Jun 30;2014:bau053. doi: 10.1093/database/bau053. Print 2014. PMID: 24980129. Free PMC article. Review.
- tmBioC: improving interoperability of text-mining tools with BioC. Database (Oxford). 2014 Jul 25;2014:bau073. doi: 10.1093/database/bau073. Print 2014. PMID: 25062914. Free PMC article.
- Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford). 2014 Jun 9;2014:bau044. doi: 10.1093/database/bau044. Print 2014. PMID: 24914232. Free PMC article.
- BioC implementations in Go, Perl, Python and Ruby. Database (Oxford). 2014 Jun 23;2014:bau059. doi: 10.1093/database/bau059. Print 2014. PMID: 24961236. Free PMC article.
- A survey on annotation tools for the biomedical literature. Brief Bioinform. 2014 Mar;15(2):327-40. doi: 10.1093/bib/bbs084. Epub 2012 Dec 18. PMID: 23255168. Review.
Cited by
- BioC viewer: a web-based tool for displaying and merging annotations in BioC. Database (Oxford). 2016 Aug 10;2016:baw106. doi: 10.1093/database/baw106. Print 2016. PMID: 27515823. Free PMC article.
- SIA: a scalable interoperable annotation server for biomedical named entities. J Cheminform. 2018 Dec 14;10(1):63. doi: 10.1186/s13321-018-0319-2. PMID: 30552534. Free PMC article.
- Overview of the BioCreative VI text-mining services for Kinome Curation Track. Database (Oxford). 2018 Jan 1;2018:bay104. doi: 10.1093/database/bay104. PMID: 30329035. Free PMC article.
- PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 2019 Jul 2;47(W1):W587-W593. doi: 10.1093/nar/gkz389. PMID: 31114887. Free PMC article.
- Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics. 2018 Mar 9;19(1):94. doi: 10.1186/s12859-018-2103-8. PMID: 29523070. Free PMC article.