Corpus refactoring: a feasibility study
- PMID: 17854502
- PMCID: PMC2072937
- DOI: 10.1186/1747-5333-2-4
Abstract
Background: Most biomedical corpora have not been used outside of the lab that created them, despite the fact that the availability of the gold-standard evaluation data that they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring (changing the format of a corpus without altering its semantics) is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.
Results: The refactored corpus is available for download at the BioNLP SourceForge website http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.
Conclusion: We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.
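To make the idea of refactoring concrete, the following is a minimal Python sketch of one format change: turning standoff character-offset annotations into embedded XML, then reading the XML back to confirm that the text and annotations are unchanged. It is an illustration only, not the conversion code used in the study. The function names, the `protein` tag, the `doc` wrapper element, the example sentence, and the assumption that annotations are non-overlapping `(start, end, label)` spans are all hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def embed_annotations(text, annotations):
    """Wrap each (start, end, label) span of `text` in an inline XML tag.

    Assumes spans are non-overlapping character offsets (hypothetical input format).
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(annotations):
        if start < cursor or start >= end or end > len(text):
            raise ValueError(f"overlapping or invalid span: {(start, end, label)}")
        pieces.append(escape(text[cursor:start]))  # unannotated text before the span
        pieces.append(f"<{label}>{escape(text[start:end])}</{label}>")
        cursor = end
    pieces.append(escape(text[cursor:]))  # trailing unannotated text
    return "".join(pieces)


def extract_annotations(xml_fragment):
    """Recover the plain text and the (start, end, label) spans from embedded XML."""
    # Wrap the fragment in a single root element so it parses as a document.
    root = ET.fromstring(f"<doc>{xml_fragment}</doc>")
    parts = [root.text or ""]
    offset = len(parts[0])
    spans = []
    for child in root:
        inner = child.text or ""
        spans.append((offset, offset + len(inner), child.tag))
        tail = child.tail or ""
        parts.extend([inner, tail])
        offset += len(inner) + len(tail)
    return "".join(parts), spans


if __name__ == "__main__":
    sentence = "p53 binds MDM2 & regulates it."
    standoff = [(0, 3, "protein"), (10, 14, "protein")]
    embedded = embed_annotations(sentence, standoff)
    print(embedded)
    # Round trip: the refactored form must carry the same text and annotations.
    assert extract_annotations(embedded) == (sentence, standoff)
```

The round-trip check is the part that corresponds to "without altering its semantics": a conversion step is only accepted here if the original text and spans can be recovered exactly from the new format.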