Corpus refactoring: a feasibility study

Helen L Johnson¹, William A Baumgartner Jr, Martin Krallinger, K Bretonnel Cohen, Lawrence Hunter

PMID: 17854502
PMCID: PMC2072937
DOI: 10.1186/1747-5333-2-4

Helen L Johnson et al. J Biomed Discov Collab. 2007 Sep 13;2:4. doi: 10.1186/1747-5333-2-4.

Affiliation

¹ Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO, USA. helen.johnson@uchsc.edu.

Abstract

Background: Most biomedical corpora have not been used outside of the lab that created them, even though the availability of the gold-standard evaluation data they provide is one of the rate-limiting factors for progress in biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring (changing the format of a corpus without altering its semantics) is a feasible goal; namely, that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group (PDG) corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.

Results: The refactored corpus is available for download at the BioNLP SourceForge website (http://bionlp.sourceforge.net). The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. We also present the steps required to refactor any corpus.
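The effort figures above can be checked with a short calculation. This is only an illustrative sketch: the 40-hour person-week is an assumption not stated in the abstract, and the 102-hour figure is itself approximate.

```python
# Effort figures reported in the abstract (102 is "about 102" in the source).
programming_hours = 102
validation_hours = 20

# Assumption: one person-week is 40 working hours (not stated in the abstract).
HOURS_PER_PERSON_WEEK = 40

total_hours = programming_hours + validation_hours
person_weeks = total_hours / HOURS_PER_PERSON_WEEK
validation_share = validation_hours / total_hours

print(f"total hours: {total_hours}")
print(f"person-weeks: {person_weeks:.2f}")
print(f"share spent on manual validation: {validation_share:.0%}")
```

Under that assumption the total comes to 122 hours, which is consistent with "just over three person-weeks"; manual validation accounts for roughly one sixth of it.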

Conclusion: We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.

Figures

**Figure 1**
**Text block from original PDG corpus**. This block of text from the original PDG corpus shows the idiosyncratic format of the protein interaction annotations. "MED" is a deprecated MEDLINE ID. The words that follow "actions" are keywords denoting an interaction type between proteins. The words that follow "Proteins" are the interactors. The text that follows has been altered from the original MEDLINE publication.

**Figure 2**
**Refactored corpus: WordFreak format**. Example of the text block from Figure 1 in the refactored WordFreak format. The original sentence reads *Here we show that E2F binds to two sequence elements within the P2 promoter of the human MYC gene which are within a region that is critical for promoter activity*.

**Figure 3**
**Refactored corpus: embedded XML format**. Example of the text block from Figure 1 in the refactored embedded XML format.
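
To make the idea of refactoring concrete, the following is a minimal sketch of converting one annotated text block into embedded XML with simple text processing. It is not the authors' code. The block layout (`MED:`, `actions:`, `Proteins:` lines followed by the text), the placeholder ID `00000000`, the protein and action values, and the XML tag names `sentence`, `protein`, and `action` are all assumptions made for illustration, loosely following the Figure 1 caption; the actual PDG and refactored formats may differ.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical PDG-like block. Field layout, ID, and annotations are
# illustrative assumptions, not copied from the real corpus.
SAMPLE_BLOCK = """MED: 00000000
actions: binds
Proteins: E2F, MYC
Here we show that E2F binds to two sequence elements within the P2 promoter of the human MYC gene."""


def parse_block(block):
    """Split a block into its ID, action keywords, interactors, and text."""
    record = {"med_id": None, "actions": [], "proteins": [], "text": ""}
    text_lines = []
    for line in block.strip().splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key == "med":
            record["med_id"] = value.strip()
        elif sep and key in ("actions", "proteins"):
            record[key] = [v.strip() for v in value.split(",") if v.strip()]
        else:
            # Anything that is not a recognized field is treated as text.
            text_lines.append(line.strip())
    record["text"] = " ".join(text_lines)
    return record


def to_embedded_xml(record):
    """Wrap annotated terms in inline tags, escaping all other text."""
    labels = {term: "protein" for term in record["proteins"]}
    labels.update({term: "action" for term in record["actions"]})
    text = record["text"]
    pieces, last = [], 0
    if labels:
        # Longest terms first so longer names win over their prefixes.
        terms = sorted(labels, key=len, reverse=True)
        pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b")
        for match in pattern.finditer(text):
            tag = labels[match.group(0)]
            pieces.append(escape(text[last:match.start()]))
            pieces.append(f"<{tag}>{escape(match.group(0))}</{tag}>")
            last = match.end()
    pieces.append(escape(text[last:]))
    med = quoteattr(record["med_id"] or "")
    return f"<sentence med={med}>{''.join(pieces)}</sentence>"


record = parse_block(SAMPLE_BLOCK)
xml_output = to_embedded_xml(record)
print(xml_output)
```

Matching annotations by exact string is the simplest possible strategy and will miss or mislabel mentions in real text, which is one reason a refactoring process of this kind would still need the manual validation step described in the abstract.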
