J Biomed Discov Collab. 2007 Sep 13:2:4.
doi: 10.1186/1747-5333-2-4.

Corpus refactoring: a feasibility study

Helen L Johnson et al. J Biomed Discov Collab.

Abstract

Background: Most biomedical corpora have not been used outside of the lab that created them, even though the availability of the gold-standard evaluation data they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring (changing the format of a corpus without altering its semantics) is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group (PDG) corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.

Results: The refactored corpus is available for download at the BioNLP SourceForge website http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.
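The "just over three person-weeks" figure can be checked against the two reported hour counts. The short sketch below does that arithmetic; the 40-hour person-week is an assumption made here for illustration and is not stated in the abstract, and the hour counts are the approximate values reported above.

```python
# Hours reported in the abstract (the programming figure is "about 102").
PROGRAMMING_HOURS = 102
VALIDATION_HOURS = 20

# Assumption (not stated in the source): one person-week = 40 working hours.
HOURS_PER_PERSON_WEEK = 40

total_hours = PROGRAMMING_HOURS + VALIDATION_HOURS
person_weeks = total_hours / HOURS_PER_PERSON_WEEK
validation_share = VALIDATION_HOURS / total_hours

print(f"total: {total_hours} h, about {person_weeks:.2f} person-weeks")
print(f"manual validation share of total time: {validation_share:.0%}")
```

Under that assumption the total is 122 hours, or roughly 3.05 person-weeks, which is consistent with the figure given in the Results.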

Conclusion: We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.


Figures

Figure 1
Text block from original PDG corpus. This block of text from the original PDG corpus shows the idiosyncratic format of the protein interaction annotations. "MED" is a deprecated MEDLINE ID. The words that follow "actions" are keywords denoting an interaction type between proteins. The words that follow "Proteins" are the interactors. The text that follows has been altered from the original MEDLINE publication.
Figure 2
Refactored corpus: WordFreak format. Example of the text block from Figure 1 in the refactored WordFreak format. The original sentence reads: "Here we show that E2F binds to two sequence elements within the P2 promoter of the human MYC gene which are within a region that is critical for promoter activity."
Figure 3
Refactored corpus: embedded XML format. Example of the text block from Figure 1 in the refactored embedded XML format.
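
As a rough illustration of what "changing the format without altering the semantics" can mean for an embedded XML target, the sketch below wraps character spans of a sentence in inline XML elements and then checks that stripping the tags recovers the original text. It is not the authors' conversion code. The tag names (`protein`, `interaction`), the helper functions, and the choice of which words to mark are all hypothetical, picked for illustration only; they are not taken from the PDG corpus annotations or from the refactored files.

```python
import re
from xml.sax.saxutils import escape, unescape

# Sentence quoted in the Figure 2 caption.
SENTENCE = (
    "Here we show that E2F binds to two sequence elements within the P2 "
    "promoter of the human MYC gene which are within a region that is "
    "critical for promoter activity."
)

def span_of(text, word, tag):
    """Return (start, end, tag) for the first occurrence of `word` in `text`."""
    start = text.index(word)
    return (start, start + len(word), tag)

def embed_annotations(text, spans):
    """Wrap each (start, end, tag) span of `text` in an inline XML element."""
    out, cursor = [], 0
    for start, end, tag in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        out.append(escape(text[cursor:start]))
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)

def strip_tags(xml_text):
    """Remove inline tags and undo XML escaping to recover the plain text."""
    return unescape(re.sub(r"<[^>]+>", "", xml_text))

# Hypothetical annotations, chosen only to exercise the functions.
spans = [
    span_of(SENTENCE, "E2F", "protein"),
    span_of(SENTENCE, "binds", "interaction"),
    span_of(SENTENCE, "MYC", "protein"),
]

embedded = embed_annotations(SENTENCE, spans)
print(embedded)

# Round-trip check: the refactored form must still carry the original text.
assert strip_tags(embedded) == SENTENCE
```

A round-trip check of this kind is one cheap automated test that a format conversion has not altered the underlying text; it complements, rather than replaces, manual validation of the automatic outputs.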
