BioCreAtIvE task1A: entity identification with a stochastic tagger

Shuhei Kinoshita¹, K Bretonnel Cohen, Philip V Ogren, Lawrence Hunter

PMID: 15960838
PMCID: PMC1869018
DOI: 10.1186/1471-2105-6-S1-S4

Shuhei Kinoshita et al. BMC Bioinformatics. 2005;6 Suppl 1:S4.

doi: 10.1186/1471-2105-6-S1-S4. Epub 2005 May 24.

Affiliation

¹ Center for Computational Pharmacology, University of Colorado School of Medicine, Denver, Colorado, USA. kino@strad.ssg.fujitsu.com

Abstract

Background: Our approach to Task 1A was inspired by Tanabe and Wilbur's ABGene system. Like Tanabe and Wilbur, we approached the problem as one of part-of-speech tagging, adding a GENE tag to the standard tag set. Where their system uses the Brill tagger, we used TnT, the Trigrams 'n' Tags HMM-based part-of-speech tagger. Based on careful error analysis, we implemented a set of post-processing rules to correct both false positives and false negatives. We participated in both the open and the closed divisions; for the open division, we made use of data from NCBI.
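As a minimal sketch of the tagging formulation described above, the snippet below collects gene mentions from tagger output in which gene tokens carry a GENE tag. The (token, tag) input format, the function name `gene_mentions`, and the choice to merge adjacent GENE tokens into one mention are assumptions made for illustration; this is not TnT's actual output format or the authors' post-processing.

```python
from typing import List, Tuple


def gene_mentions(tagged: List[Tuple[str, str]], gene_tag: str = "GENE") -> List[str]:
    """Group runs of consecutive GENE-tagged tokens into mention strings.

    Hypothetical helper: assumes the tagger yields (token, tag) pairs.
    """
    mentions: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag == gene_tag:
            # Still inside a gene mention: keep accumulating tokens.
            current.append(token)
        elif current:
            # A non-GENE tag closes the mention collected so far.
            mentions.append(" ".join(current))
            current = []
    if current:
        # Flush a mention that runs to the end of the sentence.
        mentions.append(" ".join(current))
    return mentions


# Illustrative input only; the tags here are not real tagger output.
tagged_sentence = [
    ("The", "DT"), ("p53", "GENE"), ("protein", "NN"), ("binds", "VBZ"),
    ("cyclin", "GENE"), ("D1", "GENE"), (".", "."),
]
print(gene_mentions(tagged_sentence))
```

With a single GENE tag, two distinct mentions that happen to be adjacent would be merged into one, which is one kind of error that post-processing rules might need to address.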

Results: Our base system without post-processing achieved a precision of 68.0% and a recall of 77.2%, giving an F-measure of 72.3%. The full system with post-processing achieved a precision of 80.3% and a recall of 80.5%, giving an F-measure of 80.4%. For the open division, we achieved a slight improvement (F-measure = 80.9%) by adding a dictionary-based post-processing step. We placed third in both the open and the closed divisions.
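The reported F-measures are consistent with the balanced F-measure, the harmonic mean of precision and recall. The check below assumes that definition; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values (in percent) taken from the Results above.
base = f_measure(68.0, 77.2)
full = f_measure(80.3, 80.5)
print(round(base, 1), round(full, 1))
```

Rounded to one decimal place, these give 72.3 and 80.4, matching the figures reported for the base and full systems.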

Conclusion: Our results show that a part-of-speech tagger can be augmented with post-processing rules, resulting in an entity identification system that competes well with other approaches.

Figures

**Figure 1**
**Precision and Recall.** Figure 1A shows the precision and recall for the cross-validation data. Figure 1B shows the precision and recall for the official test data. The label "w/o post-p" stands for "without post-processing".

**Figure 2**
**Effect of term length on performance.** Figure 2A shows the effect of term length for the cross validation data. Figure 2B shows the effect of term length for the official test data.

References

1. Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics. 2002;18:1124–1132. doi: 10.1093/bioinformatics/18.8.1124.
2. Tanabe L, Wilbur WJ. Tagging gene and protein names in full text articles. Proceedings of the workshop on biomedical natural language processing in the biomedical domain. Association for Computational Linguistics. 2002. pp. 9–13.
3. Brants T. TnT – A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000).
4. Fukuda K, Tsunoda T, Tamura A, Takagi T. Toward information extraction: identifying protein names from biological papers. Pacific Symposium for Biocomputing. 1998;3:705–716.
5. Olsson F, Eriksson G, Franzén K, Asker L, Lidén P. Notions of correctness when evaluating protein name taggers. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). pp. 765–771.
