Comparison of character-level and part of speech features for name recognition in biomedical texts

Nigel Collier¹, Koichi Takeuchi

PMID: 15542016
DOI: 10.1016/j.jbi.2004.08.008

Free article

J Biomed Inform. 2004 Dec;37(6):423-35.

Affiliation

¹ National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan. collier@nii.ac.jp

Abstract

The immense volume of data now available from experiments in molecular biology has led to an explosion in reported results, most of which are available only as unstructured text. For this reason there has been great interest in text mining to aid fact extraction, document screening, citation analysis, and linkage with large gene and gene-product databases. In particular, the named entity (NE) task has been investigated intensively as a core technology for all of these applications, driven by the availability of high-volume training sets such as the GENIA v3.02 corpus. Despite such large training sets, accuracy for biology NE has remained consistently far below that in the news domain, where F scores above 90, which can be considered close to human performance, are commonly reported. We argue that more rigorous analysis of the factors contributing to model performance is crucial for discovering where the underlying limitations lie and what direction future research should take. This paper reports on variations of two widely used feature types, part of speech (POS) tags and character-level orthographic features, and compares how these variations influence performance. We base our experiments on a proven state-of-the-art model, support vector machines, using a high-quality subset of 100 annotated MEDLINE abstracts. Experiments reveal that the best-performing features are orthographic features, with an F score of 72.6. Although the Brill tagger trained in-domain on the GENIA v3.02p POS corpus gives the best overall performance of any POS tagger, with an F score of 68.6, this is still significantly below the orthographic features. In combination, these two feature types appear to interfere with each other and degrade performance slightly, to an F score of 72.3.
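
To make the terminology concrete, the following is a minimal sketch of what a character-level orthographic feature and an F score might look like in code. The abstract does not list the feature classes actually used in the paper, so the class names and patterns in `ORTHO_PATTERNS`, and the helper names `ortho_class` and `f_score`, are hypothetical illustrations rather than the authors' method. The F score shown is the usual balanced harmonic mean of precision and recall, scaled to 0-100 to match the figures quoted above.

```python
import re

# Hypothetical orthographic classes, checked in order (first match wins).
# Illustrative only: these are NOT the feature classes from the paper.
ORTHO_PATTERNS = [
    ("Digits", re.compile(r"^\d+$")),
    ("GreekLetter", re.compile(r"^(alpha|beta|gamma|delta|kappa)$", re.IGNORECASE)),
    ("AllCaps", re.compile(r"^[A-Z]+$")),
    ("CapsAndDigits", re.compile(r"^(?=.*[A-Z])(?=.*\d)[A-Z\d]+$")),
    ("LettersAndDigits", re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")),
    ("Hyphenated", re.compile(r"^\w+(-\w+)+$")),
    ("InitCap", re.compile(r"^[A-Z][a-z]+$")),
    ("Lowercase", re.compile(r"^[a-z]+$")),
]


def ortho_class(token):
    """Map a token to a single orthographic class label based on its characters."""
    for name, pattern in ORTHO_PATTERNS:
        if pattern.match(token):
            return name
    return "Other"


def f_score(true_positives, false_positives, false_negatives):
    """Balanced F score (harmonic mean of precision and recall) on a 0-100 scale."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for tok in ["IL-2", "CD28", "p53", "kappa", "Interleukin", "binding", "42"]:
        print(tok, ortho_class(tok))
```

In a tagger of the kind the abstract describes, a label such as the one returned by `ortho_class` would typically be supplied to the classifier as one categorical feature per token, alongside or instead of a POS tag; how the paper encodes its features is not stated in the abstract.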

Cited by

Information extraction approaches to unconventional data sources for "Injury Surveillance System": the case of newspapers clippings.
Berchialla P, Scarinzi C, Snidero S, Rahim Y, Gregori D. J Med Syst. 2012 Apr;36(2):475-81. doi: 10.1007/s10916-010-9492-1. Epub 2010 Apr 27. PMID: 20703703
Automating curation using a natural language processing pipeline.
Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M, Tobin R, Wang X. Genome Biol. 2008;9 Suppl 2(Suppl 2):S10. doi: 10.1186/gb-2008-9-s2-s10. Epub 2008 Sep 1. PMID: 18834488 Free PMC article.
Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations.
Munkhdalai T, Li M, Batsuren K, Park HA, Choi NH, Ryu KH. J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S9. doi: 10.1186/1758-2946-7-S1-S9. eCollection 2015. PMID: 25810780 Free PMC article.
Automated recognition of malignancy mentions in biomedical literature.
Jin Y, McDonald RT, Lerman K, Mandel MA, Carroll S, Liberman MY, Pereira FC, Winters RS, White PS. BMC Bioinformatics. 2006 Nov 7;7:492. doi: 10.1186/1471-2105-7-492. PMID: 17090325 Free PMC article.
Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation.
Pahikkala T, Ginter F, Boberg J, Järvinen J, Salakoski T. BMC Bioinformatics. 2005 Jun 22;6:157. doi: 10.1186/1471-2105-6-157. PMID: 15972097 Free PMC article.
