Recognizing names in biomedical texts: a machine learning approach

GuoDong Zhou¹, Jie Zhang, Jian Su, Dan Shen, ChewLim Tan

PMID: 14871877
DOI: 10.1093/bioinformatics/bth060

Comparative Study

Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Affiliation

¹ Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613. zhougd@i2r.a-star.edu.sg

Abstract

Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents. In order to make organized and structured information available, automatically recognizing biomedical entity names becomes critical and is important for information retrieval, information extraction and automated knowledge acquisition.

Results: We present PowerBioNE, a named entity recognition system for the biomedical domain. To deal with the special naming conventions of the domain, we propose several evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) part-of-speech; (4) head noun trigger; (5) special verb trigger; and (6) name alias feature. All the features are integrated effectively and efficiently through a hidden Markov model (HMM) and an HMM-based named entity recognizer. In addition, a k-Nearest Neighbor (k-NN) algorithm is proposed to resolve the data sparseness problem in our system. Finally, we present a pattern-based post-processing step that automatically extracts rules from the training data to deal with the cascaded entity name phenomenon. To the best of our knowledge, PowerBioNE is the first system to deal with the cascaded entity name phenomenon. Evaluation shows that our system achieves F-measures of 66.6 and 62.2 on the 23 classes of GENIA V3.0 and V1.1, respectively. In particular, it achieves an F-measure of 75.8 on the "protein" class of GENIA V3.0. For comparison, our system outperforms the best published result by 7.8 on GENIA V1.1, without the help of any dictionaries. The evaluation also shows that our HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem. Moreover, evaluation on GENIA V3.0 shows that the post-processing for the cascaded entity name phenomenon improves the F-measure by 3.9. Finally, error analysis shows that about half of the errors are caused by the strict annotation scheme and the annotation inconsistency in the GENIA corpus. This suggests that our system achieves an acceptable F-measure of 83.6 on the 23 classes of GENIA V3.0, and in particular 86.2 on the "protein" class, without the help of any dictionaries. We think that an F-measure of 90 on the 23 classes of GENIA V3.0, and in particular 92 on the "protein" class, can be achieved by refining the annotation scheme of the GENIA corpus (for example, through a more flexible annotation scheme and better annotation consistency) and by including a reasonable biomedical dictionary.
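
As a rough illustration of the first two feature types listed above (word formation pattern and prefix/suffix), the following Python sketch maps a token to surface features. It is not the PowerBioNE implementation: the category names, the function names (word_formation, affixes, token_features) and the affix length of 3 are all assumptions made for this example, since the abstract does not specify them.

```python
from typing import Dict, Tuple


def word_formation(token: str) -> str:
    """Map a token to a coarse word-formation category.

    The category names are hypothetical; the abstract does not list
    the categories the system actually uses.
    """
    if token.isdigit():
        return "AllDigits"
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    if has_digit and has_alpha:
        return "AlphaDigit"      # e.g. "IL-2", "p53"
    if token.isalpha() and token.isupper():
        return "AllCaps"         # e.g. "DNA"
    if token.isalpha() and token[0].isupper() and token[1:].islower():
        return "InitCap"         # e.g. "Interleukin"
    if token.isalpha() and token.islower():
        return "LowerCase"       # e.g. "kinase"
    if token.isalpha():
        return "MixedCase"       # e.g. "mRNA"
    return "Other"


def affixes(token: str, n: int = 3) -> Tuple[str, str]:
    """Return the lowercased prefix and suffix of length n (assumed n=3)."""
    lower = token.lower()
    return lower[:n], lower[-n:]


def token_features(token: str) -> Dict[str, str]:
    """Bundle the illustrative features for one token."""
    prefix, suffix = affixes(token)
    return {
        "formation": word_formation(token),
        "prefix": prefix,
        "suffix": suffix,
    }


if __name__ == "__main__":
    for tok in ["IL-2", "kinase", "mRNA", "Interleukin"]:
        print(tok, token_features(tok))
```

In a tagger of the kind the abstract describes, features like these would be computed per token and passed to the sequence model together with the other evidence (part-of-speech, triggers, aliases); how PowerBioNE combines them is described in the full paper, not here.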

Availability: A demo system is available at http://textmining.i2r.a-star.edu.sg/NLS/demo.htm. A technology license is available upon bilateral agreement.
