Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification

Farrokh Mehryary¹, Suwisa Kaewphan², Kai Hakala¹, Filip Ginter³

Affiliations

¹ Department of Information Technology, University of Turku, Turku, Finland ; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland.
² Department of Information Technology, University of Turku, Turku, Finland ; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland ; Turku Centre for Computer Science (TUCS), Turku, Finland.
³ Department of Information Technology, University of Turku, Turku, Finland.

PMID: 27175227
PMCID: PMC4864999
DOI: 10.1186/s13326-016-0070-4

Farrokh Mehryary et al. J Biomed Semantics. 2016 May 11;7:27. doi: 10.1186/s13326-016-0070-4. eCollection 2016.

Abstract

Background: Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction.

Methods: Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database that covers the whole PubMed and PubMed Central Open Access literature and contains more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on a combination of the trigger word classification produced by the unsupervised clustering method and manual annotation.
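The cluster-then-prune idea can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional toy vectors, the seed labels, the average-linkage criterion, and the purity-based pruning rule (`min_purity`) are all assumptions chosen to keep the example small and dependency-free.

```python
import math
from itertools import combinations

# Toy 2-D "embeddings" (hypothetical values, not real word vectors).
VECTORS = {
    "phosphorylation": (1.00, 0.10),
    "binding": (0.90, 0.20),
    "expression": (0.95, 0.15),
    "figure": (0.10, 1.00),
    "table": (0.20, 0.90),
    "patients": (0.15, 0.95),
}
# Hypothetical manual seed labels: True = can act as a trigger.
SEEDS = {"binding": True, "figure": False}


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cluster(vectors):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    clusters = [(word, [word]) for word in sorted(vectors)]  # (subtree, members)

    def linkage(pair):
        (_, a), (_, b) = pair
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


def leaves(tree):
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def prune(tree, seeds, min_purity=1.0):
    """Collect words of subtrees whose labelled seeds are (almost) all non-triggers."""
    words = leaves(tree)
    labels = [seeds[w] for w in words if w in seeds]
    if labels and labels.count(False) / len(labels) >= min_purity:
        return set(words)  # the whole subtree is rejected
    if isinstance(tree, str):
        return set()
    return prune(tree[0], seeds, min_purity) | prune(tree[1], seeds, min_purity)


rejected = prune(cluster(VECTORS), SEEDS)
print(sorted(rejected))
```

In this toy setting a single negative seed ("figure") is enough to reject its unlabelled neighbours in the tree, which is the appeal of pruning whole subtrees rather than judging words one by one.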

Results: The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not bound solely to the EVEX resource and can thus be used to improve the quality of any event extraction system or database.
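As a rough illustration of how a list of rejected trigger words could be applied to an event collection, the following sketch drops every event whose trigger is on the list and reports how many were removed. The event records, their field names, and the `filter_events` helper are hypothetical and do not reflect the EVEX schema; nested events that take a removed event as an argument are not handled here.

```python
def filter_events(events, rejected_triggers):
    """Return (kept_events, removed_count) after dropping rejected triggers."""
    rejected = {t.lower() for t in rejected_triggers}  # case-insensitive match
    kept = [e for e in events if e["trigger"].lower() not in rejected]
    return kept, len(events) - len(kept)


# Hypothetical event records for demonstration only.
events = [
    {"id": "E1", "type": "Phosphorylation", "trigger": "phosphorylation"},
    {"id": "E2", "type": "Binding", "trigger": "Figure"},
    {"id": "E3", "type": "Gene_expression", "trigger": "expression"},
]

kept, removed = filter_events(events, {"figure", "table"})
print(removed, [e["id"] for e in kept])
```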

Availability: The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.

Keywords: BioNLP; Event extraction; Trigger detection; Word embeddings.

Figures

**Fig. 1**
Visualization of a specific event occurrence. Genes and gene products (‘GGPs’) are marked, as well as the trigger words that refer to specific event types. Finally, arrows denote the roles of each argument in the event (e.g. Theme or Cause). (Adapted from [23])

**Fig. 2**
Example sentence with multiple events sharing a single trigger. Two event occurrences are extracted from the same trigger word *recognized*.

Cited by

New reasons for biologists to write with a formal language.
Rodriguez-Esteban R. Database (Oxford). 2022 Jun 3;2022:baac039. doi: 10.1093/database/baac039. PMID: 35657112. Free PMC article.

References

1. Wei CH, Kao HY, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013;41:W518–22. doi: 10.1093/nar/gks1232.
2. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP, Kuhn M, Bork P, Jensen LJ, von Mering C. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015;43(D1):D447–52. doi: 10.1093/nar/gku1003.
3. Hakala K, Mehryary F, Kaewphan S, Ginter F. Hypothesis generation in large-scale event networks. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine (LBM’13). Tokyo: Database Center for Life Science; 2013.
4. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J. Overview of BioNLP’09 Shared Task on Event Extraction. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task. Boulder, Colorado: Association for Computational Linguistics; 2009.
5. Kim JD, Pyysalo S, Ohta T, Bossy R, Nguyen N, Tsujii J. Overview of BioNLP Shared Task 2011. In: Proceedings of the BioNLP Shared Task 2011 Workshop. Portland, Oregon, USA: Association for Computational Linguistics; 2011.
