Complex event extraction at PubMed scale

Jari Björne¹, Filip Ginter, Sampo Pyysalo, Jun'ichi Tsujii, Tapio Salakoski

PMID: 20529932
PMCID: PMC2881365
DOI: 10.1093/bioinformatics/btq180

Jari Björne et al. Bioinformatics. 2010 Jun 15;26(12):i382-90. doi: 10.1093/bioinformatics/btq180.

Affiliation

¹ Department of Information Technology, University of Turku, Turku, Finland. jari.bjorne@utu.fi

Abstract

Motivation: There has recently been a notable shift in biomedical information extraction (IE) from relation models toward the more expressive event model, facilitated by the maturation of basic tools for biomedical text analysis and the availability of manually annotated resources. The event model allows detailed representation of complex natural language statements and can support a number of advanced text mining applications ranging from semantic search to pathway extraction. A recent collaborative evaluation demonstrated the potential of event extraction systems, yet there have so far been no studies of the generalization ability of the systems nor the feasibility of large-scale extraction.

Results: This study considers event-based IE at PubMed scale. We introduce a system combining publicly available, state-of-the-art methods for domain parsing, named entity recognition and event extraction, and test the system on a representative 1% sample of all PubMed citations. We present the first evaluation of the generalization performance of event extraction systems to this scale and show that despite its computational complexity, event extraction from the entire PubMed is feasible. We further illustrate the value of the extraction approach through a number of analyses of the extracted information.

Availability: The event detection system and extracted data are open source licensed and available at http://bionlp.utu.fi/.

Figures

**Fig. 1.**
Event extraction. A multiphased system is used to generate an *event graph*, a formal representation for the semantic content of the sentence. Before event detection, sentences are parsed (A) to generate a suitable syntactic graph to be used in detecting semantic relationships. Event detection starts with identification of named entities (B) with BANNER (parses are not used at this step). Once named entities have been identified, the trigger detector (C) uses them and the parse for predicting triggers, words which define potential events. The edge detector (D) predicts relationship edges (event arguments) between triggers and named entities. Finally, the resulting semantic graph is divided into individual events by (E) duplicating trigger nodes and regrouping argument edges.
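Step (E) of the caption can be illustrated with a minimal sketch. It assumes a deliberately simplified rule that is not taken from the paper: every Theme edge of a trigger yields its own copy of the trigger, and each Cause edge is paired with every Theme. The `Edge` type, the `unmerge` function and the node ids are hypothetical names chosen for this example, and nested event references are left pointing at the merged trigger rather than expanded.

```python
from itertools import product
from typing import NamedTuple


class Edge(NamedTuple):
    trigger: str  # id of the trigger node the edge starts from
    role: str     # argument role; this sketch handles only "Theme" and "Cause"
    target: str   # id of a named entity or of another trigger


def unmerge(edges):
    """Split a merged semantic graph into individual events.

    Assumed rule (a simplification, not the authors' rule set): the trigger
    is duplicated once per Theme/Cause combination of its outgoing edges.
    """
    # Regroup argument edges by the trigger they belong to.
    by_trigger = {}
    for edge in edges:
        roles = by_trigger.setdefault(edge.trigger, {"Theme": [], "Cause": []})
        roles[edge.role].append(edge.target)

    events = []
    for trigger, roles in by_trigger.items():
        causes = roles["Cause"] or [None]  # Cause is optional
        for theme, cause in product(roles["Theme"], causes):
            event = {"trigger": trigger, "Theme": theme}
            if cause is not None:
                event["Cause"] = cause
            events.append(event)
    return events


merged = [
    Edge("T1", "Theme", "P1"),
    Edge("T1", "Theme", "P2"),
    Edge("T2", "Theme", "T1"),
    Edge("T2", "Cause", "P3"),
]
individual_events = unmerge(merged)
```

Here trigger `T1` has two Theme edges and is therefore duplicated into two events, while `T2` keeps a single event carrying both its Theme and its Cause.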

**Fig. 2.**
Total number of citations and citations with tagged gene/protein mentions and events in the sample by year.

**Fig. 3.**
Number of citations with tagged mentions of *insulin*, *IgG* and *TNF-alpha* (normalized for capitalization and hyphenation), as well as extracted events of these proteins. The counts are cumulative for every five years to smooth the curves.
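The normalization and five-year aggregation mentioned in this caption could look roughly like the sketch below. It rests on assumptions not stated in the source: normalization is taken to mean lowercasing and dropping hyphens, the five-year counts are taken to be sums over fixed five-year bins, and the record layout `(pmid, year, mention)` and both function names are hypothetical.

```python
from collections import Counter


def normalize(mention):
    """Lowercase the mention and drop hyphens (assumed normalization)."""
    return mention.lower().replace("-", "")


def five_year_counts(records, name):
    """Count distinct citations mentioning `name`, summed per five-year bin.

    `records` is an iterable of (pmid, year, mention) tuples; each bin is
    keyed by its first year, e.g. 1990 covers 1990 through 1994.
    """
    target = normalize(name)
    seen = set()
    counts = Counter()
    for pmid, year, mention in records:
        if normalize(mention) == target and pmid not in seen:
            seen.add(pmid)  # count each citation once
            counts[year - year % 5] += 1
    return dict(sorted(counts.items()))


records = [
    (1, 1991, "TNF-alpha"),
    (1, 1991, "tnf-Alpha"),  # same citation, spelling variant
    (2, 1994, "TNFalpha"),
    (3, 1996, "Tnf-alpha"),
    (4, 1996, "insulin"),
]
tnf_counts = five_year_counts(records, "TNF-alpha")
```

Deduplicating by citation id keeps the unit of counting aligned with the caption, which reports citations rather than individual mentions.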

**Fig. 4.**
Extracted event network around interleukin-4. This graph shows a subset of the predicted event network, including only named entities with at least 50 extracted instances. The round event nodes are (P)ositive regulation, (N)egative regulation, (R)egulation, gene (E)xpression, (B)inding, p(H)osphorylation and (L)ocalization. For clarity, single-argument events (E, B, H and L) are displayed only when they also act as arguments of regulation events.
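The frequency threshold used to select this subset can be sketched as follows. This is an illustration under assumptions: the function names, the flat `(event_type, entity_args)` event representation and the example entity names are hypothetical, and the additional display rule for single-argument events is not covered.

```python
from collections import Counter


def frequent_entities(instances, min_count=50):
    """Return the entity names that occur at least `min_count` times."""
    counts = Counter(instances)
    return {name for name, n in counts.items() if n >= min_count}


def restrict_events(events, keep):
    """Keep only events whose entity arguments are all in `keep`."""
    return [(etype, args) for etype, args in events
            if all(arg in keep for arg in args)]


# Toy data: one list item per extracted instance of an entity.
instances = ["IL-4"] * 60 + ["STAT6"] * 50 + ["rare"] * 3
keep = frequent_entities(instances)

events = [
    ("Binding", ("IL-4", "STAT6")),
    ("Expression", ("rare",)),
]
subset = restrict_events(events, keep)
```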
