BMC Bioinformatics. 2008 Jan 8;9:10.
doi: 10.1186/1471-2105-9-10.

Corpus annotation for mining biomedical events from literature

Jin-Dong Kim et al. BMC Bioinformatics. 2008.

Abstract

Background: Advanced Text Mining (TM) tasks, such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering, have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation.

Results: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design an annotation scheme that meets the specific requirements of text annotation, (2) to achieve biology-oriented annotation that reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to the successful completion of a large-scale annotation.

Conclusion: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.

Figures

Figure 1
GENIA term ontology. The hierarchy of the GENIA term ontology. Terminal classes are used for GENIA term annotation. The figures in parentheses indicate the number of annotation instances made to the GENIA corpus.
Figure 2
Example of event annotation. GENIA event annotation is made sentence by sentence. Although the actual corpus file with annotation is encoded in XML (C), the annotators work on a CSS-styled view (A) which is much more user-friendly. Sometimes, a graphical representation (B) is used to depict annotated events and their relations in an abstract and concise way. Note that the black, red and blue arcs link an event with its themes, causes and location respectively.
Figure 3
GENIA event ontology. The hierarchy of the GENIA event ontology. For event annotation, not only terminal classes but also higher-level classes may be used. The figures in parentheses indicate the number of annotation instances made to the GENIA corpus.
Figure 4
Graphical representation of events in some example sentences. Examples in text with corresponding event annotation in graphical representation. (A) T-cell expression of the human GATA-3 gene is regulated by a non-lineage-specific silencer. (B) The extent of IFN-induced NK cell killing of E1A-expressing cells was proportional to the level of E1A expression ... (C) Cell hemoglobinization was accompanied by the increased expression of genes encoding gamma-globin ... (D) In addition, forced expression of GATA3 potentiated the induction of RALDH2 by TAL1 and LMO, and these three factors formed a complex in vivo.
Figure 5
SBML-style event description for the example in Figure 2. The nodes denote biological entities. The links denote transitions between different states of entities and correspond to events causing the state transitions.
Figure 6
Graph representations of events about "LMP1 to activate NF-kappa B". (A) expresses the event "LMP1 activates NF-kappa B", and (B) expresses the event "expression of LMP1 activates NF-kappa B". The biological implications of the two expressions are equivalent: since LMP1 activates NF-kappa B, the physical manifestation of LMP1 also activates NF-kappa B.
Figure 7
Molecular interactions and signaling pathways engaged by LMP1. LMP1 is involved in the activation of NFkB. Even though LMP1 affects the activation of NFkB only through a complex pathway, in natural language text its involvement in the activation of NFkB is often written simply as "LMP1 activates NFkB." Reprinted from [68], Copyright 2001, with permission from Elsevier.
Figure 8
Screenshot of XConc Suite. The XConc Suite consists of three plug-ins for the Eclipse platform: an XML editor (A), a concordancer (B is the query editor and C is the result view), and an ontology browser (D), which supports both the editor and the concordancer in the selection of ontology terms.
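
The captions for Figures 2 and 6 describe events that link to themes, causes, and locations, where a participant can itself be an event. As a rough illustration only, the two readings in Figure 6 might be modelled as below. This is a minimal sketch, not the GENIA XML encoding: the class names `Entity` and `Event`, the helper `describe`, and the labels "Activation" and "Expression" are all hypothetical and are not taken from the GENIA event ontology.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Entity:
    """A term mention such as a protein name (hypothetical structure)."""
    text: str


@dataclass
class Event:
    """An event whose themes and causes may be entities or other events."""
    event_type: str  # illustrative label, not a GENIA ontology class name
    themes: List["Participant"] = field(default_factory=list)
    causes: List["Participant"] = field(default_factory=list)
    locations: List[Entity] = field(default_factory=list)


Participant = Union[Entity, Event]


def describe(p: Participant) -> str:
    """Render an entity or a (possibly nested) event as a compact string."""
    if isinstance(p, Entity):
        return p.text
    parts = [f"theme={describe(t)}" for t in p.themes]
    parts += [f"cause={describe(c)}" for c in p.causes]
    parts += [f"location={describe(loc)}" for loc in p.locations]
    return f"{p.event_type}({', '.join(parts)})"


lmp1 = Entity("LMP1")
nfkb = Entity("NF-kappa B")

# Reading (A): "LMP1 activates NF-kappa B" (the cause is an entity).
event_a = Event("Activation", themes=[nfkb], causes=[lmp1])

# Reading (B): "expression of LMP1 activates NF-kappa B" (the cause is an event).
event_b = Event(
    "Activation",
    themes=[nfkb],
    causes=[Event("Expression", themes=[lmp1])],
)

print(describe(event_a))
print(describe(event_b))
```

The only difference between the two readings is whether the cause is the entity itself or an event involving that entity, which is the distinction the Figure 6 caption draws.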

