Linking genes to literature: text mining, information extraction, and retrieval applications for biology

Martin Krallinger¹, Alfonso Valencia, Lynette Hirschman

PMID: 18834499
PMCID: PMC2559992
DOI: 10.1186/gb-2008-9-s2-s8

Review

Genome Biol. 2008;9 Suppl 2(Suppl 2):S8. doi: 10.1186/gb-2008-9-s2-s8. Epub 2008 Sep 1.

Affiliation

¹ Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain. mkrallinger@cnio.es

Abstract

Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available online at http://zope.bioinfo.cnio.es/bionlp_tools/.

Figures

**Figure 1**
Overview of the main aspects relevant to the development of biomedical literature processing systems. ATCR, *Arabidopsis Thaliana* Circadian Rhythms; EMBASE, Excerpta Medica Database; FMA, Foundational Model of Anatomy; GENIA, GENome Information Acquisition; GO, Gene Ontology; IEPA, Interaction Extraction Performance Assessment; MGI, Mouse Genome Informatics; MO dbs, Model Organism databases; OBO, Open Biomedical Ontologies; RGD, Rat Genome Database; SGD, *Saccharomyces* Genome Database; TAIR, The *Arabidopsis* Information Resource.

**Figure 2**
Main natural language processing levels, from word tokenization to semantics. The different processing layers for a given example sentence are shown here. The example is based on the output generated by the GENIA tagger. Part-of-speech tags: DT, determiner; IN, preposition or subordinating conjunction; JJ, adjective; NN, noun (singular or mass); NNS, noun (plural); VBZ, verb (third person singular present). The B/I/O terminology refers to the beginning of a phrase (B), a position internal to a phrase (I), and a position outside of any phrase (O).
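The B/I/O labelling described in the Figure 2 caption can be illustrated with a minimal sketch. The sentence, its chunk tags, and the function name `chunks_from_bio` are invented here for illustration; they are not taken from the figure and are not actual GENIA tagger output. The sketch assumes tags of the form `B-NP`, `I-NP`, and `O`, and groups consecutive tokens into phrases.

```python
from typing import List, Optional, Tuple


def chunks_from_bio(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (token, chunk_tag) pairs into (phrase_type, phrase_text) spans.

    Tags are assumed to look like "B-NP" (begin phrase), "I-NP" (inside
    phrase), or "O" (outside any phrase).
    """
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the phrase being built, if there is one.
        nonlocal current_type, current_tokens
        if current_tokens and current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in tagged:
        if tag.startswith("I-") and tag[2:] == current_type:
            # Continue the phrase that is currently open.
            current_tokens.append(token)
        elif tag.startswith("B-") or tag.startswith("I-"):
            # A "B-" tag (or a stray "I-" tag of a different type)
            # starts a new phrase.
            flush()
            current_type, current_tokens = tag[2:], [token]
        else:
            # "O": the token belongs to no phrase.
            flush()
    flush()
    return spans


# Hand-labelled illustrative sentence (an assumption, not tagger output).
example = [
    ("The", "B-NP"), ("protein", "I-NP"),
    ("binds", "B-VP"),
    ("to", "B-PP"),
    ("DNA", "B-NP"),
    (".", "O"),
]

for phrase_type, text in chunks_from_bio(example):
    print(phrase_type, text)
```

Treating a stray `I-` tag as the start of a new phrase is one lenient design choice; a stricter decoder could instead reject such input.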

**Figure 3**
Biomedical text mining applications from the biology user perspective. This figure provides a simplified overview of some existing biomedical text mining applications. The main user query types currently addressed by literature processing applications are shown in the center of the figure. The outer circles represent the types of implemented applications as well as some of the corresponding systems. Note that some tools could in principle be associated with several application types (but only one of them is illustrated here). For a more detailed description of the displayed systems, refer to the online tool collection repository.
