Review
Genome Biol. 2008;9 Suppl 2(Suppl 2):S8.
doi: 10.1186/gb-2008-9-s2-s8. Epub 2008 Sep 1.

Linking genes to literature: text mining, information extraction, and retrieval applications for biology

Martin Krallinger et al. Genome Biol. 2008.

Abstract

Efficient access to the information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of results. The biological literature is also the main information source for the manual curation carried out by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for the life sciences in terms of the following: the types of biological information demand being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward increasing diversification of application types and techniques, together with the integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available online at http://zope.bioinfo.cnio.es/bionlp_tools/.

Figures

Figure 1
Overview of the main aspects relevant to the development of biomedical literature processing systems. ATCR, Arabidopsis Thaliana Circadian Rhythms; EMBASE, Excerpta Medica Database; FMA, Foundational Model of Anatomy; GENIA, GENome Information Acquisition; GO, Gene Ontology; IEPA, Interaction Extraction Performance Assessment; MGI, Mouse Genome Informatics; MO dbs, Model Organism databases; OBO, Open Biomedical Ontologies; RGD, Rat Genome Database; SGD, Saccharomyces Genome Database; TAIR, The Arabidopsis Information Resource.
Figure 2
Main natural language processing levels, from word tokenization to semantics. The different processing layers for a given example sentence are shown here. This example is based on the output generated by the GENIA tagger: DT, determiner; IN, preposition or subordinating conjunction; JJ, adjective; NN, noun (singular or mass); NNS, noun (plural); VBZ, verb (third person singular present). The B/I/O terminology refers to begin phrase (B), internal to phrase (I), and outside of phrase (O).
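The B/I/O labels in the Figure 2 caption can be made concrete with a small sketch. The Python below is an illustration only, not the GENIA tagger's code or API: the function name `bio_to_chunks`, the example sentence, and its tags are all hypothetical, and the tag format `B-NP` / `I-NP` / `O` is assumed as a common way of writing such labels.

```python
from typing import List, Tuple


def bio_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into phrases from B/I/O chunk tags.

    Assumed tag format (hypothetical, for illustration): "B-X" begins a
    phrase of type X, "I-X" continues it, and "O" is outside any phrase.
    An "I-X" tag that does not continue an open phrase of type X is
    treated like "O" to keep the sketch simple.
    """
    chunks: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the phrase collected so far, if any.
        if current_tokens:
            chunks.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            flush()
            current_type, current_tokens = None, []
    flush()
    return chunks


# Hypothetical sentence and tags, not taken from the figure.
tokens = ["The", "protein", "binds", "to", "DNA", "."]
tags = ["B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "O"]
print(bio_to_chunks(tokens, tags))
```

Each B tag opens a new phrase, I tags extend it, and O closes it, so a flat per-token labeling is enough to recover phrase boundaries.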
Figure 3
Biomedical text mining applications from the biology user perspective. This figure provides a simplified general overview of some existing applications. The main user query types currently addressed by literature processing applications are shown in the center of the figure. The outer circles represent the types of implemented applications as well as some of the corresponding systems. Note that some tools could in principle be associated with several application types, but only one is illustrated here. For a more detailed description of the displayed systems, refer to the online tool collection repository.

