BMC Bioinformatics. 2010 Aug 9:11:418.
doi: 10.1186/1471-2105-11-418.

eGIFT: mining gene information from the literature

Catalina O Tudor et al. BMC Bioinformatics. 2010.

Abstract

Background: With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly form an overall picture of a specific gene from documents that mention its names and synonyms.

Results: In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks terms by a score that compares how frequently a term occurs in the gene's literature with how frequently it occurs in documents about genes in general. To retrieve a gene's documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many gene names are ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. An additional filtering step retains only those abstracts that focus on the gene rather than mention it in passing. eGIFT's information for a gene is pre-computed, and users can search for genes by name or by EntrezGene identifier. iTerms are grouped into categories to facilitate quick inspection. eGIFT also links each iTerm to sentences mentioning it, so that users can see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT's iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions: Our evaluations suggest that iTerms capture highly relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey the results of high-throughput experiments, but also annotators find articles describing gene aspects and functions.
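
The ranking idea described in the Results can be illustrated with a small sketch. The abstract does not give the actual score, so the formula below (the term's document-frequency fraction in the gene's abstracts, weighted by the log ratio to its smoothed fraction in a background set) is an assumption chosen only for illustration. The function names, the whitespace tokenizer, and the toy abstracts are likewise hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def rank_terms(about_docs, background_docs):
    """Rank terms by an illustrative score (an assumption, not eGIFT's formula)."""
    about_df = document_frequencies(about_docs)
    background_df = document_frequencies(background_docs)
    n_about, n_background = len(about_docs), len(background_docs)
    scores = {}
    for term, df_about in about_df.items():
        p_about = df_about / n_about
        # Add-one smoothing keeps the ratio finite for terms that never
        # appear in the background set.
        p_background = (background_df[term] + 1) / (n_background + 1)
        scores[term] = p_about * math.log(p_about / p_background)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for this example only.
about = [
    "jam-a regulates cell adhesion",
    "adhesion of leukocytes requires jam-a",
    "gene expression of jam-a in adhesion",
]
background = [
    "gene expression in yeast",
    "the gene encodes a kinase",
    "protein kinase gene expression",
    "cell cycle gene",
]
ranked = rank_terms(about, background)
for term, score in ranked[:3]:
    print(f"{term}\t{score:.3f}")
```

In this toy run, terms that appear in every abstract about the gene but never in the background set rise to the top, while a term such as "gene", which is common everywhere, falls to the bottom. A real system would also need phrase detection, stop-word handling, and a much larger background set.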

Figures

Figure 1
eGRAB module for retrieving Medline abstracts about a gene. Gene names and aliases are gathered from EntrezGene to retrieve Medline abstracts. eGRAB filters out abstracts that mention an ambiguous gene name in some other context. This is done by creating a bigram language model for each sense of an ambiguous name and picking the language model that best fits an abstract. DPP is used here to illustrate the ambiguity of gene names: it denotes two different genes, Decapentaplegic and Dentin Phosphoprotein, as well as a technique, differential polarography, among multiple other senses.
Figure 2
Identifying iTerms. iTerms are obtained by ranking important terms based on a score which combines their Background Set and About Set document frequencies. We also attempt to eliminate lexical redundancies among iTerms before displaying them.
Figure 3
Partial iTerms of type functions/processes for gene F11R (JAM-1 or JAM-A). Extra information can be obtained for an iTerm by clicking the arrow to the left of it (see adhesion above). This information includes textual variants, the adjacent terms with which it most often co-occurs (forming bigrams), and frequencies in the Background and Query/About Sets. Ranked sentences can be retrieved by clicking on the term (see leukocyte transmigration above).
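
The disambiguation step in the Figure 1 caption (one bigram language model per sense, keeping the sense whose model best fits an abstract) can be sketched as follows. This is a minimal illustration under stated assumptions, not eGRAB's implementation: the add-one smoothing, the shared vocabulary size, the whitespace tokenizer, the class and function names, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer with a start-of-text marker."""
    return ["<s>"] + text.lower().split()


class BigramModel:
    """Add-one smoothed bigram model for one sense of an ambiguous name."""

    def __init__(self, training_texts, vocab_size):
        self.vocab_size = vocab_size
        self.contexts = Counter()  # how often each token precedes another
        self.bigrams = Counter()   # how often each adjacent pair occurs
        for text in training_texts:
            tokens = tokenize(text)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def log_prob(self, text):
        """Log-probability of the text under this sense's model."""
        tokens = tokenize(text)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def train_models(examples_by_sense):
    """Build one model per sense, sharing a vocabulary size for fair comparison."""
    vocab = {
        tok
        for texts in examples_by_sense.values()
        for text in texts
        for tok in text.lower().split()
    }
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    return {
        sense: BigramModel(texts, vocab_size)
        for sense, texts in examples_by_sense.items()
    }


def best_sense(abstract, models):
    """Return the sense whose model assigns the abstract the highest score."""
    return max(models, key=lambda sense: models[sense].log_prob(abstract))


# Toy training sentences, invented for this example only.
models = train_models({
    "decapentaplegic": [
        "dpp signaling patterns the drosophila wing",
        "dpp is a morphogen in drosophila embryos",
    ],
    "dentin phosphoprotein": [
        "dpp is secreted by odontoblasts during dentin mineralization",
        "dentin matrix contains dpp",
    ],
    "differential polarography": [
        "dpp was used to measure trace metals",
        "trace metals were determined by dpp",
    ],
})
print(best_sense("dpp signaling in the drosophila wing", models))
```

An abstract would be kept for a gene only when the winning sense is that gene. Because the models are compared on the same text, sharing one vocabulary size across senses keeps the smoothing from favoring a sense merely because it has fewer training words.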
