PLoS Biol. 2004 Nov;2(11):e309.
doi: 10.1371/journal.pbio.0020309. Epub 2004 Sep 21.

Textpresso: an ontology-based information retrieval and extraction system for biological literature

Hans-Michael Müller et al. PLoS Biol. 2004 Nov.

Abstract

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype) and classes that relate two objects (e.g., association, regulation) or describe one (e.g., biological process). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators at identifying sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase in search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as a search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
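
The sentence-level search described in the abstract (one or more category tags and/or keywords that must co-occur within a sentence) can be illustrated with a minimal sketch. This is not Textpresso's actual implementation: the `TaggedSentence` type, the `search` function, and the category labels attached to the toy sentences are all assumptions made for illustration, and the sentences are adapted from the example quoted in Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedSentence:
    """A sentence plus the ontology categories found in it (hypothetical type)."""
    text: str
    categories: frozenset = frozenset()


def search(sentences, keywords=(), categories=()):
    """Return the sentences that contain every keyword AND every category tag."""
    wanted = frozenset(categories)
    hits = []
    for sentence in sentences:
        lowered = sentence.text.lower()
        has_keywords = all(k.lower() in lowered for k in keywords)
        has_categories = wanted <= sentence.categories
        if has_keywords and has_categories:
            hits.append(sentence)
    return hits


# Toy corpus; the category assignments are illustrative, not curated data.
corpus = [
    TaggedSentence(
        "In par-1, par-4 and par-3 mutant four-cell embryos, "
        "MEX-3 is present at high levels in all cells.",
        frozenset({"gene", "cell or cell group"}),
    ),
    TaggedSentence(
        "Activity of these par genes is required to restrict MEX-3 to the anterior.",
        frozenset({"gene", "regulation"}),
    ),
    TaggedSentence("Worms were maintained on agar plates.", frozenset()),
]

if __name__ == "__main__":
    for hit in search(corpus, keywords=["MEX-3"], categories=["regulation"]):
        print(hit.text)
```

Requiring the keyword and the category to match in the same sentence, not merely in the same document, is what narrows a query toward sentences that may state a relationship between the matched terms.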

Conflict of interest statement

The authors have declared that no conflicts of interest exist.

Figures

Figure 1. The Process of Marking up a Sentence
The process of marking up the sentence “In par-1, par-4 and par-3 mutant four-cell embryos, MEX-3 is present at high levels in all cells, indicating that activity of these par genes is required to restrict MEX-3 to the anterior.” This sentence is taken from Huang et al. (2002). (A) The computer identifies terms that are stored in a lexicon according to categories of the ontology. A text-to-XML converter marks up the terms by enclosing them in XML tags. (B) The fully marked-up sentence. Some categories have subcategories (for example, the category “regulation” is subdivided into “positive,” “negative,” and “unknown”). Grammar attributes have been omitted here for the sake of clarity, because they are not used in the current version of the system. Some white space has been inserted in the graphic for clarity.
Figure 2. A Typical Result Page Returned from a Simple Retrieval Query (Keyword)
A simple retrieval was performed with “let-23” as keyword and “regulation,” “cell or cell group,” and “molecular function” as categories. A total of 245 matches were found in 113 publications.
Figure 3. Schema of Small-Scale Information Retrieval Study
Sentences from eight journal articles were both queried by Textpresso and evaluated by a human expert for sentences that described genetic interaction (information retrieval task). In the information extraction task, a human expert inspected the sentences returned by each method to determine the number of distinct gene-gene interactions that could be extracted, in order to analyze the output of the first task.
Figure 4. Schema of Textpresso Database Preparation
The regular hexagons indicate the sources from which Textpresso is built. The rounded rectangles are either intermediate or final processed parts of the corpus. The dashed-dotted rectangles signify automatic processing units or actions.
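
The markup step described in the Figure 1 caption (look up lexicon terms by category, then enclose each match in XML) can be sketched as follows. This is a simplified illustration under stated assumptions, not the converter used by Textpresso: the `LEXICON` contents, the tag names, and the `mark_up` helper are hypothetical, and subcategories and grammar attributes are left out.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical toy lexicon: category name -> terms. The real lexicon is far larger.
LEXICON = {
    "gene": ["par-1", "par-3", "par-4"],
    "regulation": ["restrict"],
}


def build_pattern(lexicon):
    """Compile one regex matching any lexicon term, longest terms first."""
    term_to_category = {}
    for category, terms in lexicon.items():
        for term in terms:
            term_to_category[term.lower()] = category
    ordered = sorted(term_to_category, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # The lookarounds stop "par-1" from matching inside "par-10".
    pattern = re.compile(rf"(?<![\w-])({alternation})(?![\w-])", re.IGNORECASE)
    return pattern, term_to_category


def mark_up(sentence, lexicon):
    """Wrap each lexicon term found in the sentence in a <category> element."""
    pattern, term_to_category = build_pattern(lexicon)
    pieces = []
    position = 0
    for match in pattern.finditer(sentence):
        pieces.append(escape(sentence[position:match.start()]))
        category = term_to_category[match.group(1).lower()]
        pieces.append(f"<{category}>{escape(match.group(1))}</{category}>")
        position = match.end()
    pieces.append(escape(sentence[position:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(mark_up("In par-1, par-4 and par-3 mutant four-cell embryos", LEXICON))
```

Matching the longest terms first and escaping the untagged text are design choices of this sketch; they keep multi-word or hyphenated terms intact and the output well-formed.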
