PLoS Biol. 2005 Feb;3(2):e65.
doi: 10.1371/journal.pbio.0030065.

Facts from text--is text mining ready to deliver?

Dietrich Rebholz-Schuhmann et al. PLoS Biol. 2005 Feb.

Abstract

The mining of information from scientific literature using computational tools has tremendous potential for knowledge discovery, but how close are we to realizing this potential?

Figures

Figure 1. Medline Article Deluge
This figure shows the rapidly growing number of articles available from Medline over the past 65 years (data retrieved from the SRS server at the European Bioinformatics Institute; www.ebi.ac.uk/). In 2003, about 560,000 articles were added to Medline; from 2000 to 2003, 2 million articles were added. (Articles already registered for 2005 are given as well.)
Figure 2. Zipf's Law
Zipf's eponymous law is illustrated by the analysis of 30,000 Medline abstracts (4,952,878 occurrences of words; 144,841 different words). Frequent terms account for a large portion of the text, but a large fraction of terms appear at a low frequency and often only once (69,782 words appear only once). Zipf was a professor of linguistics at Harvard University [3].
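The counts quoted in the Figure 2 caption (word occurrences, distinct words, and words that appear only once) can be reproduced on any text with a few lines of code. The sketch below is illustrative only and is not the authors' analysis: the function name `zipf_profile` and the tokenization rule (lowercased runs of letters and digits) are assumptions, and a different tokenizer would give different counts. By the caption's own figures, 69,782 of 144,841 distinct words, or about 48%, occur exactly once.

```python
import re
from collections import Counter


def zipf_profile(text):
    """Summarize the word-frequency distribution of a text.

    Returns (total_tokens, vocabulary_size, hapax_count, ranked), where
    ranked is a list of (word, count) pairs sorted by descending count.
    Tokenization is an assumption: lowercased runs of letters and digits.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Words that occur exactly once (hapax legomena).
    hapax_count = sum(1 for c in counts.values() if c == 1)
    ranked = counts.most_common()
    return len(tokens), len(counts), hapax_count, ranked


if __name__ == "__main__":
    # Hypothetical sample text; replace with real abstracts to see the long tail.
    sample = "the cat and the dog and the bird"
    total, vocab, hapax, ranked = zipf_profile(sample)
    print(f"tokens={total} vocabulary={vocab} hapax={hapax}")
    # Under Zipf's law, count is roughly proportional to 1 / rank.
    for rank, (word, count) in enumerate(ranked, start=1):
        print(rank, word, count)
```

Plotting the count against the rank on log-log axes gives the roughly straight line usually associated with Zipf's law; the plot is left out here to keep the sketch dependency-free.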
Figure 3. GOAnnotator
The illustrated software tool brings together data from text mining and from databases to support curators in the GO annotation of proteins (Couto FM, Lee V, Dimmer E, Camon E, Apweiler R, et al., unpublished data). Here a protein is shown in conjunction with the GO terms that have been gathered from various databases and attributed to the protein through electronic annotation. Both are evaluated against similar GO terms extracted from text documents. The curator looks into the evidence and decides whether any of the GO terms extracted from the documents should be assigned to the protein.

References

    1. Briscoe T, Carroll J. Robust accurate statistical annotation of general text; Proceedings of the Third International Conference on Language Resources and Evaluation; 2002 May; Canary Islands, Spain: European Language Resources Association; 2002. pp. 1499–1504.
    2. Pyysalo S, Ginter F, Pahikkala T, Koivula J, Boberg J, et al. In: Collier N, Ruch P, Nazarenko, editors. Analysis of link grammar on biomedical dependency corpus targeted at protein-protein interactions; Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications; 2004 August 28–29; Geneva, Switzerland. 2004. pp. 15–21.
    3. Zipf GK. Selective studies and the principle of relative frequency in language. Cambridge (Massachusetts): MIT Press; 1932. 1 v.
    4. Gene Ontology Consortium. Creating the gene ontology resource: Design and implementation. Genome Res. 2001;11:1425–1433.
    5. Müller HM, Kenny EE, Sternberg PW. Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2004;2:e309.