BMC Bioinformatics. 2008 Jan 31:9:78.
doi: 10.1186/1471-2105-9-78.

OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression

Lawrence Hunter et al. BMC Bioinformatics.

Abstract

Background: Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering.

Results: OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Each system was evaluated, with F-scores ranging from 0.26 to 0.72 (precision 0.39 to 0.85, recall 0.16 to 0.85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances.
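
For readers unfamiliar with the metric, the sketch below shows how an F-score combines precision and recall. It assumes the balanced F-measure (beta = 1), which the abstract does not state explicitly. The function name `f_score` and the input values are illustrative only and are not taken from the paper or from the OpenDMAP code.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1.0 gives the balanced F-measure; that choice is an
    assumption here, not something the abstract specifies.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        # Both inputs are zero: define the score as 0 to avoid dividing by zero.
        return 0.0
    return (1.0 + beta ** 2) * precision * recall / denominator


# Illustrative values only, not results reported in the paper.
print(round(f_score(0.8, 0.4), 2))  # prints 0.53
```

Because the harmonic mean is pulled toward the smaller input, a system with high precision but low recall still receives a modest F-score.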

Conclusion: OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction. The open source OpenDMAP code library is freely available at http://bionlp.sourceforge.net/

Figures

Figure 1
OpenDMAP coverage of MEDLINE. The gray bars indicate the number of journals indexed by MEDLINE each year. The red bars indicate the number of journal abstracts from which OpenDMAP extracted at least one assertion regarding transport, interaction or expression. In recent years, more than 40% of biomedical journals contain such information. 2007 is partial data (through July 1).
Figure 2
Screenshot of the Protégé ontology for the protein transport task. The slots of the protein transport class are shown in the lower right panel of the screenshot. Note that the subclasses of Cellular Component and Protein Transport are not shown.
