BMC Genomics. 2010 Dec 2;11 Suppl 4(Suppl 4):S24.
doi: 10.1186/1471-2164-11-S4-S24.

Algorithms and semantic infrastructure for mutation impact extraction and grounding

Jonas B Laurila et al. BMC Genomics.

Abstract

Background: Mutation impact extraction is a hitherto unaccomplished task in state-of-the-art mutation extraction systems. Protein mutations and their impacts on protein properties are buried in the scientific literature, making them poorly accessible to protein engineers and inaccessible to phenotype-prediction systems, which currently depend on manually curated genomic variation databases.

Results: We present the first rule-based approach for extracting mutation impacts on protein properties, categorizing their directionality as positive, negative or neutral. Furthermore, protein and mutation mentions are grounded to their respective UniProtKB IDs, and selected protein properties, namely protein functions, are grounded to concepts found in the Gene Ontology. The extracted entities populate an OWL-DL Mutation Impact ontology, which facilitates complex querying for mutation impacts using SPARQL. We illustrate retrieval of proteins and mutant sequences for a given direction of impact on specific protein properties. Moreover, we provide programmatic access to the data through semantic web services using the SADI (Semantic Automated Discovery and Integration) framework.
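
To make the idea of rule-based directionality concrete, the following is a minimal Python sketch, not the authors' GATE/JAPE rules. The keyword lexicons, the function names, and the sentence-level pairing of mutations with impacts are all assumptions chosen for illustration; only the three direction labels and the wild-type/position/mutant notation for point mutations follow the text.

```python
import re

# Hypothetical keyword lexicons (assumptions). The paper's real rules are
# those summarized in its Figure 2 and are not reproduced here.
POSITIVE = {"increased", "enhanced", "improved", "higher"}
NEGATIVE = {"decreased", "reduced", "impaired", "lower", "abolished"}
NEUTRAL = {"unchanged", "unaffected"}

# Point mutation in single-letter notation: wild-type residue, position,
# mutant residue, for example "F172W".
AMINO = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b([{AMINO}])([1-9]\d*)([{AMINO}])\b")


def classify_direction(sentence):
    """Return 'positive', 'negative', 'neutral', or None if no rule fires."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & NEUTRAL:
        return "neutral"
    has_positive = bool(words & POSITIVE)
    has_negative = bool(words & NEGATIVE)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return None  # no cue, or conflicting cues


def extract_impacts(sentence):
    """Pair every point mutation in the sentence with the sentence's direction."""
    direction = classify_direction(sentence)
    if direction is None:
        return []
    return [(m.group(0), direction) for m in POINT_MUTATION.finditer(sentence)]


if __name__ == "__main__":
    print(extract_impacts("The F172W mutant showed reduced dehalogenase activity."))
```

A sentence-level pairing like this is the simplest possible relation rule; a system of the kind described would also need grounding of the mutation to a protein sequence before the pair could be stored.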

Conclusion: We address the problem of access to legacy mutation data in unstructured form by creating novel mutation impact extraction methods, which are evaluated on a corpus of full-text articles on haloalkane dehalogenases tagged by domain experts. Our approaches show state-of-the-art levels of precision and recall for Mutation Grounding, and a respectable level of precision but lower recall for Mutant-Impact relation extraction. The system is deployed using text mining and semantic web technologies, with the goal of publishing to a broad spectrum of consumers.


Figures

Figure 1
Extraction and grounding framework. Full-text documents (1) are run through a GATE pipeline with gazetteers derived from Swiss-Prot (2) and created with MutationFinder (3). Mutations and proteins are grounded (4). Protein properties are extracted with use of MuNPEx and custom JAPE rules (5) and grounded to the Gene Ontology when applicable. The impact extractor (6) makes use of the previous annotations to establish relations between mutants and impacts on protein properties. The output consists of annotated text (8).
Figure 2
Rules for impact classification.
Figure 3
Mutation impact ontology structure. Visualization of top-level concepts such as Mutation Specification, Protein, Mutation Impact and Protein Property, connected through object properties. Detailed descriptions of the concepts are provided in Table 2.
Figure 4
SPARQL query and answers. A SPARQL query expressing the natural language question “Which proteins have been mutated so that there is a negative impact on haloalkane dehalogenase activity and what are the sequences of the corresponding mutants?” is shown to the left. The first four answers (result rows) are displayed to the right.
Figure 5
Mutation impact knowledge flow. The text-to-entity SADI service uses the text mining pipeline to extract mutations and impacts from a given text. The results are saved in an RDF triple store. The triple store can then be interrogated, either by a user through a SPARQL endpoint or by a second layer of entity-to-entity SADI services that in turn can be accessed through a SADI client.
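
The kind of question posed in Figure 4 can be illustrated without a SPARQL engine by matching patterns over a small in-memory list of triples. This is a rough Python sketch under stated assumptions: the predicate names (`isMutantOf`, `hasSequence`, `hasImpact`, `hasDirection`, `affectsProperty`), the identifiers, and the sequences are placeholders invented for the example, not terms from the Mutation Impact ontology or data from the paper.

```python
# Toy triple store. All identifiers, predicate names, and sequences are
# hypothetical placeholders, not data or vocabulary from the paper.
TRIPLES = [
    ("mutant1", "isMutantOf", "proteinA"),
    ("mutant1", "hasSequence", "MSEIG"),
    ("mutant1", "hasImpact", "impact1"),
    ("impact1", "hasDirection", "negative"),
    ("impact1", "affectsProperty", "haloalkane dehalogenase activity"),
    ("mutant2", "isMutantOf", "proteinB"),
    ("mutant2", "hasSequence", "MTAFG"),
    ("mutant2", "hasImpact", "impact2"),
    ("impact2", "hasDirection", "positive"),
    ("impact2", "affectsProperty", "haloalkane dehalogenase activity"),
]


def objects(subject, predicate):
    """Return all objects of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def mutants_with_impact(direction, protein_property):
    """Return (protein, mutant sequence) pairs whose impact matches the filters."""
    results = []
    for mutant, predicate, impact in TRIPLES:
        if predicate != "hasImpact":
            continue
        if direction not in objects(impact, "hasDirection"):
            continue
        if protein_property not in objects(impact, "affectsProperty"):
            continue
        for protein in objects(mutant, "isMutantOf"):
            for sequence in objects(mutant, "hasSequence"):
                results.append((protein, sequence))
    return results


if __name__ == "__main__":
    print(mutants_with_impact("negative", "haloalkane dehalogenase activity"))
```

Each `continue` corresponds to one triple pattern that would appear in the body of a SPARQL `WHERE` clause; a real deployment would issue the query against the RDF triple store through its SPARQL endpoint instead.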
