Using text to build semantic networks for pharmacogenomics

Adrien Coulet¹, Nigam H Shah, Yael Garten, Mark Musen, Russ B Altman

PMID: 20723615
PMCID: PMC2991587
DOI: 10.1016/j.jbi.2010.08.005

Adrien Coulet et al. J Biomed Inform. 2010 Dec;43(6):1009-19. Epub 2010 Aug 17.

Affiliation

¹ Department of Medicine, 300 Pasteur Drive, Room S101, Mail Code 5110, Stanford University, Stanford, CA 94305, USA.

Abstract

Most pharmacogenomics knowledge is contained in the text of published studies, and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), rules and ontologies may not be available. Recent progress in syntactic NLP parsing in the context of a large corpus of pharmacogenomics text provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and to map them to a common schema. Our extracted relationships have a 70-87.7% precision and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and provide a computable resource for knowledge discovery.

Figures

**Figure 1**
Overview of our method to extract pharmacogenomics (PGx) relationships from text. The method has four steps. 1. We parse the text (Medline abstracts in this work) with the Stanford Parser to yield the Dependency Graph data structure that provides the syntactic structure of each sentence. 2. We identify PGx entities and their *raw relationships* (“raw” because their subject, object and type use natural language terms). 3. We process these raw relationships to build *(first run only)* or refine *(next runs)* an ontology of PGx relationships. 4. We map each raw relationship to the ontology and express it in normalized form. Normalized relationships create a network in which nodes are PGx entities and edges are relationships, both of which are associated with precise semantics.

**Figure 2**
Sample parse tree of the sentence “*Several single nucleotide polymorphisms (SNPs) in* VKORC1 *are associated with warfarin dose across the normal dose range.*” (PubMed ID 17161452). This parse tree is obtained when querying an index (built in previous work) with query (1) that looks for two pharmacogenomics key entities: *VKORC1 (a gene)* and *warfarin (a drug)*.

**Figure 3**
The Stanford Parser creates a Dependency Graph (DG) data structure from the parse tree, such as this one corresponding to the parse tree in Figure 2. Its two seeds are *VKORC1* and *warfarin*, and its root is *associated*. Solid lines represent the path that connects both seeds to each other via the root. This path is used in the next step to extract the following raw relationship: associated(*VKORC1_polymorphisms*, *warfarin_dose*).

**Figure 4**
Four steps of recognizing and expanding the two seeds in the example sentence shown in Figures 2 and 3. Starting with the seed entities, VKORC1 and warfarin, we use the rules provided in Table 1 to traverse the Dependency Graph in Figure 3 to recognize the subject (VKORC1_polymorphisms), object (warfarin_dose) and relationship (associated) in the Dependency Graph.
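
The traversal described in Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dependency edges are written by hand for the example sentence rather than produced by the Stanford Parser, the rules of Table 1 are replaced by a single simplification (every token between a seed and the root is treated as a modifier of that seed), and the names `HEAD`, `path_to_root`, and `raw_relationship` are hypothetical.

```python
# Hand-written toy dependency edges (dependent -> head) for the example
# sentence. These are an assumption for illustration, NOT real parser output.
HEAD = {
    "VKORC1": "polymorphisms",
    "polymorphisms": "associated",
    "warfarin": "dose",
    "dose": "associated",
}


def path_to_root(token, head=HEAD):
    """Follow head links from a token up to the root of the sentence."""
    path = [token]
    while path[-1] in head:
        path.append(head[path[-1]])
    return path


def raw_relationship(subject_seed, object_seed):
    """Build (type, subject, object) from the two seed-to-root paths.

    Simplification: every token between a seed and the root is treated
    as a modifier of that seed, and subject/object roles are taken from
    the argument order rather than from grammatical rules.
    """
    subject_path = path_to_root(subject_seed)
    object_path = path_to_root(object_seed)
    if subject_path[-1] != object_path[-1]:
        raise ValueError("seeds are not connected through a common root")
    subject = "_".join(subject_path[:-1])
    obj = "_".join(object_path[:-1])
    return (subject_path[-1], subject, obj)


relationship = raw_relationship("VKORC1", "warfarin")
print(relationship)
```

Under these assumptions the sketch reproduces the raw relationship named in Figure 3, associated(*VKORC1_polymorphisms*, *warfarin_dose*).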

**Figure 5**
A raw relationship derived from the dependency graph has three components: relationship type, subject and object. Both subject and object can be either a single PGx key entity (*e.g.*, *warfarin*) or a modified entity using the key entity as a modifier (*e.g.*, *VKORC1_expression*).

**Figure 6**
Three raw relationships normalized to two normalized expressions, using the PHARE (PHArmacogenomics RElationship) ontology of entities and relationships. The content of this ontology is described in Section 4. In this example, the first two raw relationships express the same relationship, according to the mappings in our ontology (e.g. drug dose and drug requirement are declared synonyms in the ontology). The third raw relationship maps to a more specific relationship (increases), which is a child of the more general (associated) relationship.
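
One way to picture how three raw relationships collapse into two normalized ones is sketched below. Everything here is an assumption made for illustration: the two dictionaries stand in for small fragments of an ontology such as PHARE, the placeholder entities `GENE` and `DRUG` are not taken from the figure, and `normalize` and `is_a` are hypothetical helper names.

```python
# Hypothetical ontology fragments, for illustration only.
RELATION_PARENT = {"increases": "associated"}  # child type -> parent type
MODIFIER_SYNONYM = {"requirement": "dose"}     # synonym -> preferred term


def normalize(raw):
    """Rewrite each underscore-separated part of subject and object
    to its preferred term, leaving the relationship type unchanged."""
    rel_type, subject, obj = raw

    def canon(entity):
        return "_".join(MODIFIER_SYNONYM.get(part, part)
                        for part in entity.split("_"))

    return (rel_type, canon(subject), canon(obj))


def is_a(rel_type, ancestor):
    """True if rel_type equals ancestor or descends from it."""
    while rel_type is not None:
        if rel_type == ancestor:
            return True
        rel_type = RELATION_PARENT.get(rel_type)
    return False


# Placeholder raw relationships (not data from the paper).
raw_relationships = [
    ("associated", "GENE_polymorphisms", "DRUG_dose"),
    ("associated", "GENE_polymorphisms", "DRUG_requirement"),
    ("increases", "GENE_polymorphisms", "DRUG_dose"),
]

# A set removes the duplicate created once synonyms are resolved.
normalized = {normalize(r) for r in raw_relationships}
print(len(normalized))
print(is_a("increases", "associated"))
```

Keeping the child-to-parent links separate from the synonym table means a query for the general type (associated) can still retrieve the more specific statement (increases) without discarding the extra detail.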

**Figure 7**
Starting with the text *“differences in warfarin requirements”,* we extract the raw entity “*warfarin_requirements_differences*” and then apply normalization using the PHARE ontology. The first step ensures that the standard name for warfarin is used (here, Coumadin would have been mapped to warfarin, had it been used). *Warfarin* is the seed and the concept associated with it, noted C_seed, is Drug according to the ontology. The second step maps “requirements” to the standard ontological concept of dose, and the final step maps “differences” to the ontology concept of variation. Having learned these mappings on our initial training corpus, we can apply them broadly and prospectively to new sentences.
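
The three normalization steps in Figure 7 might look like this in code. This is a sketch under stated assumptions: the lookup tables contain only the mappings mentioned in the caption, the seed is assumed to be the first underscore-separated token of the raw entity, and the function name `normalize_entity` and the returned tuple layout are hypothetical rather than part of PHARE.

```python
# Lookup tables holding only the mappings named in the Figure 7 caption.
SEED_SYNONYM = {"coumadin": "warfarin"}   # step 1: standard seed name
SEED_CONCEPT = {"warfarin": "Drug"}       # concept of the seed (C_seed)
MODIFIER_CONCEPT = {                      # steps 2 and 3: modifier concepts
    "requirements": "dose",
    "differences": "variation",
}


def normalize_entity(raw_entity):
    """Return (standard seed, seed concept, modifier concepts).

    Assumes the seed is the first token of the raw entity; modifiers
    with no known mapping are kept unchanged.
    """
    seed, *modifiers = raw_entity.split("_")
    standard_seed = SEED_SYNONYM.get(seed.lower(), seed)
    concept = SEED_CONCEPT.get(standard_seed)
    mapped = tuple(MODIFIER_CONCEPT.get(m, m) for m in modifiers)
    return (standard_seed, concept, mapped)


print(normalize_entity("warfarin_requirements_differences"))
print(normalize_entity("Coumadin_requirements_differences"))
```

Both calls yield the same normalized form, which is the point of the first step: a brand name and the standard drug name end up as one entity.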

**Figure 8**
Two semantic networks extracted for the VKORC1 gene. (a) Displays Pharmacogenomics (PGx) relationships extracted from sentences that contain VKORC1 or one of its synonyms as a key entity. Thus, for example, it shows that VKORC1 predicts warfarin drug dose. (b) Displays PGx relationships for entities that are modified by VKORC1 (e.g. VKORC1_haplotype, VKORC1_variant). Thus, for example, VKORC1 haplotypes influence warfarin drug effect. Each node represents a PGx key or modified entity, *e.g. warfarin* or *warfarin_drug_effect*. Edges represent relationships between these entities that are mentioned in MEDLINE abstracts. When several sentences mention a relationship between the same two entities, the edge is wider and is labeled with the most frequent types of relationship. Networks have been generated using Cytoscape v2.6.3 (http://www.cytoscape.org).
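
The edge width and labeling convention in Figure 8 amounts to counting mentions per entity pair. A minimal sketch follows, assuming edges are directed from subject to object and that only the single most frequent relationship type is used as the label; the mention counts below are invented for illustration, and `aggregate_edges` is a hypothetical name.

```python
from collections import Counter, defaultdict


def aggregate_edges(relationships):
    """Collapse per-sentence relationships into one edge per entity pair.

    Each edge records how many sentences support it ("width") and the
    most frequent relationship type ("label").
    """
    types_by_pair = defaultdict(Counter)
    for rel_type, subject, obj in relationships:
        types_by_pair[(subject, obj)][rel_type] += 1
    return {
        pair: {
            "width": sum(counts.values()),
            "label": counts.most_common(1)[0][0],
        }
        for pair, counts in types_by_pair.items()
    }


# Invented mention counts; only the entity names come from the caption.
mentions = [
    ("predicts", "VKORC1", "warfarin_drug_dose"),
    ("predicts", "VKORC1", "warfarin_drug_dose"),
    ("associated", "VKORC1", "warfarin_drug_dose"),
    ("influences", "VKORC1_haplotype", "warfarin_drug_effect"),
]

edges = aggregate_edges(mentions)
for pair, attributes in edges.items():
    print(pair, attributes)
```

The resulting dictionary could then be exported as an edge table for a network viewer, with `width` and `label` as edge attributes.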

**Figure 9**
A summary of the Pharmacogenomics (PGx) Concept Network. Nodes represent concepts frequently appearing in PGx relationships; node size reflects the number of instantiated PGx entities. Edges represent relationships between instances of two concepts; edge width reflects the number of such relationships. This network has been built from the knowledge base of 41,134 relationships extracted from the text of Medline abstracts. Thus, for example, there are many statements in the PGx literature relating drugs to genes, and genes to diseases. There are somewhat fewer relating drug metabolism specifically to genomic variation. This network has been generated using Cytoscape v2.6.3 (http://www.cytoscape.org).
