PLoS Comput Biol. 2015 Jul 28;11(7):e1004216.

doi: 10.1371/journal.pcbi.1004216. eCollection 2015 Jul.

Learning the Structure of Biomedical Relationships from Unstructured Text

Bethany Percha¹, Russ B Altman²

Affiliations

¹ Biomedical Informatics Training Program, Stanford University, Stanford, California, United States of America.
² Departments of Medicine, Genetics and Bioengineering, Stanford University, Stanford, California, United States of America.

PMID: 26219079
PMCID: PMC4517797
DOI: 10.1371/journal.pcbi.1004216

Bethany Percha et al. PLoS Comput Biol. 2015.

Abstract

The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually curated sets of (1) pharmacogenomic (PGx) relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways from established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Example of a dependency graph for a Medline 2013 sentence.**
(a) The raw sentence. (b) The complete dependency graph for the sentence. (c) The dependency path connecting the gene CYP3A4 with the drug rifampicin. (d) A more compact representation of the dependency path.
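
As a rough illustration of panel (c), a dependency path between two entities can be read off as the shortest path between their tokens in the dependency graph. The sketch below is not the authors' pipeline: the toy sentence, the hand-written edges, the relation labels, and the function name `dependency_path` are all assumptions made for illustration, and the edges are not the output of any parser. It also assumes each token appears only once in the sentence.

```python
from collections import deque

# Hypothetical, hand-written dependency edges (head, label, dependent) for the
# toy sentence "Rifampicin induces CYP3A4 expression". Not parser output and
# not the sentence shown in Fig 1.
EDGES = [
    ("induces", "nsubj", "Rifampicin"),
    ("induces", "dobj", "expression"),
    ("expression", "nn", "CYP3A4"),
]


def dependency_path(edges, source, target):
    """Return the shortest path from source to target as a list that
    alternates tokens and edge labels, or None if no path exists."""
    # Treat the dependency graph as undirected for path finding.
    neighbors = {}
    for head, label, dep in edges:
        neighbors.setdefault(head, []).append((dep, label))
        neighbors.setdefault(dep, []).append((head, label))

    # Breadth-first search guarantees the first path found is a shortest one.
    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        token, path = queue.popleft()
        if token == target:
            return path
        for nxt, label in neighbors.get(token, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label, nxt]))
    return None


path = dependency_path(EDGES, "CYP3A4", "Rifampicin")
print(" ".join(path))
```

Joining the tokens and labels on such a path into a single string gives one possible compact form in the spirit of panel (d), although the exact representation used in the paper may differ.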

**Fig 2. Classifier performance at the task of recognizing (a) PGx associations (dense matrix), (b) drug-target associations (dense matrix), (c) PGx associations (sparse matrix) and (d) drug-target associations (sparse matrix).**

**Fig 3. Example of ITCC output for a small matrix consisting of drug-CYP3A4 pairs and their associated dependency paths.**
The top heatmap shows the original data after the clustering was performed. An orange square represents an observed path (column) between a given drug-gene pair (row). The bottom heatmap shows the approximate distribution arising from a single ITCC run.
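
The kind of matrix described in this caption, with drug-gene pairs as rows and dependency paths as columns, can be sketched as follows. This is a minimal illustration under stated assumptions: the observations, the path strings, and the function name `build_matrix` are hypothetical, and the code only builds the binary input matrix. It does not implement ITCC or any part of EBC.

```python
# Hypothetical observations: (drug, gene, dependency-path string).
# The path strings are made up for illustration.
OBSERVATIONS = [
    ("rifampicin", "CYP3A4", "induces>expression"),
    ("ketoconazole", "CYP3A4", "inhibits"),
    ("rifampicin", "CYP3A4", "inducer_of"),
    ("itraconazole", "CYP3A4", "inhibits"),
]


def build_matrix(observations):
    """Build a binary matrix with one row per drug-gene pair and one column
    per dependency path; a 1 marks a path observed for that pair."""
    rows = sorted({(drug, gene) for drug, gene, _ in observations})
    cols = sorted({path for _, _, path in observations})
    row_index = {pair: i for i, pair in enumerate(rows)}
    col_index = {path: j for j, path in enumerate(cols)}

    matrix = [[0] * len(cols) for _ in rows]
    for drug, gene, path in observations:
        matrix[row_index[(drug, gene)]][col_index[path]] = 1
    return rows, cols, matrix


rows, cols, matrix = build_matrix(OBSERVATIONS)
for pair, row in zip(rows, matrix):
    print(pair, row)
```

In this toy input, the two inhibitor pairs share a column and so have identical rows, which is the sort of regularity a co-clustering method could exploit when grouping rows and columns.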

**Fig 4. Dendrogram illustrating the semantic relationships among 3514 drug-gene pairs.**
In this dendrogram, the leaves represent 3514 drug-gene pairs that co-occur in Medline sentences at least 5 times, and we have cut the dendrogram at various levels (illustrated by the red lines in the interior of the dendrogram) to produce the colored clusters shown around the edges. Drug-gene pairs that are known drug-target relationships from DrugBank are denoted by blue dots, and those that are known PGx relationships from PharmGKB are denoted by orange dots. The heights of the turquoise bars are proportional to how often the corresponding drug-gene pairs co-occur in Medline sentences (a proxy for how well-documented they are).

**Fig 5. Dendrogram illustrating predictions of novel PGx and drug-target relationships among 3514 drug-gene pairs.**
The height of the bars corresponds to EBC's certainty that the pair in question represents a relationship of the corresponding type (orange: PGx relationships, blue: drug-target relationships). The dots represent known PGx and drug-target relationships, as in Fig 4.
