Bioinformatics. 2013 Sep 1;29(17):2186-94.
doi: 10.1093/bioinformatics/btt359. Epub 2013 Jul 4.

Towards building a disease-phenotype knowledge base: extracting disease-manifestation relationship from literature

Rong Xu et al. Bioinformatics.

Abstract

Motivation: Systems approaches to studying phenotypic relationships among diseases are emerging as an active area of research for both novel disease gene discovery and drug repurposing. Currently, systematic study of disease phenotypic relationships on a phenome-wide scale is limited because large-scale machine-understandable disease-phenotype relationship knowledge bases are often unavailable. Here, we present an automatic approach to extract disease-manifestation (D-M) pairs (one specific type of disease-phenotype relationship) from the wide body of published biomedical literature.

Data and methods: Our method leverages external knowledge and limits the amount of human effort required. For the text corpus, we used 119 085 682 MEDLINE sentences (21 354 075 citations). First, we used D-M pairs from existing biomedical ontologies as prior knowledge to automatically discover D-M-specific syntactic patterns. We then extracted additional pairs from MEDLINE using the learned patterns. Finally, we analysed correlations between disease manifestations and disease-associated genes and drugs to demonstrate the potential of this newly created knowledge base in disease gene discovery and drug repurposing.
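The learn-then-extract loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract refers to syntactic patterns, whereas this sketch substitutes plain surface strings (the words between the two terms) to keep the example short. The function names `learn_patterns` and `extract_pairs`, the `min_count` threshold, the term lists, and the example sentences are all hypothetical.

```python
from collections import Counter


def learn_patterns(sentences, seed_pairs, min_count=1):
    """Collect the text linking a known disease to a known manifestation.

    Assumption: a "pattern" here is just the surface string between the
    two terms, plus their order. The paper's syntactic patterns are richer.
    """
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for disease, manifestation in seed_pairs:
            d, m = disease.lower(), manifestation.lower()
            i, j = text.find(d), text.find(m)
            if i == -1 or j == -1:
                continue  # this sentence does not mention both seed terms
            if i + len(d) <= j:
                counts[("D", text[i + len(d):j].strip(), "M")] += 1
            elif j + len(m) <= i:
                counts[("M", text[j + len(m):i].strip(), "D")] += 1
    # Keep patterns with non-empty linking text seen at least min_count times.
    return {p for p, c in counts.items() if c >= min_count and p[1]}


def extract_pairs(sentences, patterns, diseases, manifestations):
    """Apply learned patterns to find disease-manifestation pairs."""
    found = set()
    for sentence in sentences:
        text = sentence.lower()
        for first, middle, _ in patterns:
            for d in diseases:
                for m in manifestations:
                    left, right = (d, m) if first == "D" else (m, d)
                    if f"{left.lower()} {middle} {right.lower()}" in text:
                        found.add((d, m))
    return found


# Toy corpus (hypothetical sentences, not taken from MEDLINE).
sentences = [
    "Parkinson disease is characterized by tremor in most patients.",
    "Narcolepsy is characterized by cataplexy and daytime sleepiness.",
]
seed_pairs = [("Parkinson disease", "tremor")]  # stands in for ontology pairs

patterns = learn_patterns(sentences, seed_pairs)
pairs = extract_pairs(
    sentences,
    patterns,
    diseases=["Parkinson disease", "Narcolepsy"],
    manifestations=["tremor", "cataplexy"],
)
```

In this toy run the single seed pair yields one pattern, which then recovers a second pair that was not among the seeds. That mirrors, in miniature, how prior knowledge limits the human effort required.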

Results: In total, we extracted 121 359 unique D-M pairs with a high precision of 0.924. Among the extracted pairs, 120 419 (99.2%) have not been captured in existing structured knowledge sources. We have shown that disease manifestations correlate positively with both disease-associated genes and drug treatments.
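As a quick consistency check on the counts reported above, the share of pairs absent from existing sources can be recomputed from the two totals. The variable names are illustrative only.

```python
# Counts as reported in the Results paragraph.
total_pairs = 121_359   # unique D-M pairs extracted
novel_pairs = 120_419   # pairs not captured in existing structured sources

known_pairs = total_pairs - novel_pairs        # pairs already in existing sources
novel_fraction = novel_pairs / total_pairs     # share of newly captured pairs

print(known_pairs, f"{novel_fraction:.1%}")    # prints: 940 99.2%
```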

Conclusions: The main contribution of our study is the creation of a large-scale and accurate D-M phenotype relationship knowledge base. This unique knowledge base, when combined with existing phenotypic, genetic and proteomic datasets, can have profound implications in our deeper understanding of disease etiology and in rapid drug repurposing.

Availability: http://nlp.case.edu/public/data/DMPatternUMLS/

Figures

Fig. 1. The overall experiment flowchart
Fig. 2. The knowledge-driven D-M relationship extraction approach
Fig. 3. The correlation between D-M pairs and disease–drug treatment pairs
Fig. 4. The correlation between D-M pairs (extracted from MEDLINE and from UMLS) and disease-associated genes from OMIM
Fig. 5. The correlation between D-M pairs (extracted from MEDLINE) and disease-associated genes from GWAS
