2018 Sep 6;9(1):23.
doi: 10.1186/s13326-018-0189-6.

Using predicate and provenance information from a knowledge graph for drug efficacy screening

Wytze J Vlietstra et al. J Biomed Semantics.

Abstract

Background: Biomedical knowledge graphs have become important tools to computationally analyse the comprehensive body of biomedical knowledge. They represent knowledge as subject-predicate-object triples, in which the predicate indicates the relationship between subject and object. A triple can also contain provenance information, which consists of references to the sources of the triple (e.g. scientific publications or database entries). Knowledge graphs have been used to classify drug-disease pairs for drug efficacy screening, but existing computational methods have often ignored predicate and provenance information. Using this information, we aimed to develop a supervised machine learning classifier and determine the added value of predicate and provenance information for drug efficacy screening. To ensure the biological plausibility of our method, we performed our research at the protein level, where drugs are represented by their drug target proteins and diseases by their disease proteins.
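To make the triple representation concrete, the following is a minimal Python sketch of a subject-predicate-object triple that also carries provenance. The class name, field names, helper functions, and all protein, predicate, and source identifiers are hypothetical illustrations; they are not taken from the authors' knowledge graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One subject-predicate-object statement plus its sources (hypothetical layout)."""
    subject: str
    predicate: str
    obj: str
    provenance: frozenset = frozenset()  # references to sources of the triple


# Hypothetical toy triples; identifiers are placeholders, not real graph content.
TRIPLES = [
    Triple("PROTEIN_A", "interacts_with", "PROTEIN_B", frozenset({"source_1", "source_2"})),
    Triple("PROTEIN_A", "inhibits", "PROTEIN_B", frozenset({"source_2"})),
    Triple("PROTEIN_B", "activates", "PROTEIN_C", frozenset({"source_3"})),
]


def predicates_between(triples, subject, obj):
    """Collect every predicate that links subject to obj."""
    return {t.predicate for t in triples if t.subject == subject and t.obj == obj}


def provenance_between(triples, subject, obj):
    """Collect every source that supports any triple from subject to obj."""
    sources = set()
    for t in triples:
        if t.subject == subject and t.obj == obj:
            sources |= t.provenance
    return sources
```

A method that ignores predicates and provenance would only see that PROTEIN_A and PROTEIN_B are connected; the two helpers expose the additional information that the abstract proposes to use.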

Results: Using random forests with repeated 10-fold cross-validation, our method achieved an area under the ROC curve (AUC) of 78.1% and 74.3% for two reference sets. We benchmarked against a state-of-the-art knowledge-graph technique that does not use predicate and provenance information, obtaining AUCs of 65.6% and 64.6%, respectively. Classifiers that used only predicate information outperformed classifiers that used only provenance information, but classifiers that used both performed best.
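The evaluation protocol named above can be sketched in plain Python. This is an illustration only: the abstract does not state the number of repeats, the shuffling scheme, or how the AUC was computed, so the `repeats` and `seed` parameters and the rank-based AUC formula are assumptions, and the random forest itself is left out because it would require an external library.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation.

    The repeat count and seed are assumed values, not taken from the source.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for every repeat
        folds = [indices[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In such a setup, a classifier would be fitted on each `train` split, scored on the matching `test` split, and the per-fold AUCs averaged over all folds and repeats.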

Conclusion: We conclude that both predicate and provenance information provide added value for drug efficacy screening.

Keywords: Computational pharmacology; Drug efficacy screening; Drug repurposing; Knowledge graph; Machine learning; Predicate; Provenance; Systems pharmacology.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
The three included relationship scenarios. The three scenarios of relationships between drug targets and disease proteins are shown along with examples which can be found in the knowledge graph. a Drug target (DT) and disease protein (DP) are the same protein. The protein may have a relationship with itself (dotted line). b DT and DP have a direct relationship. c DT and DP have an indirect relationship through an intermediate protein (IP). Indirect relationships consist of two steps (DTIP and IPDP)
Fig. 2
Schematic overview of the feature extraction and classification process. For the sake of readability, this overview figure only shows the process for predicates. The input set contains the combinations of drug targets (DT) and disease proteins (DP) that are to be classified. Step 1: Extract paths. The paths between drug targets and disease proteins are extracted from the knowledge graph. Paths can be direct or indirect. Indirect paths have one intermediate protein (IP) and are separated in two steps: DTIP (drug target – intermediate protein) and IPDP (intermediate protein – disease protein). Step 2: Extract features. The feature set consists of all possible predicates and provenance, for each of the three scenarios (cf. Fig. 1). Based on the extracted paths for a combination, the presence or absence of each feature is set. Step 3: Classify. Based on the extracted features, the combinations are classified by a random forest classifier
Fig. 3
The most important features for a cross-validation experiment. The top-20 most important features when trained on the complete feature set are presented. The importance measures, calculated with the standard feature importance calculation function of the random forest algorithm, have been normalized. The colours indicate whether it is a predicate, provenance, or overlap feature. While knowledge sources such as SemMedDB contain information about relationships between many types of entities, we only used the protein-protein interaction (PPI) subsets of these datasets
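The path and feature extraction steps described in the captions of Fig. 1 and Fig. 2 can be sketched as follows, showing predicates only, as in Fig. 2. This is a hedged illustration and not the authors' implementation: the toy edges, the function names, the feature labels ("same", "direct", "DTIP", "IPDP"), the treatment of edges as directed, and the choice to skip indirect paths when DT and DP are the same protein are all assumptions.

```python
from collections import defaultdict

# Hypothetical toy graph as (subject, predicate, object) tuples.
EDGES = [
    ("P1", "binds", "P2"),
    ("P2", "activates", "P3"),
    ("P1", "inhibits", "P3"),
    ("P4", "binds", "P4"),
]


def build_index(edges):
    """Map each (subject, object) pair to the set of predicates linking them."""
    index = defaultdict(set)
    for subject, predicate, obj in edges:
        index[(subject, obj)].add(predicate)
    return index


def extract_features(index, proteins, dt, dp):
    """Return the set of (scenario, predicate) features present for one DT-DP pair."""
    features = set()
    if dt == dp:
        # Scenario a: same protein, possibly with a relationship to itself.
        features.update(("same", p) for p in index.get((dt, dp), ()))
        return features
    # Scenario b: direct relationship between drug target and disease protein.
    features.update(("direct", p) for p in index.get((dt, dp), ()))
    # Scenario c: indirect relationship through one intermediate protein (IP).
    for ip in proteins:
        if ip in (dt, dp):
            continue
        first = index.get((dt, ip), ())
        second = index.get((ip, dp), ())
        if first and second:
            features.update(("DTIP", p) for p in first)
            features.update(("IPDP", p) for p in second)
    return features


def to_vector(features, vocabulary):
    """Encode presence (1) or absence (0) of each feature in a fixed vocabulary."""
    return [1 if feature in features else 0 for feature in vocabulary]
```

The resulting binary vectors, one per drug target and disease protein combination, are the kind of input that a random forest classifier could consume in Step 3.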

References

    1. Ehrlinger L, Wöß W. Towards a definition of knowledge graphs. CEUR Workshop Proc. 2016;1695
    1. Manola F, Miller E. W3C.org Triple specification. [cited 2018 Jun 4]. Available from: https://www.w3.org/TR/rdf-concepts/#dfn-rdf-triple
    1. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39:691–697. doi: 10.1093/nar/gkq1018. - DOI - PMC - PubMed
    1. Chen H, Ding L, Wu Z, Yu T, Dhanapalan L, Chen JY. Semantic web for integrated network analysis in biomedicine. Brief Bioinform. 2009;10:177–192. doi: 10.1093/bib/bbp002. - DOI - PubMed
    1. Vlietstra WJ, Zielman R, van Dongen RM, Schultes EA, Wiesman F, Vos R, et al. Automated extraction of potential migraine biomarkers using a semantic graph. J Biomed Inform. 2017;71:178–189. doi: 10.1016/j.jbi.2017.05.018. - DOI - PubMed

Publication types