PLoS One. 2018 Jul 26;13(7):e0200699.

doi: 10.1371/journal.pone.0200699. eCollection 2018.

Automatic extraction of gene-disease associations from literature using joint ensemble learning

Balu Bhasuran¹, Jeyakumar Natarajan¹,²

Affiliations

¹ DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
² Data mining and Text mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India.

PMID: 30048465
PMCID: PMC6061985
DOI: 10.1371/journal.pone.0200699

Balu Bhasuran et al. PLoS One. 2018.

Abstract

A wealth of knowledge about the relations between genes and their associated diseases is present in the biomedical literature. Mining these associations from the literature can support research ranging from drug-targetable pathways to biomarker discovery. However, the time and cost of manual curation slow this process considerably. Biomedical text mining is therefore a crucial technology, and relation extraction shows promising results for exploring research on genes associated with diseases. We address this problem from a text mining perspective by developing an automatic method for extracting gene-disease associations from the literature using joint ensemble learning. In the proposed work, we employ a supervised machine learning approach in which a rich feature set covering conceptual, syntactic and semantic properties, jointly learned with word embeddings, is used to train an ensemble of support vector machines for extracting gene-disease relations from four gold standard corpora. In evaluation, the approach shows promising results, with F-measures of 85.34%, 83.93%, 87.39% and 85.57% on the EU-ADR, GAD, CoMAGC and PolySearch corpora, respectively. We strongly believe that the presented novel approach, which combines a rich syntactic and semantic feature set with domain-specific word embeddings through ensemble support vector machines and is evaluated on four gold standard corpora, can act as a new baseline for future work on gene-disease relation extraction from the literature.
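The F-measure used to report results in the abstract is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `f_measure` and the counts in the example are hypothetical and are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-beta score from raw counts (beta=1 gives the usual F1)."""
    # Precision: fraction of predicted relations that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true relations that were predicted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
print(round(f_measure(tp=80, fp=20, fn=20), 4))
```

With `beta=1` precision and recall are weighted equally, which is the usual convention when a paper reports a single "F-measure".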

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Schematic architecture of the gene-disease relation extraction system.**

**Fig 2. Extraction workflow of the supervised machine learning approach.**

**Fig 3. Feature representation of gene-disease relation extraction.**
a) The sentence, from the EU-ADR corpus (PMCID: PMC2605423), is tagged with both the LOXL1 gene and the exfoliation glaucoma disease. b) Word window representation of syntactic and semantic features. c) Tokens positioned at the left and right (n-gram) of the candidates (LOXL1 and exfoliation glaucoma). d) Locating the words between the entities for relational and trigger words. e) Phrasal feature from the relational word. f) Finding context-specific words using trigger word templates.
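The context features in panels c and d of Fig 3 (tokens to the left and right of the candidate pair, and the words between the two entities) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function `context_features`, the `TRIGGERS` list and the example sentence are all hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # token span [start, end) of an entity mention

# Hypothetical trigger-word list; the paper's actual list is not given here.
TRIGGERS = {"associated", "linked", "causes"}


def context_features(
    tokens: List[str], gene: Span, disease: Span, window: int = 2
) -> Dict[str, List[str]]:
    """Collect simple context features around a gene-disease candidate pair."""
    # Order the two mentions by position so that slicing works either way round.
    first, second = sorted([gene, disease])
    between = tokens[first[1]:second[0]]
    return {
        "left": tokens[max(0, first[0] - window):first[0]],
        "between": between,
        "right": tokens[second[1]:second[1] + window],
        "triggers": [t for t in between if t.lower() in TRIGGERS],
    }


# Made-up sentence for illustration; it is not quoted from the EU-ADR corpus.
tokens = (
    "Variants in LOXL1 are strongly associated with "
    "exfoliation glaucoma in patients"
).split()
feats = context_features(tokens, gene=(2, 3), disease=(7, 9))
print(feats)
```

Each list would typically be turned into sparse indicator features before being passed to a classifier; that step is omitted here.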

**Fig 4. Representation of the skip-gram (SG) model.**
The target word "gene" is at the input layer, and the learned contextual words, such as promoter, susceptibility and protein, are at the output layer. Adapted from [48].
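As a rough illustration of the skip-gram idea in Fig 4, the training data for such a model consists of (target, context) pairs drawn from a sliding window over the text. The sketch below only generates those pairs and does not train an embedding. The function `skipgram_pairs`, the window size and the toy sentence are assumptions made for illustration.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Toy token sequence for illustration only.
tokens = ["the", "gene", "promoter", "region"]
pairs = skipgram_pairs(tokens, window=1)
print([context for target, context in pairs if target == "gene"])
```

A skip-gram model is then trained to predict each context word from its target word, so that words appearing in similar contexts end up with similar vectors.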

**Fig 5. ROC with respect to FPR and TPR on four corpora upon 10-fold cross-validation.**
Panels a, b, c, and d show the receiver operating characteristic (ROC) curves for the EU-ADR, GAD, CoMAGC and PolySearch corpora, respectively.

**Fig 6. Performance evaluation of gene-disease relation extraction on four different corpora.**

**Fig 7. Performance comparison of gene-disease relation extraction on four different corpora.**
