Bioinformatics. 2016 Jan 1;32(1):106-13.

doi: 10.1093/bioinformatics/btv476. Epub 2015 Sep 3.

Large-scale extraction of gene interactions from full-text literature using DeepDive

Emily K Mallory¹, Ce Zhang², Christopher Ré³, Russ B Altman⁴

Affiliations

¹ Biomedical Informatics Training Program, Stanford University, Stanford, CA 94305, USA.
² Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA.
³ Department of Computer Science.
⁴ Department of Bioengineering, Department of Genetics and Department of Medicine, Stanford University, Stanford, CA 94305, USA.

PMID: 26338771
PMCID: PMC4681986
DOI: 10.1093/bioinformatics/btv476

Emily K Mallory et al. Bioinformatics. 2016.

Abstract

Motivation: A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles.

Methods: We built an extractor for gene-gene interactions that identified candidate gene-gene relations within an input sentence. For each candidate relation, DeepDive computed a probability that the relation was a correct interaction. We evaluated this system against the Database of Interacting Proteins and against randomly curated extractions.
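The candidate-generation step described above can be illustrated with a minimal sketch. This is not the authors' extractor: the gene lexicon, the function name `candidate_relations` and the single "words between" feature are all assumptions made for illustration. The real system used parsed sentences and a much richer feature set, and DeepDive then assigned a probability to each candidate.

```python
from itertools import combinations

# Hypothetical gene lexicon; a real system would use a curated symbol list.
GENE_SYMBOLS = {"BRCA1", "TP53", "MDM2", "EGFR"}


def candidate_relations(tokens, lexicon=GENE_SYMBOLS):
    """Build one candidate per pair of distinct gene mentions in a sentence.

    Each candidate carries a toy feature: the tokens between the two mentions.
    """
    # Record the position of every token that matches a known gene symbol.
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in lexicon]
    candidates = []
    for (i, gene_a), (j, gene_b) in combinations(mentions, 2):
        if gene_a == gene_b:
            continue  # skip a gene paired with a second mention of itself
        candidates.append(
            {"gene_a": gene_a, "gene_b": gene_b, "between": tokens[i + 1 : j]}
        )
    return candidates


sentence = "MDM2 binds TP53 and inhibits its activity".split()
print(candidate_relations(sentence))
```

In this sketch a downstream model would consume the `between` tokens (and other features) to score each candidate, which is the role DeepDive plays in the pipeline.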

Results: Our system achieved 76% precision and 49% recall in extracting direct and indirect interactions involving gene symbols co-occurring in a sentence. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. Overall, our system extracted 3356 unique gene pairs using 724 features from over 100,000 full-text articles.
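For readers less familiar with the metrics, precision and recall over extracted gene pairs can be computed as in the sketch below. The function name and the toy pairs are assumptions for illustration only; they are not data from the study, and this is not the authors' evaluation code.

```python
def precision_recall(predicted, reference):
    """Compute precision and recall over unordered gene pairs.

    Pairs are treated as unordered, so ("A", "B") equals ("B", "A").
    """
    predicted_set = {frozenset(pair) for pair in predicted}
    reference_set = {frozenset(pair) for pair in reference}
    true_positives = len(predicted_set & reference_set)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


# Toy example (hypothetical pairs, not results from the paper).
predicted = [("MDM2", "TP53"), ("EGFR", "GRB2"), ("BRCA1", "TP53")]
reference = [("TP53", "MDM2"), ("GRB2", "EGFR"), ("BRCA1", "BARD1"), ("AKT1", "MTOR")]
precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the fraction of extracted pairs that appear in the reference set; recall is the fraction of reference pairs that the system recovered.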

Availability and implementation: Application source code is publicly available at https://github.com/edoughty/deepdive_genegene_app

Contact: russ.altman@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Gene–gene extraction pipeline. (A) We performed text pre-processing to parse documents into sentences and tokens and to construct dependency graphs between tokens in the sentences. These parsed data were stored in a sentences database. (B) The gene–gene extractor constructed candidate relations from the sentences and deposited them into a database. Each relation was composed of a pair of genes and features from the sentence. (C) DeepDive calculated the probability that each candidate relation was an interaction using inference rules based on the features. (D) We performed system tuning to identify and correct system errors. Furthermore, we applied a snowball technique in which correct relations were input as new training examples in the next system iteration.

**Fig. 2.**
High-level feature patterns. Boxes represent relevant patterns in the sentence for the feature. Light gray boxes indicate features for GeneA and black boxes indicate features for GeneB. Shared boxes are represented with a medium gray box. The feature applies to both genes if only a light gray box is present.

**Fig. 3.**
Histogram of probabilities assigned to gene–gene candidate relations.

**Fig. 4.**
Number of publications per year for genes appearing in high-probability interactions. Top genes are ordered by publication count in 2013.
