Bioinformatics. 2016 Jan 1;32(1):106-13.
doi: 10.1093/bioinformatics/btv476. Epub 2015 Sep 3.

Large-scale extraction of gene interactions from full-text literature using DeepDive

Emily K Mallory et al. Bioinformatics.

Abstract

Motivation: A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles.

Methods: We built an extractor for gene-gene interactions that identified candidate gene-gene relations within an input sentence. For each candidate relation, DeepDive computed a probability that the relation was a correct interaction. We evaluated this system against the Database of Interacting Proteins and against randomly curated extractions.
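As a rough illustration of what "candidate gene-gene relations within an input sentence" can mean, the sketch below pairs every two distinct gene symbols that co-occur in a tokenized sentence and attaches two toy features. It is not the authors' extractor: the lexicon `GENE_SYMBOLS`, the helper names `candidate_pairs` and `simple_features`, and the feature set are all hypothetical, and the probability estimation performed by DeepDive is not reproduced here.

```python
from itertools import combinations

# Hypothetical gene lexicon (assumption); a real system would use a curated
# dictionary of gene symbols rather than this four-entry set.
GENE_SYMBOLS = {"BRCA1", "BARD1", "TP53", "MDM2"}


def candidate_pairs(tokens):
    """Return unordered pairs of distinct gene symbols found in one sentence."""
    genes = sorted({t for t in tokens if t in GENE_SYMBOLS})
    return list(combinations(genes, 2))


def simple_features(tokens, gene_a, gene_b):
    """Toy features (assumption): words between the first mentions, and their distance."""
    i, j = tokens.index(gene_a), tokens.index(gene_b)
    lo, hi = min(i, j), max(i, j)
    return {"between": tuple(tokens[lo + 1:hi]), "distance": hi - lo}


# Illustrative sentence, not taken from the corpus described above.
sentence = "BRCA1 binds BARD1 in vivo".split()
for gene_a, gene_b in candidate_pairs(sentence):
    print(gene_a, gene_b, simple_features(sentence, gene_a, gene_b))
```

Each candidate produced this way would then be scored by a downstream model; here the pairs are only enumerated.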

Results: Our system achieved 76% precision and 49% recall in extracting direct and indirect interactions involving gene symbols co-occurring in a sentence. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. Overall, our system extracted 3356 unique gene pairs using 724 features from over 100,000 full-text articles.
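For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how precision and recall could be computed for a set of extracted gene pairs against a reference set. The function name `precision_recall` and the toy pairs are assumptions for illustration only; they are not the paper's data or evaluation code. Pairs are treated as unordered, so (A, B) and (B, A) count once.

```python
def precision_recall(extracted, reference):
    """Compute precision and recall over unordered gene pairs."""
    extracted_set = {frozenset(pair) for pair in extracted}
    reference_set = {frozenset(pair) for pair in reference}
    true_positives = len(extracted_set & reference_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


# Toy inputs (hypothetical, not results from the study).
extracted = [("BRCA1", "BARD1"), ("BARD1", "BRCA1"), ("TP53", "MDM2"), ("TP53", "BRCA1")]
reference = [("BRCA1", "BARD1"), ("MDM2", "TP53"), ("ATM", "CHEK2"), ("EGFR", "GRB2")]

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these toy inputs, three unique pairs are extracted, two of them appear in the reference set of four, and the reported values follow from those counts.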

Availability and implementation: Application source code is publicly available at https://github.com/edoughty/deepdive_genegene_app

Contact: russ.altman@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
Gene–gene extraction pipeline. (A) We performed text pre-processing to parse documents into sentences and tokens and to construct dependency graphs between tokens in the sentences. These parsed data were stored in a sentences database. (B) The gene–gene extractor constructed candidate relations from the sentences and deposited them into a database. Each relation was composed of a pair of genes and features from the sentence. (C) DeepDive calculated the probability that each candidate relation was an interaction using inference rules based on the features. (D) We performed system tuning to identify and correct system errors. Furthermore, we applied a snowball technique in which correct relations were input as new training examples in the next system iteration.
Fig. 2.
High-level feature patterns. Boxes represent relevant patterns in the sentence for the feature. Light gray boxes indicate features for GeneA and black boxes indicate features for GeneB. Shared boxes are represented with a medium gray box. The feature applies to both genes if only a light gray box is present.
Fig. 3.
Histogram of probabilities assigned to gene–gene candidate relations.
Fig. 4.
Number of publications per year for genes appearing in high-probability interactions. Top genes are ordered by publication count in 2013.

