Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction
- PMID: 15564295
- DOI: 10.1093/bioinformatics/bti165
Abstract
Motivation: Wnt signaling is a very active area of research, with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases that describe signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction had been identified as of late 2003. In this work we describe a natural language processing (NLP) system that identifies references to biological interaction networks in free text and automatically assembles a protein association and interaction map.
Results: A 'gold standard' set of names and assertions, including 53 interactions involved in Wnt signaling, was derived by manual scanning of the Wnt genes website (http://www.stanford.edu/~rnusse/wntwindow.html). The system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling, including 3369 PubMed and 1230 full-text papers. Names for key Wnt-pathway-associated proteins and biological entities were identified using a chi-squared analysis of noun phrases over-represented in the Wnt literature as compared to the general signal transduction literature. Interestingly, we identified several instances where generic terms were used on the website when more specific terms occur in the literature, and one typographical error on the Wnt canonical pathway. Using the named entity list and performing an exhaustive assertion extraction of the corpus, 34 of the 53 interactions in the 'gold standard' Wnt signaling set were successfully identified (64% recall). In addition, the automated extraction found several interactions involving key Wnt-related molecules that were missing from, or different from those in, the canonical diagram, and these were confirmed by manual review of the text. These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool for assisting human annotation and maintenance of signal pathway databases.
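As an illustration of the two quantitative steps mentioned above, the following is a minimal sketch, not the authors' pipeline. It assumes noun-phrase counts are already available for a target corpus and a background corpus, scores each phrase with a Pearson chi-squared statistic on a 2x2 contingency table (no continuity correction), and computes recall against a gold-standard set. The function names, the toy counts in the usage note, and the 3.84 cutoff (the 0.05 critical value for one degree of freedom) are assumptions chosen for the example; the abstract does not specify them.

```python
from collections import Counter


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared for the 2x2 table [[a, b], [c, d]].

    a: occurrences of the phrase in the target corpus
    b: occurrences of all other phrases in the target corpus
    c: occurrences of the phrase in the background corpus
    d: occurrences of all other phrases in the background corpus
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def overrepresented_phrases(target_counts, background_counts, threshold=3.84):
    """Return (phrase, score) pairs over-represented in the target corpus.

    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    target_total = sum(target_counts.values())
    background_total = sum(background_counts.values())
    scored = []
    for phrase, a in target_counts.items():
        c = background_counts.get(phrase, 0)
        b = target_total - a
        d = background_total - c
        # Skip phrases that are not relatively more frequent in the target.
        if a * d <= b * c:
            continue
        score = chi_squared_2x2(a, b, c, d)
        if score >= threshold:
            scored.append((phrase, score))
    # Highest-scoring phrases first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


def recall(found, gold_total):
    """Fraction of gold-standard items that were recovered."""
    return found / gold_total
```

With toy counts such as `Counter({"beta-catenin": 40, "protein": 60})` for the target corpus and `Counter({"beta-catenin": 5, "protein": 95})` for the background, only "beta-catenin" is returned, because "protein" is relatively more common in the background. Calling `recall(34, 53)` reproduces the reported figure of roughly 64%.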
Availability: The pipeline software components are freely available on request to the authors.
Contact: dstates@umich.edu
Supplementary information: http://stateslab.bioinformatics.med.umich.edu/software.html