Comparative Study

BMC Bioinformatics. 2009 Oct 8:10:326.

Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (CTD)

Thomas C Wiegers¹, Allan Peter Davis, K Bretonnel Cohen, Lynette Hirschman, Carolyn J Mattingly

PMID: 19814812
PMCID: PMC2768719
DOI: 10.1186/1471-2105-10-326

Thomas C Wiegers et al. BMC Bioinformatics. 2009.

Affiliation

¹ Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, ME, USA. twiegers@mdibl.org

Abstract

Background: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage.

Results: Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking).
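The two ranking metrics named above can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions used in the study, so the standard binary-relevance form of average precision and a plain Pearson coefficient are assumed here; the function names and the toy inputs are illustrative only.

```python
from math import sqrt

def average_precision(relevant_flags):
    """Average precision for one ranked list.

    relevant_flags[i] is True when the document at rank i+1 is relevant
    (here: judged curatable). Assumed standard definition, not taken
    from the paper.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of average precision over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

To correlate rank with the number of curatable interactions per document, one sequence would hold a rank-derived score and the other the interaction counts. How rank is scored, and therefore the sign of the coefficient, is an assumption the abstract does not settle.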

Conclusion: This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.

Figures

**Figure 1**
**CTD curated data relationships**. Biocurators capture three types of data relationships from the literature using controlled vocabularies, including chemical-gene interactions, and chemical-disease and gene/protein-disease relationships. These three relationships generate a chemical-gene/protein-disease triad that enables users to infer novel connections between all three actors.

**Figure 2**
**Documentation of curated data**. a) Currently curated data are captured using controlled vocabularies in Excel spreadsheets that include: Curator ID, date of curation, PubMed identification number, interaction (designated using a CTD coding schema), species in which the interaction was observed, interacting chemical, interacting gene/protein, associated diseases (not shown) and author contact information for follow-up purposes (not shown). b) Codes used to capture interactions are translated into readable sentences for the public web application.

**Figure 3**
**Rules-based ranking of articles enhances yield of curated data**. When ranked using the rules-based application vs. PubMed ordering (control case), the top 10% of articles would result in an increased yield of curated data; specifically 426 more chemical-gene interactions, comprising 82 additional genes, 81 additional chemicals and 5 more diseases.

**Figure 4**
**Text mining improves the ranking of journal articles for curation**. A test set of 354 articles slated for curation was first ranked by two different methods: (a) by each article's PubMed identification number in descending order (which typically orders papers from newest to oldest publication date) and (b) by the rank order determined by our rule-based text-mining application. The articles were then reviewed by a biocurator, who determined that 167 of the papers contained relevant data (curated, black bars) while 187 did not (rejected, white bars). For presentation, the 354 articles are grouped into progressive quartiles (1st, 2nd, 3rd, and 4th), each containing 89 papers. The percentages of all curated papers (167) and all rejected papers (187) falling in each quartile are shown. The text-mining tool (b) effectively ranked the more relevant papers into the first and second quartiles and the less relevant papers into the third and fourth quartiles, compared with the less informed criterion of PubMed identification number (a).
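The quartile breakdown described in this caption can be sketched as follows. This is a minimal illustration, assuming the ranked list is available as one boolean per article (True for curated, False for rejected); the function name and the `group_size` parameter are hypothetical, and the example input is a made-up perfect ranking, not the study's data.

```python
def quartile_distribution(curated_flags, group_size=89):
    """Split a ranked list into consecutive groups and report, per group,
    the percent of all curated and of all rejected articles it contains.

    curated_flags: list of bool in rank order (True = curated).
    Returns a list of (pct_of_curated, pct_of_rejected) tuples.
    """
    total_curated = sum(curated_flags)
    total_rejected = len(curated_flags) - total_curated
    rows = []
    for start in range(0, len(curated_flags), group_size):
        chunk = curated_flags[start:start + group_size]
        curated = sum(chunk)
        rejected = len(chunk) - curated
        pct_curated = 100.0 * curated / total_curated if total_curated else 0.0
        pct_rejected = 100.0 * rejected / total_rejected if total_rejected else 0.0
        rows.append((pct_curated, pct_rejected))
    return rows

# Hypothetical perfect ranking: all 167 curated articles ahead of all 187 rejected.
example = [True] * 167 + [False] * 187
for i, (pc, pr) in enumerate(quartile_distribution(example), start=1):
    print(f"quartile {i}: {pc:.1f}% of curated, {pr:.1f}% of rejected")
```

With 354 articles and groups of 89, the last group holds the remaining 87 articles.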

**Figure 5**
**Future CTD manual curation workflow**. Articles will continue to be identified for curation using PubMed and chemical terms of interest. Articles will be text mined using chemical (OSCAR 3), gene (ABNER) and disease (MetaMap) identifiers as described. Actors identified by text mining will be matched against vocabularies in CTD and journal articles without matches will be removed. Remaining journal articles will be ranked and loaded into the CTD curation database. Biocurators will curate or reject journal articles using an online application tool that is integrated with the CTD curation and production databases. Curated data will be approved and loaded into the CTD production database.
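The match-filter-rank portion of this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the terms are taken as already extracted (no OSCAR 3, ABNER, or MetaMap calls are shown), the vocabulary is a plain set of lowercase strings, and ranking by the number of distinct matched terms is a placeholder, since the caption does not specify the ranking rules. All names are hypothetical.

```python
def filter_and_rank(articles, vocabulary):
    """Drop articles with no vocabulary match and rank the rest.

    articles: dict mapping an integer PMID to the terms mined from that article.
    vocabulary: set of lowercase controlled-vocabulary terms.
    Returns a list of (pmid, match_count), best first.
    """
    scored = []
    for pmid, mentions in articles.items():
        matches = {m.lower() for m in mentions} & vocabulary
        if matches:  # articles without any match are removed
            scored.append((pmid, len(matches)))
    # Placeholder rule: more distinct matches first, ties broken by newer PMID.
    scored.sort(key=lambda item: (-item[1], -item[0]))
    return scored

# Toy input; PMIDs and terms are invented for illustration.
mined = {
    101: ["Arsenic", "TP53"],
    102: ["unrecognized term"],
    103: ["arsenic"],
}
print(filter_and_rank(mined, {"arsenic", "tp53"}))
```

In practice the vocabulary lookup would need synonym handling and normalization beyond lowercasing, which this sketch leaves out.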
