Comparative Study
PLoS One. 2013 Apr 17;8(4):e58201.
doi: 10.1371/journal.pone.0058201. Print 2013.

Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database

Allan Peter Davis et al. PLoS One. 2013.

Abstract

The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,904 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
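
The abstract lists mean average precision among the evaluation parameters without defining it. As a minimal sketch, assuming the standard information-retrieval definition (the function names and the boolean-list input are illustrative and not taken from the paper), the metric could be computed as follows:

```python
from typing import Iterable, Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision for one ranked list.

    relevance[i] is True when the article at rank i + 1 was judged
    relevant (here: curated rather than rejected).
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings: Iterable[Sequence[bool]]) -> float:
    """Mean of the per-list average precision values."""
    scores = [average_precision(r) for r in rankings]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a ranking that places curated articles near the top scores closer to 1.0 than one that scatters them through the list.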

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. CTD text mining technical overview.
(1) A triaged corpus is retrieved for a target chemical-of-interest by querying PubMed. (2) Using the PMID, an article's title and abstract are mined for gene, chemical, disease, and action term recognition in CTD's integrated text-mining pipeline (red box). (3) Each text-mined term is first validated against CTD's controlled vocabularies and ignored if a match is not secured. The CTD text-mining pipeline process is run on a Red Hat Enterprise Linux 6.2 operating system using primarily Java 1.6 within the context of asynchronous batch processes. (4) PMIDs are then assigned a document relevancy score (DRS) by the text-mining tool and (5) sent to biocurators. (6) All interactions are composed and entered in CTD's web-based Curation Tool with the client running HTML 5, CSS3, JavaScript 1.85, and Ajax; a server processes the interactions and stores them in the Curation Database using Tomcat 6.0, Java 1.6, Servlet 2.5, JSP/JSTL, and Spring 3.0 framework.
Figure 2. Document workflow.
(1) Independent CTD-specific queries were made of PubMed to retrieve 14,904 articles for the seven heavy metals cadmium, cobalt, copper, lead, manganese, mercury, and nickel. (2) These articles were text mined and assigned a document relevancy score (DRS). (3) Of this preliminary corpus, 1,020 articles were found to have been previously reviewed in CTD and were used as a test set to evaluate the DRS and determine suitable cut-offs. (4) Articles with DRS ≥100 (high), DRS ≤20 (low), and a subset with DRS between 21–99 (medium) were combined to provide a final corpus of 3,583 documents, which was then (5) sent to five CTD biocurators (who were kept blind to the DRS of each article) for review. (6) Biocurators timed themselves while reviewing all articles and ultimately rejected 1,381 (as non-curatable for CTD) and curated 2,202 of them, (7) from which 41,208 chemical-gene-disease interactions were extracted.
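
The cut-offs in step (4) amount to a three-way split on the DRS. A minimal sketch of that split follows; the function name is hypothetical, and treating the DRS as a plain number is an assumption, since the caption gives only the thresholds:

```python
def drs_tier(drs: float) -> str:
    """Map a document relevancy score to the tier named in the caption."""
    if drs >= 100:
        return "high"
    if drs <= 20:
        return "low"
    return "medium"  # scores between 21 and 99
```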
Figure 3. Test set of previously reviewed articles validates assigned DRS.
A total of 1,020 articles are distributed by their text-mining assigned DRS (binned in 20-unit increments, x-axis) and are indicated as to whether they were found to have been either curated (green) or rejected (gray) by a CTD biocurator (as percent of articles in bin) at a previous time. The number of articles in each DRS bin (n) appears at the top of each column. There were no articles for the bins 280–299, 340–359, or 360–379.
Figure 4. Curation of heavy metal corpus validates assigned DRS.
Of the original 14,904 articles (boxes in top row, N), a representative set of 3,583 documents (second row, n) was assigned to CTD biocurators for curatorial review, including all articles (1,981) with a high DRS ≥100, all articles (723) with a low DRS ≤20, and the complete subset of the articles (879) with a medium DRS 21–99 for the heavy metal mercury. (The 1,020 previously reviewed articles were not included in the assigned set.) The articles are distributed by their text-mining assigned DRS (binned in 20-unit increments, x-axis) and are indicated as to whether they were either curated (green) or rejected (gray) by a CTD biocurator (as percent of articles in bin). There is a progressive decrease in the percentage of curated articles with DRS <100. In total, 1,685 of the 1,981 articles (85%) with a high DRS ≥100 were curatable, while only 111 of the 723 articles (15%) with a low DRS ≤20 could be curated.
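
The binning described in this caption can be sketched as below. This is an illustration only: the function name and the `(drs, was_curated)` input shape are hypothetical, and the bin edges (0–19, 20–39, and so on) are assumed from the bin labels quoted in the captions, with the DRS assumed to be a non-negative integer.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def percent_curated_by_bin(
    articles: Iterable[Tuple[int, bool]], bin_width: int = 20
) -> Dict[int, Tuple[int, float]]:
    """Group (drs, was_curated) pairs into fixed-width DRS bins.

    Returns {bin_start: (articles_in_bin, percent_curated)}.
    """
    counts = defaultdict(lambda: [0, 0])  # bin_start -> [n, curated]
    for drs, was_curated in articles:
        start = (drs // bin_width) * bin_width
        counts[start][0] += 1
        counts[start][1] += int(was_curated)
    return {
        start: (n, 100.0 * curated / n)
        for start, (n, curated) in sorted(counts.items())
    }


# The two summary percentages quoted in the caption follow from its counts.
high_percent = round(100 * 1685 / 1981)
low_percent = round(100 * 111 / 723)
```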
Figure 5. DRS reflects the number of interactions per curated article.
Biocurators extracted 41,208 interactions from 2,202 curated articles (top row, c). The average number of interactions per curated article (log-scale, y-axis) is distributed by the assigned DRS (binned in 20-unit increments, x-axis), with the number of curated articles (c) in each bin indicated at the top. The average number of interactions per curated article increases with the DRS. The aberrant spike in bin 240–259 is due to a single article (amongst a total of nine curated documents in the bin) from which 5,977 interactions were curated from a microarray experiment.
Figure 6. DRS effectively ranks articles for relevance.
The 3,583 text-mined articles were ranked via (A) each article's PubMed identification number (PMID) in descending order and via (B) the text-mining assigned DRS, with articles grouped into progressive quartiles (Q1–Q4), each containing approximately 896 documents. The articles were reviewed by CTD biocurators who determined that 2,202 of the articles contained relevant data (curated, green bars) while 1,381 of them did not (rejected, gray bars). The percent of total curated papers vs. rejected papers for each unique quartile is shown.
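
The quartile comparison can be sketched as follows. The dictionary keys (`pmid`, `drs`, `curated`) and the function name are hypothetical, and ranking by DRS in descending order is an assumption; the caption states the sort direction only for PMID.

```python
from typing import Callable, List, Optional, Sequence


def curated_percent_by_quartile(
    articles: Sequence[dict], key: Callable[[dict], float]
) -> List[Optional[float]]:
    """Rank articles by `key` (highest first), split the ranking into four
    quartiles, and return the percent of curated articles in each one.

    A quartile with no articles (fewer than four inputs) yields None.
    """
    ranked = sorted(articles, key=key, reverse=True)
    n = len(ranked)
    bounds = [(i * n) // 4 for i in range(5)]
    result = []
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = ranked[lo:hi]
        if not chunk:
            result.append(None)
            continue
        curated = sum(1 for a in chunk if a["curated"])
        result.append(100.0 * curated / len(chunk))
    return result
```

Calling the function once with `key=lambda a: a["pmid"]` and once with `key=lambda a: a["drs"]` gives the two rankings that the figure compares.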
Figure 7. DRS effectively ranks articles for data content.
A total of 38,118 novel interactions are distributed into progressive quartiles (Q1–Q4) based upon either DRS ranking (blue) or PMID ranking (orange) for three different types of interactions: (A) 35,385 novel chemical-gene (C–G) interactions, (B) 1,549 novel chemical-disease (C–D) interactions, and (C) 1,184 novel gene-disease (G–D) interactions.
Figure 8. DRS effectively ranks articles for productivity.
(A) The number of total interactions (both novel and repeated) for each quartile is divided by (B) the time spent on curating them to produce (C) an averaged interaction yield rate (interactions per minute) for each quartile.
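
The yield rate in panel (C) is the quotient of panels (A) and (B). A minimal sketch, with a hypothetical function name and made-up inputs that are not values from the paper:

```python
def interaction_yield_rate(total_interactions: int, minutes_spent: float) -> float:
    """Interactions curated per minute of curation time."""
    if minutes_spent <= 0:
        raise ValueError("minutes_spent must be positive")
    return total_interactions / minutes_spent


# Hypothetical per-quartile inputs: (total interactions, minutes spent).
example = {"Q1": (1200, 400.0), "Q4": (150, 300.0)}
rates = {q: interaction_yield_rate(i, m) for q, (i, m) in example.items()}
```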
Figure 9. Disease category distribution for the seven heavy metals.
The number of diseases curated for each metal is indicated for cadmium (Cd), cobalt (Co), copper (Cu), lead (Pb), manganese (Mn), mercury (Hg), and nickel (Ni). These specific disorders were then mapped and distributed across 21 generic disease categories (legend at top) using CTD's MEDIC-Slim disease mappings to look for overrepresented disease classes for each individual heavy metal. For example, of the 70 specific diseases associated with copper (Cu), 23 of them (33%) are nervous system disorders and 12 of them (17%) are cardiovascular disorders.
