PLoS One. 2012;7(4):e33427.
doi: 10.1371/journal.pone.0033427. Epub 2012 Apr 12.

Text mining for literature review and knowledge discovery in cancer risk assessment and research

Anna Korhonen et al. PLoS One. 2012.

Abstract

Research in biomedical text mining is starting to produce technology that can make information in the biomedical literature more accessible to bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB, a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. Because the data cover human, animal, cellular and other mechanistic evidence from various fields of biomedicine, they are highly varied and therefore difficult to harvest manually from literature databases. Our tool automates the process by extracting relevant scientific data from published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows users to navigate the classified dataset in various ways and to share the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.
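
As a rough illustration of the two ideas the abstract relies on, assigning abstracts to taxonomy labels and scoring the result with precision, recall and F-measure, the following Python sketch may help. It is a toy under stated assumptions, not the CRAB classifier: the keyword lists, the label names and the function names `classify` and `precision_recall_f` are hypothetical, and simple keyword matching stands in for whatever classification method the tool actually uses.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical keyword lists for two illustrative labels.
# These are NOT the CRAB taxonomy keywords; they are placeholders.
TAXONOMY_KEYWORDS: Dict[str, List[str]] = {
    "genotoxic": ["dna adduct", "mutation", "strand break"],
    "nongenotoxic": ["oxidative stress", "cell proliferation"],
}


def classify(abstract: str) -> Set[str]:
    """Return every label whose keyword list matches the abstract text."""
    text = abstract.lower()
    return {
        label
        for label, keywords in TAXONOMY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def precision_recall_f(
    gold: List[Set[str]], predicted: List[Set[str]], label: str
) -> Tuple[float, float, float]:
    """Compute precision, recall and F-measure (F1) for a single label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if label in g and label in p)
    fp = sum(1 for g, p in pairs if label not in g and label in p)
    fn = sum(1 for g, p in pairs if label in g and label not in p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Invented example abstracts with hand-assigned (hypothetical) gold labels.
SAMPLE_ABSTRACTS = [
    "Exposure produced DNA adducts and mutations in liver cells.",
    "The compound induced oxidative stress and cell proliferation.",
    "No mutation was observed; effects were attributed to oxidative stress.",
]
SAMPLE_GOLD = [{"genotoxic"}, {"nongenotoxic"}, {"nongenotoxic"}]

if __name__ == "__main__":
    predictions = [classify(a) for a in SAMPLE_ABSTRACTS]
    for label in TAXONOMY_KEYWORDS:
        p, r, f = precision_recall_f(SAMPLE_GOLD, predictions, label)
        print(f"{label}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

The third sample abstract shows why plain keyword matching is only a stand-in: the negated phrase "No mutation was observed" still triggers the genotoxic label, which lowers precision for that label.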

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The Scientific Evidence for Carcinogenic Activity taxonomy branch.
Figure 2. The Mode of Action taxonomy branch.
Figure 3. Example keywords for the Scientific Evidence for Carcinogenic Activity taxonomy.
Figure 4. Example keywords for the Mode of Action taxonomy.
Figure 5. Classification results: number of abstracts and distinct keyword annotations for each label; number of abstracts classified as positive by the system; Precision, Recall and F-measure.
Figure 6. An overview of the CRAB text mining tool.
Figure 7. Illustration of the user interface.
Figure 8. Distribution of classified abstracts over the Scientific Evidence for Carcinogenic Activity taxonomy for two chemicals, benzo[a]pyrene and dibenzo[a,l]pyrene.
Figure 9. Genotoxic Mode of Action: distribution of classified abstracts for three chemicals: 1,3-butadiene, genistein and formaldehyde.
Figure 10. Non-genotoxic Mode of Action: distribution of classified abstracts for three chemicals: 1,3-butadiene, genistein and formaldehyde.
Figure 11. Comparison of four known genotoxic (left) and four known nongenotoxic (right) chemicals. (b–c) show the distribution in the genotoxic MOA part, (d–e) show the distribution in the nongenotoxic MOA part. The genotoxic chemicals are 1,3-butadiene, 4-aminobiphenyl, dibenzo[a,l]pyrene and ethylene oxide; the nongenotoxic chemicals are TCDD, PCB126, PCB153 and pentachlorodibenzofuran. * indicates statistically significant differences (formula image, Wilcoxon rank sum test).
Figure 12. Distribution of classified abstracts over the two main MOA classes, genotoxic and nongenotoxic, for 9 antifungal chemicals used as pesticides.
Figure 13. Distribution of classified triazole abstracts over some selected MOA nodes.
