Using binary classification to prioritize and curate articles for the Comparative Toxicogenomics Database

Dina Vishnyakova¹, Emilie Pasche, Patrick Ruch

PMID: 23221176
PMCID: PMC3514750
DOI: 10.1093/database/bas050

Dina Vishnyakova et al. Database (Oxford). 2012 Dec 5;2012:bas050. doi: 10.1093/database/bas050. Print 2012.

Affiliation

¹ Bibliomics and Text Mining Group, Geneva, Switzerland. dina.vishnyakova@hcuge.ch

Abstract

We report on an original integration of an automatic text categorization pipeline, ToxiCat (Toxicogenomic Categorizer), which we developed to classify and prioritize biomedical documents in order to speed up curation of the Comparative Toxicogenomics Database (CTD). The task can be described as binary classification, in which a scoring function is used to rank a selected set of articles. Components of a question-answering system are then used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi); a gene normalization service (NormaGene) developed for a previous BioCreative campaign; and a set of answering components and entity recognizers for diseases and chemicals. The main components of the pipeline are publicly available both as web applications and as web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.
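The ranking step described in the abstract can be illustrated with a minimal sketch: a linear SVM is trained on labeled articles, and its decision value is then used as the score by which unseen articles are sorted. This is not the ToxiCat implementation. The feature layout (counts of recognized genes, chemicals and diseases per abstract), the toy training data, the article identifiers and the plain subgradient-descent trainer are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(X: List[Vector], y: List[int], lam: float = 0.01,
                     lr: float = 0.1, epochs: int = 200) -> Tuple[List[float], float]:
    """Batch subgradient descent on the L2-regularized hinge loss.

    Labels must be +1 (relevant for curation) or -1 (not relevant).
    """
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only examples inside the margin contribute to the hinge term.
            if yi * (dot(w, xi) + b) < 1.0:
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rank_articles(w: Vector, b: float,
                  articles: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Sort articles by SVM decision value, highest score first."""
    scored = [(pmid, dot(w, x) + b) for pmid, x in articles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical features: [gene mentions, chemical mentions, disease mentions].
train_X = [[3, 2, 1], [2, 3, 2], [4, 1, 1],   # curated as relevant
           [0, 0, 1], [0, 1, 0], [1, 0, 0]]   # curated as not relevant
train_y = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(train_X, train_y)

# Hypothetical identifiers standing in for the PMIDs selected by the user.
candidates = {"PMID_A": [3, 2, 2], "PMID_B": [0, 0, 0], "PMID_C": [1, 1, 0]}
ranking = rank_articles(weights, bias, candidates)

for pmid, value in ranking:
    print(f"{pmid}\t{value:.3f}")
```

Using the raw decision value rather than the predicted class keeps the classifier binary while still producing a full ordering, which is what a curator needs when deciding which article to read first.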

Figures

**Figure 1**
Workflow of ToxiCat and dependencies with existing online services.

**Figure 2**
This is the starting point of the search and annotation process and also its end point, if the user decides to click on the final questions generated by the system (see Figure 5). Here, the user can select some PMIDs, which are then sent to ToxiCat (Figure 3) to be prioritized (Figure 4) and finally processed to generate an annotation (Figures 4 and 5).

**Figure 3**
In this figure, 2-acetylaminofluorene is provided as the input chemical compound, together with a list of articles (PMIDs) selected in Figure 2. Users can go directly to this page if the PMIDs have been obtained from other sources.

**Figure 4**
The three selected PMIDs are ranked according to the statistical estimate (Score) computed by the SVM binary classifier. For each PMID, each information extraction module (Gene, Chemical, Disease) provides a list of descriptors together with some metadata (journal name, title, etc.), which are used as features by the classifier.

**Figure 5**
The user can ask to see, within the abstract, the context of the annotation proposed in Figure 4. ToxiCat tags genes/proteins, chemicals and diseases in the abstract, providing a direct link to the CTD database for each of these entities. Finally, ToxiCat generates a set of questions (‘More…’) based on the entities that were extracted earlier. Optionally, the user can then return to EAGLi’s question-answering engine to obtain more information. The user can also obtain a list of Gene Ontology descriptors proposed by the GOCat Gene Ontology categorizer (http://eagl.unige.ch/GOCat/) based on the content of the PMID (see the last line of the table).

**Figure 6**
Aggregated scores of all participants in Track-I. ToxiCat is denoted under Team 120 (5).

References

1. Hirschman L, Yeh A, Blaschke C, et al. BioCreAtIvE I contest overview. BMC Bioinformatics. 2005;6(Suppl. 1):S1.
2. Ruch P. Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics. 2006;22:658–664.
3. Bauer MA, Berleant D. Usability survey of biomedical question answering systems. Human Genomics. 2012;6:17+.
4. Lu Z, Kao HY, Wei CH, et al. The gene normalization task in BioCreative III. BMC Bioinformatics. 2011;12(Suppl. 8):S2.
5. Wiegers T. Collaborative biocuration-text mining development task for document prioritization for curation. Proceedings of BioCreative 2012. 2012.
