Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT

E Kretschmann¹, W Fleischmann, R Apweiler

PMID: 11673236
DOI: 10.1093/bioinformatics/17.10.920

Bioinformatics. 2001 Oct;17(10):920-6.

Affiliation

¹ The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. kretsch@ebi.ac.uk

Abstract

Motivation: The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, cannot keep up with the ever-increasing quantity of submitted data. Automated supplements to manually curated databases, such as TrEMBL or GenPept, cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information, and help to detect inconsistencies in manually generated annotations.

Results: A standard data mining algorithm (C4.5) was successfully applied to gain knowledge about the Keyword annotation in SWISS-PROT. In total, 11 306 rules were generated; they are provided in a database, can be applied to as yet unannotated protein sequences, and can be viewed using a web browser. The rules rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. Statistical evaluation of the generated rules by cross-validation suggests that, when applied to arbitrary proteins, they can generate 33% of the keyword annotation with an error rate of 1.5%. The coverage of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
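
The way such rules might be applied, and how a confidence cut-off trades coverage against error, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Rule` and `Protein` types, the `predict_keywords` function, the signature identifiers, the keywords and the confidence values are hypothetical and are not taken from the published rule database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if all conditions hold, assign `keyword`."""
    taxon: str             # taxonomic group the source organism must belong to
    signatures: frozenset  # signature IDs that must all match the sequence
    keyword: str           # keyword assigned when the rule fires
    confidence: float      # assumed precision estimate, e.g. from cross-validation


@dataclass(frozen=True)
class Protein:
    lineage: tuple         # taxonomic lineage of the source organism
    signatures: frozenset  # signature IDs matched by the sequence


def predict_keywords(protein, rules, min_confidence):
    """Return keywords from every rule that fires and meets the confidence cut-off."""
    return {
        rule.keyword
        for rule in rules
        if rule.confidence >= min_confidence
        and rule.taxon in protein.lineage
        and rule.signatures <= protein.signatures  # all required signatures matched
    }


# Illustrative rules only; identifiers and numbers are made up.
RULES = [
    Rule("Eukaryota", frozenset({"SIG_A"}), "Kinase", 0.99),
    Rule("Bacteria", frozenset({"SIG_B"}), "Transport", 0.97),
    Rule("Eukaryota", frozenset({"SIG_A", "SIG_C"}), "Transmembrane", 0.90),
]

protein = Protein(("Eukaryota", "Metazoa"), frozenset({"SIG_A", "SIG_C"}))

strict = predict_keywords(protein, RULES, min_confidence=0.95)
relaxed = predict_keywords(protein, RULES, min_confidence=0.85)

print(sorted(strict))   # fewer keywords, lower expected error
print(sorted(relaxed))  # more keywords, higher expected error
```

Lowering `min_confidence` lets less reliable rules fire, so more keywords are produced at the cost of more wrong assignments. This mirrors the kind of trade-off reported above (33% coverage at 1.5% error versus 60% at 5%), although the sketch does not reproduce how those figures were obtained.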

Availability: The results of the automatic data mining process can be browsed at http://golgi.ebi.ac.uk:8080/Spearmint/. Source code is available upon request.

Contact: kretsch@ebi.ac.uk.
