Five-way smoking status classification using text hot-spot identification and error-correcting output codes

Aaron M Cohen¹

Affiliation

¹ Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, 3181 S.W. Sam Jackson Park Road, Mail Code: BICC, Portland, OR, 97239-3098, USA. cohenaa@ohsu.edu

PMID: 17947623
PMCID: PMC2274879
DOI: 10.1197/jamia.M2434

Aaron M Cohen. J Am Med Inform Assoc. 2008 Jan-Feb;15(1):32-5. doi: 10.1197/jamia.M2434. Epub 2007 Oct 18.

Abstract

We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied: hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches with the same methods as the i2b2 task organizers, taking micro- and macro-averaged F1 as the primary performance metrics. Our best-performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best-performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error-correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.
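The abstract names error-correcting output codes without describing how they work. The following is a minimal sketch of the general idea only, not the author's implementation: each of the five classes gets a binary codeword, one binary classifier is trained per codeword column, and a document is assigned to the class whose codeword is closest in Hamming distance to the predicted bits. The codebook here is the generic "exhaustive" construction for k classes; the label names, function names, and the simulated classifier outputs are assumptions introduced for illustration, since the abstract does not specify them.

```python
from itertools import combinations

# Placeholder names for the five classes (assumed; not given in the abstract).
LABELS = ["PAST SMOKER", "CURRENT SMOKER", "SMOKER", "NON-SMOKER", "UNKNOWN"]


def exhaustive_codebook(k):
    """Build an exhaustive ECOC codebook for k classes.

    Returns k rows of 2**(k-1) - 1 bits. Row 0 is all ones; the remaining
    rows are the bits (most significant first) of the column index.
    """
    n_cols = 2 ** (k - 1) - 1
    rows = [[1] * n_cols]
    for bit in range(k - 2, -1, -1):
        rows.append([(col >> bit) & 1 for col in range(n_cols)])
    return rows


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook):
    """Smallest pairwise Hamming distance between codewords."""
    return min(hamming(a, b) for a, b in combinations(codebook, 2))


def decode(bits, codebook, labels):
    """Return the label whose codeword is nearest to the predicted bits."""
    best = min(range(len(codebook)), key=lambda i: hamming(bits, codebook[i]))
    return labels[best]


if __name__ == "__main__":
    codebook = exhaustive_codebook(len(LABELS))
    # Simulate per-column binary classifier outputs for class index 2,
    # with three of the fifteen column classifiers making a mistake.
    predicted = list(codebook[2])
    for col in (0, 7, 14):
        predicted[col] ^= 1
    print(min_distance(codebook), decode(predicted, codebook, LABELS))
```

For five classes this construction yields 15-bit codewords with a minimum pairwise distance of 8, so up to three wrong column classifiers can still be corrected at decoding time. That redundancy is the motivation for using output codes instead of a single five-way decision.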
