J Am Med Inform Assoc. 2011 Sep-Oct;18(5):631-8.
doi: 10.1136/amiajnl-2010-000022. Epub 2011 Jun 27.

Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection

Taxiarchis Botsis et al. J Am Med Inform Assoc. 2011 Sep-Oct.

Abstract

Objective: The US Vaccine Adverse Event Reporting System (VAERS) collects spontaneous reports of adverse events following vaccination. Medical officers review the reports and often apply standardized case definitions, such as those developed by the Brighton Collaboration. Our objective was to demonstrate a multi-level text mining approach for automated text classification of VAERS reports that could potentially reduce human workload.

Design: We selected 6034 VAERS reports for H1N1 vaccine that were classified by medical officers as potentially positive (N(pos)=237) or negative for anaphylaxis. We created a categorized corpus of text files that included the class label and the symptom text field of each report. A validation set of 1100 labeled text files was also used. Text mining techniques were applied to extract three feature sets for important keywords, low- and high-level patterns. A rule-based classifier processed the high-level feature representation, while several machine learning classifiers were trained for the remaining two feature representations.
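The counts above imply a strongly imbalanced corpus: 237 potentially positive reports out of 6034, leaving 5797 negatives. The sketch below shows one common way to derive per-class weights for a weighted classifier, namely inverse class frequency. Both the heuristic and the function name `balanced_class_weights` are assumptions made for illustration; the abstract does not say how the study weighted its classifiers.

```python
def balanced_class_weights(class_counts):
    """Return inverse-frequency weights: n_samples / (n_classes * count).

    Assumption: this heuristic is illustrative only and is not taken
    from the study.
    """
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts from the abstract: 6034 reports, 237 potentially positive.
counts = {"potentially_positive": 237, "negative": 6034 - 237}
weights = balanced_class_weights(counts)
```

Under this heuristic the rare class receives the larger weight, so an error on a potentially positive report costs more during training than an error on a negative one.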

Measurements: Classifier performance was evaluated using macro-averaged recall, precision, and F-measure, together with Friedman's test; misclassification error rates were also analyzed.
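As a minimal sketch of macro-averaging, the code below computes recall, precision, and F-measure separately for each class and then takes the unweighted mean across classes, so the rare positive class counts as much as the large negative class. The function names and the toy labels are hypothetical. Averaging the per-class F-measures is one convention; the abstract does not specify which variant the study used.

```python
def per_class_counts(y_true, y_pred, label):
    """Count true positives, false positives, false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class recall, precision, and F-measure."""
    recalls, precisions, f_measures = [], [], []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f_measures.append(f_measure)
    n = len(labels)
    return {
        "macro_recall": sum(recalls) / n,
        "macro_precision": sum(precisions) / n,
        "macro_f": sum(f_measures) / n,
    }


# Toy labels for illustration only (1 = potentially positive, 0 = negative).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
scores = macro_scores(y_true, y_pred)
```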

Results: The rule-based classifier, boosted trees, and weighted support vector machines performed well in terms of macro-recall, but at the expense of a higher mean misclassification error rate. The rule-based classifier performed very well in terms of average sensitivity and specificity (79.05% and 94.80%, respectively).
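For reference, sensitivity, specificity, and the misclassification error rate all follow from the four cells of a binary confusion matrix. The sketch below uses made-up counts, not the study's data, and the function name `binary_rates` is hypothetical.

```python
def binary_rates(tp, fn, tn, fp):
    """Derive standard rates from binary confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        # Share of truly positive reports that were flagged.
        "sensitivity": tp / (tp + fn),
        # Share of truly negative reports that were cleared.
        "specificity": tn / (tn + fp),
        # Share of all reports that were assigned the wrong class.
        "misclassification_error_rate": (fp + fn) / total,
    }


# Illustrative counts only; these are not results from the study.
rates = binary_rates(tp=8, fn=2, tn=90, fp=10)
```

In the two-class case, macro-recall is the mean of sensitivity and specificity, which is why a classifier can score well on macro-recall while still carrying a noticeable overall error rate.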

Conclusion: Our validated results showed the possibility of developing effective medical text classifiers for VAERS reports by combining text mining with informative feature selection; this strategy has the potential to reduce reviewer workload considerably.

Conflict of interest statement

Competing interests: None.

Figures

Figure 1
Initially, medical officers use specific MedDRA preferred terms (PT) or other keywords to extract the Vaccine Adverse Event Reporting System (VAERS) case reports of interest (usually a few thousand). Manual review requires two steps: (i) review of each case report (mainly the symptom and laboratory text fields) and (ii) review of the medical record for a much smaller portion of case reports. For example, in the case of anaphylaxis, which was investigated in the current study, the PT and keyword search returned 6034 case reports, which were reduced to 237 after manual review; the medical records (MR) for these 237 reports were then obtained and reviewed, resulting in 100 confirmed anaphylaxis cases.
Figure 2
An example of text mining processes for a case report for anaphylaxis using either the dictionary (left branch of the diagram) or the lexicon and the grammar rules (right branch of the diagram). The output of each case report is a vector of lemmas (type I vector), a vector of low-level patterns (type II vector), or a set of high-level patterns. The two types of vectors are extended by one position to include the class label for the report. The rule-based classifier classifies this report as potentially positive based on the identification of a high-level pattern (‘class’=1, ie, potentially positive). GI, gastrointestinal; MCDV, major cardiovascular; MDERM, major dermatologic; mDERM, minor dermatologic; MRESP, major respiratory; mRESP, minor respiratory.
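The type I representation described in this legend can be sketched as a binary vector over a dictionary, extended by one position for the class label. In the sketch below, the dictionary entries, the function name `type_i_vector`, and the sample sentence are all hypothetical. Exact lowercase token matching stands in for real lemmatization, which the study's pipeline would perform.

```python
import re

# Hypothetical dictionary; the study's actual dictionary is not given here.
DICTIONARY = ["anaphylaxis", "hives", "wheezing", "swelling", "rash"]


def type_i_vector(symptom_text, class_label, dictionary=DICTIONARY):
    """Binary presence vector over the dictionary plus a trailing class label.

    Assumption: simple token matching replaces lemmatization.
    """
    tokens = set(re.findall(r"[a-z]+", symptom_text.lower()))
    features = [1 if term in tokens else 0 for term in dictionary]
    # Extend the vector by one position to carry the class label.
    return features + [class_label]


vector = type_i_vector(
    "Patient developed hives and wheezing within minutes.", class_label=1
)
```

The last element is the label (1 for potentially positive), so each row can be passed directly to a supervised learner that expects the features and the target in one record.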
