BMC Bioinformatics. 2009 Nov 24;10:385.
doi: 10.1186/1471-2105-10-385.

Automated vocabulary discovery for geo-parsing online epidemic intelligence

Mikaela Keller et al. BMC Bioinformatics.

Abstract

Background: Automated surveillance of the Internet provides a timely and sensitive method for alerting on global emerging infectious disease threats. HealthMap is part of a new generation of online systems designed to monitor and visualize, in real time, disease outbreak alerts as reported by online news media and public health sources. HealthMap is of particular interest to national and international public health organizations and to international travelers. One task that makes such surveillance useful is the automated discovery of the geographic references contained in the retrieved outbreak alerts, sometimes referred to as "geo-parsing". A typical approach to geo-parsing would demand an expensive training corpus of alerts manually tagged by a human.

Results: Given that human readers perform this kind of task by using both their lexical and contextual knowledge, we developed an approach that relies on a relatively small expert-built gazetteer, thus limiting the need for human input, but focuses on learning the context in which geographic references appear. We show in a set of experiments that this approach exhibits a substantial capacity to discover geographic locations outside of its initial lexicon.

Conclusion: The results of this analysis provide a framework for future automated global surveillance efforts that reduce manual input and improve timeliness of reporting.

Figures

Figure 1
Example of training data. An illustration of the word-withholding strategy.
Figure 2
Sparse representation of words. A word in the dictionary is represented by 3 indexes: one for the word itself, one for its part-of-speech in the text and one for its capitalization status. A word outside the dictionary is represented only by its part-of-speech and capitalization status.
Figure 3
Percentage of artificial out-of-vocabulary words. Percentage of "hidden" words when reducing the dictionary size according to the minimum frequency threshold λ. The first bar at each λ value shows the number of out-of-vocabulary words among the words of the corpus, and the second bar shows the number of location words outside the vocabulary among the words tagged as location references using the HealthMap gazetteer. The dictionary size corresponding to each λ is reported in brackets.
Figure 4
Percentage of natural out-of-vocabulary words. Percentage of unique words from a separate evaluation set (500 alerts, 11,184 unique words) that are inside or outside of the dictionary extracted from the training set, for training sets T0 (1,000 alerts), T1 (2,500 alerts) and T2 (5,000 alerts). The percentage of location words is computed with respect to the locations found by the commercial geo-parser (see sect.).
Figure 5
Evaluation with respect to HealthMap gazetteer tags. F1 score with respect to HealthMap gazetteer tags for several values of λ (red solid line). F1 scores among words with a visible lexical index (green dashed line) and among words with a hidden lexical index (blue dash-dotted line).
Figure 6
Evaluation with respect to MetaCarta tags. Precision, recall and F1 score with respect to MetaCarta labels for increasing dictionary cut-offs according to the λ threshold. Performance is shown for models trained on T0 (1,000 alerts), T1 (2,500 alerts), T2 (5,000 alerts) and T1 with location and disease targets.
Figure 7
Illustration of the neural network architecture. An illustration of the geo-parsing neural network with a typical input.
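
The word representation and dictionary cut-off described in the captions for Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the tag set `POS_TAGS`, the capitalization states `CAP_STATES`, the index layout, and the function names `build_dictionary` and `sparse_indexes` are all assumptions made for the example, and `min_freq` stands in for the threshold λ.

```python
from collections import Counter

# Assumed toy inventories; the tag set used in the paper is not given here.
POS_TAGS = ["NOUN", "PROPN", "VERB", "ADJ", "OTHER"]
CAP_STATES = ["lower", "capitalized", "upper"]


def build_dictionary(tokens, min_freq):
    """Keep only words seen at least min_freq times (min_freq plays the role of λ)."""
    counts = Counter(t.lower() for t in tokens)
    kept = sorted(w for w, c in counts.items() if c >= min_freq)
    return {w: i for i, w in enumerate(kept)}


def cap_state(token):
    """Classify a token's capitalization into one of CAP_STATES."""
    if len(token) > 1 and token.isupper():
        return "upper"
    if token[:1].isupper():
        return "capitalized"
    return "lower"


def sparse_indexes(token, pos, dictionary):
    """Return the active indexes of a token's sparse representation.

    Assumed layout: [word block | part-of-speech block | capitalization block].
    An out-of-dictionary word contributes no index in the word block.
    """
    n_words = len(dictionary)
    active = []
    word_idx = dictionary.get(token.lower())
    if word_idx is not None:
        active.append(word_idx)
    active.append(n_words + POS_TAGS.index(pos))
    active.append(n_words + len(POS_TAGS) + CAP_STATES.index(cap_state(token)))
    return active


corpus = "Cholera outbreak reported in Harare ; cholera cases rise in the capital".split()
vocab = build_dictionary(corpus, min_freq=2)      # {'cholera': 0, 'in': 1}
in_vocab = sparse_indexes("Cholera", "NOUN", vocab)   # [0, 2, 8]
out_vocab = sparse_indexes("Harare", "PROPN", vocab)  # [3, 8]
print(vocab, in_vocab, out_vocab)
```

Raising `min_freq` shrinks the dictionary, so more words, such as "Harare" above, are described only by their part-of-speech and capitalization, which is the situation in which a model has to rely on context rather than on the word itself.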
