J Am Stat Assoc. 2012;107(500):1410-1426.
doi: 10.1080/01621459.2012.713876. Epub 2012 Dec 21.

Tracking Epidemics With Google Flu Trends Data and a State-Space SEIR Model

Vanja Dukic et al. J Am Stat Assoc. 2012.

Abstract

In this article, we use Google Flu Trends data together with a sequential surveillance model based on state-space methodology to track the evolution of an epidemic process over time. We embed a classical mathematical epidemiology model [a susceptible-exposed-infected-recovered (SEIR) model] within the state-space framework, thereby extending the SEIR dynamics to allow changes through time. The implementation of this model is based on a particle filtering algorithm, which learns about the epidemic process sequentially through time and provides updated estimated odds of a pandemic with each new surveillance data point. We show how our approach, in combination with sequential Bayes factors, can serve as an online diagnostic tool for influenza pandemic. We take a close look at the Google Flu Trends data describing the spread of flu in the United States during 2003-2009 and in nine separate U.S. states chosen to represent a wide range of health care and emergency system strengths and weaknesses. This article has online supplementary materials.

Keywords: Flu; Google correlate; Google insights; Google searches; Google trends; H1N1; IP surveillance; Infectious Diseases; Influenza; Nowcasting; Online surveillance; Particle filtering.
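
To make the SEIR dynamics mentioned in the abstract concrete, here is a minimal sketch of a textbook SEIR system without births or deaths, integrated with a fourth-order Runge-Kutta scheme in a population of size 100 (the size mentioned in the Figure 1 caption). The article's Equation (1) is not reproduced on this page, so the exact form below is an assumption, and the rates beta, alpha, and gamma, the initial state, and all function names are hypothetical values chosen for illustration. They give beta/gamma = 2.2, the larger of the two basic reproductive ratios quoted in the figure captions.

```python
def seir_rhs(state, beta, alpha, gamma):
    """Right-hand side of a textbook SEIR system (assumed form, no births or deaths)."""
    s, e, i, r = state
    n = s + e + i + r
    infection = beta * s * i / n  # new exposures per unit time
    return (-infection, infection - alpha * e, alpha * e - gamma * i, gamma * i)


def rk4_step(state, dt, beta, alpha, gamma):
    """Advance the state by one classical Runge-Kutta step of length dt."""
    def shifted(k, h):
        return tuple(x + h * dx for x, dx in zip(state, k))

    k1 = seir_rhs(state, beta, alpha, gamma)
    k2 = seir_rhs(shifted(k1, dt / 2), beta, alpha, gamma)
    k3 = seir_rhs(shifted(k2, dt / 2), beta, alpha, gamma)
    k4 = seir_rhs(shifted(k3, dt), beta, alpha, gamma)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_seir(initial, beta, alpha, gamma, days, dt=0.1):
    """Return the list of (S, E, I, R) states from time 0 to `days`."""
    steps = int(round(days / dt))
    path = [tuple(initial)]
    for _ in range(steps):
        path.append(rk4_step(path[-1], dt, beta, alpha, gamma))
    return path


# Hypothetical setup: 99 susceptible, 1 infectious, rates per day.
path = simulate_seir((99.0, 0.0, 1.0, 0.0), beta=0.66, alpha=0.5, gamma=0.3, days=120)
final = path[-1]
peak_infectious = max(point[2] for point in path)
print("final (S, E, I, R):", [round(x, 2) for x in final])
print("peak infectious count:", round(peak_infectious, 2))
```

Because the four derivatives sum to zero, the total population stays at 100 throughout, which is a quick sanity check on any implementation of this system.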

Figures

Figure 1.
An example solution to an SEIR system specified in Equation (1), in a population of size 100.
Figure 2.
Google Flu Trends estimated ILI percentages (dashed line) and CDC ILI surveillance percentages (solid line) for the United States, from June 2003 until September 2009. Separate plots correspond to separate influenza years, with each new influenza season starting in autumn and ending in spring. Note that CDC did not post ILI reports in summers prior to 2009, and thus no solid line appears during summer months prior to 2009.
Figure 3.
Google Flu Trends ILI surveillance in nine representative states, 2003–2009. The states were chosen to span a range of health care preparedness criteria based on the results published in the American College of Emergency Physicians 2009 Report. The states ranked among the best in quality of health care are Maryland, Massachusetts, and Pennsylvania. The states that ranked low in the areas of “disaster preparedness,” “emergency care access,” and “public health” include South Carolina, Oklahoma, Mississippi, South Dakota, Tennessee, and Arkansas. Note that for some states, search term counts were too low to produce Flu Trends surveillance data in the early years, from 2003 through 2005.
Figure 4.
Normality assumption checks: the left column shows the box plots of growth rates and the right column shows the empirical (unfilled circles) and normal cumulative distribution functions (CDFs) (filled circles). The top row shows the 2003/2004 season and the bottom row shows the 2008/2009 season.
Figure 5.
Flu tracking results in the United States for the 2003/2004 influenza season. In the I plot (second plot in the top row), the points represent weekly Google Flu Trends values, while the lines correspond to the 2.5th percentile, median, and 97.5th percentile of the posterior distribution of the infectious state I_t as time progresses. In the other plots, the two lines present the 2.5th and 97.5th percentiles, while the points present the weekly posterior medians. The Bayes factor results for the two competing basic reproductive ratios (1.25 vs. 2.2), under 1:1 prior odds, are presented in the last panel, with a higher log-Bayes factor meaning stronger evidence in favor of seasonal epidemics.
Figure 6.
Flu tracking results in the United States for the 2008/2009 influenza season. In the I plot, the points represent weekly Google Flu Trends values, while the lines correspond to the 2.5th percentile, median, and 97.5th percentile of the posterior distribution of the infectious state I_t as time progresses. In the other plots, the two lines present the 2.5th and 97.5th percentiles, while the points present the weekly posterior medians. The Bayes factor results for the two competing basic reproductive ratios (1.25 vs. 2.2), under 1:1 prior odds, are presented in the last panel, with a higher log-Bayes factor meaning stronger evidence in favor of seasonal epidemics.
Figure 7.
Sensitivity analysis under two additional priors on the transmission rate: the gray lines correspond to a prior with a β mean of 1.4 and the black lines correspond to an “optimistic” prior with a β mean of 1.1. The three black and gray line sets in the left column plots correspond to the 97.5th percentile, posterior mean, and 2.5th percentile of the sequentially simulated marginal posteriors of the transmission parameter. The right column shows the log-Bayes factors, under 1:1 prior odds, with higher log-Bayes factors indicating support for a regular epidemic. The top row shows the 2003/2004 season and the bottom row shows the 2008/2009 season.
Figure 8.
Sensitivity analysis for the 1-week-ahead prediction under two different priors on the transmission rate: the right column corresponds to a prior with a β mean of 1.4 and the left column to an “optimistic” prior with a β mean of 1.1. The three gray lines correspond to the 97.5th percentile, posterior mean, and 2.5th percentile of the sequentially simulated predictive distributions, while the black line with points corresponds to the observed data. The top row shows the 2003/2004 season and the bottom row shows the 2008/2009 season. One-week-ahead prediction shows little sensitivity to the priors.
Figure 9.
Sequential posterior distributions for the state-space SEIR model (left column) and the simple AR(1) benchmark model (right column) presented in Section 3, for the 2003/2004 flu season. The top row presents results for the growth rate of the infected population and the bottom row for the infected population fraction. The black circles correspond to the observations, gray squares (with gray line) are the fitted weekly values, and gray dashed lines are the 95% pointwise credible intervals. The AR(1) model is unable to capture the structure of the process as well as the state-space SEIR model.
Figure 10.
Comparison of the one-step ahead forecasts produced by the state-space SEIR model (left column) and the simple AR(1) benchmark model (right column) presented in Section 3. The top row presents results for the 2003/2004 flu season and the bottom row for the 2008/2009 flu season. The black squares correspond to the observations, gray circles (with gray line) are the predicted values (using data up to the previous week only), and gray dashed lines are the 95% pointwise credible intervals for the predictions. The AR(1) model predictions are not very accurate and reflect the inability of this simple model to capture the structure of the epidemic process well. The relative mean squared error of the AR(1) model versus the state-space SEIR model is 5.09 for the 2003/2004 season and 2.34 for the 2008/2009 season.
Figure 11.
Flu tracking results in South Dakota (top row) and Oklahoma (bottom row) for the 2008/2009 influenza season. In the I plots (first plots in each row), the points represent weekly Google Flu Trends values, while the lines correspond to the 2.5th percentile, median, and 97.5th percentile of the posterior distribution of I_t as time progresses. In the other plots, the two lines present the 2.5th and 97.5th percentiles, while the points present the weekly posterior medians. The log-Bayes factor results for the two competing basic reproductive ratios, a mild one (1.25) and a severe one (2.2), under 1:1 prior odds, are presented in the last panel. There seems to be little evidence for a pandemic.
Figure 12.
Comparison of posterior distributions between the sequential learning algorithm and MCMC, at the end of the 2003/2004 U.S. flu season. Gray histograms correspond to the marginal posterior distributions obtained via MCMC (based on 1,500 samples), while the white histograms correspond to those obtained via the sequential learning algorithm (“SLA”) proposed in this article, based on 1,000,000 particles.
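
Several captions above report log-Bayes factors comparing basic reproductive ratios of 1.25 and 2.2, with positive values favoring the milder, seasonal scenario. The sketch below illustrates that idea with a generic bootstrap particle filter, which is not the sequential learning algorithm proposed in the article. Everything here is an assumption made for illustration: the SEIR form, the daily Euler steps, the log-normal noise on the transmission rate, the Gaussian observation model, all parameter values, the function names, and the synthetic noise-free observations generated under a ratio of 1.25.

```python
import math
import random


def seir_week(state, beta, alpha, gamma, substeps=7):
    """Advance SEIR population fractions by one week using daily Euler steps."""
    s, e, i, r = state
    dt = 7.0 / substeps
    for _ in range(substeps):
        new_exposed = beta * s * i * dt
        new_infectious = alpha * e * dt
        new_recovered = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
    return (s, e, i, r)


def log_normal_pdf(x, mean, sd):
    """Log density of a Gaussian observation."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_marginal_likelihood(observations, r0, rng, n_particles=500,
                            alpha=0.5, gamma=0.3, beta_sd=0.1, obs_sd=0.002,
                            initial_state=(0.999, 0.0, 0.001, 0.0)):
    """Bootstrap particle filter estimate of log p(observations | r0)."""
    particles = [initial_state] * n_particles
    total = 0.0
    for y in observations:
        proposed, log_w = [], []
        for state in particles:
            # Each particle draws its own noisy weekly transmission rate.
            beta = r0 * gamma * math.exp(rng.gauss(0.0, beta_sd))
            new_state = seir_week(state, beta, alpha, gamma)
            proposed.append(new_state)
            log_w.append(log_normal_pdf(y, new_state[2], obs_sd))
        norm = log_sum_exp(log_w)
        total += norm - math.log(n_particles)  # log of the mean weight
        weights = [math.exp(lw - norm) for lw in log_w]
        particles = rng.choices(proposed, weights=weights, k=n_particles)
    return total


# Synthetic weekly infectious fractions generated under the mild scenario.
GAMMA = 0.3
truth = (0.999, 0.0, 0.001, 0.0)
observations = []
for _ in range(15):
    truth = seir_week(truth, 1.25 * GAMMA, 0.5, GAMMA)
    observations.append(truth[2])

rng = random.Random(0)
log_ml_seasonal = log_marginal_likelihood(observations, 1.25, rng)
log_ml_pandemic = log_marginal_likelihood(observations, 2.2, rng)
log_bayes_factor = log_ml_seasonal - log_ml_pandemic
print("log-Bayes factor (seasonal vs. pandemic):", round(log_bayes_factor, 1))
```

Running the filter once per candidate ratio and differencing the accumulated log marginal likelihoods gives the log-Bayes factor after the last observation. Recording the running difference after each week would give a sequential trace like the ones the captions describe.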

References

    1. American College of Emergency Physicians. (2009), The National Report Card on the State of Emergency Medicine, Irving, Texas: American College of Emergency Physicians. Available at: http://www.emreportcard.org/uploadedFiles/ACEP-ReportCard-10-22-08.pdf
    2. Anderson RM, and May RM (1991), Infectious Diseases of Humans: Dynamics and Control, Oxford: Oxford University Press.
    3. Arulampalam M, Maskell S, Gordon N, and Clapp T (2002), “A Tutorial on Particle Filters for On-Line Nonlinear/Non-Gaussian Bayesian Tracking,” IEEE Transactions on Signal Processing, 50, 174–188.
    4. Atkinson KE (1978), Introduction to Numerical Analysis, New York: Wiley.
    5. Cappé O, Godsill S, and Moulines E (2007), “An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo,” IEEE Proceedings in Signal Processing, 95, 899–924.
