Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2003 Jul 1;7(3):301-319.
doi: 10.1023/A:1024084221803.

A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets

Affiliations

A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets

Greg Ridgeway et al. Data Min Knowl Discov. .

Abstract

Markov chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time the increasing prevalence of massive datasets and the expansion of the field of data mining has created the need for statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration eliminating their candidacy as feasible data mining techniques.In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling. Computation of the importance weights requires a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the "rejuvenation" step used in particle filters for dynamic systems models, sidesteps the loss of efficiency with only a slight increase in the number of data accesses.To show proof-of-concept, we demonstrate the method on two examples. The first is a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is not affected while offering a 99% reduction in data accesses. The second example applies the method to Bayesian logistic regression and yields a 98% reduction in data accesses.

PubMed Disclaimer

Figures

Figure 1
Figure 1
The Metropolis-Hastings algorithm
Figure 2
Figure 2
Importance sampling for massive datasets
Figure 3
Figure 3
Comparison of f(θ|D1, D2) and f (θ|D1)
Figure 4
Figure 4
The resample-move step. 1) generate an initial sample from f(θ|D1). The ticks mark the particles, the sampled θi. 2) Weight based on f (θ|D1, D2) and resample, the length of the vertical lines indicate the number of times resampled. 3) For each θi perform an MCMC step to diversify the sample.
Figure 5
Figure 5
Particle filtering for massive datasets
Figure 6
Figure 6
The frequency of access by observation. The marks along the x-axis refer to occurrences of the rejuvenation step.
Figure 7
Figure 7
The posterior distribution of the transition probabilities for one of the transition matrices. The density is based on the particle filter. The vertical line marks the true value used to simulate the dataset.
Figure 8
Figure 8
Frequency of access by observation for the outpic example. The inward tickmarks on the x-axis indicate rejuvenation steps.

References

    1. Andrieu C, de Freitas N, Doucet A, Jordan MI. ‘An Introduction to MCMC for Machine Learning’. Machine Learning. 2003;50(12)
    1. Besag J, Green P, Higdon D, Mengersen K. Bayesian computation and stochastic systems (with Discussion) Statistical Science. 1995;10:3–41.
    1. Cadez I, Heckerman D, Meek C, Smyth P, White S. ‘Visualization of navigation patterns on a Web site using model-based clustering’. Microsoft Research; 2000. Technical Report MSR-TR-00-18.
    1. Carlin B, Louis T. Bayes and Empirical Bayes Methods for Data Analysis. 2 Boca Raton, FL: Chapman and Hall; 2000.
    1. Chopin N. A sequential particle filter method for static models. Biometrika. 2002;89(3):539–552.