Front Neurosci. 2013 Jun 14;7:105.
doi: 10.3389/fnins.2013.00105. eCollection 2013.

Making predictions in a changing world-inference, uncertainty, and learning


Jill X O'Reilly. Front Neurosci.

Abstract

To function effectively, brains need to make predictions about their environment based on past experience, i.e., they need to learn about their environment. The algorithms by which learning occurs are of interest to neuroscientists, both in their own right (because they exist in the brain) and as a tool to model participants' incomplete knowledge of task parameters and hence to better understand their behavior. This review focuses on a particular challenge for learning algorithms: how to match the rate at which they learn to the rate of change in the environment, so that they use as much observed data as possible whilst disregarding irrelevant, old observations. To do this, algorithms must evaluate whether the environment is changing. We discuss the concepts of likelihood, priors, and transition functions, and how these relate to change detection. We review expected and estimation uncertainty, and how these relate to change detection and learning rate. Finally, we consider the neural correlates of uncertainty and learning. We argue that the neural correlates of uncertainty resemble neural systems that are active when agents actively explore their environments, suggesting that the mechanisms by which the rate of learning is set may be subject to top-down control (in circumstances when agents actively seek new information) as well as bottom-up control (by observations that imply change in the environment).

Keywords: Bayes theorem; change detection; exploratory behavior; learning; modeling; uncertainty.


Figures

Figure 1
Algorithms with a fixed temporal discount do not fit well to environments with a variable rate of change. The right-hand panels illustrate an environment in which observations are drawn from a Gaussian distribution; each row shows a different learning algorithm's estimate of the distribution mean μ. The mean μ, which has periods of stability interspersed with sudden change, is shown in black. Actual observations x are shown in gray. Estimates of μ are shown in blue. The top three rows are kernel-based learning algorithms with different time constants. The left-hand panels illustrate the three weighting functions (kernels) used to determine the weighting of observations in the panels next to them. The weighting w(j) assigned to the observation made j trials before observation i, when calculating the estimated mean on observation i, is defined by the exponential function w(j) = exp(−j/n). The rate of decay is determined by the constant n, with higher values of n meaning a longer period of the past is used. The top row shows a kernel using only very recent observations. This tracks the mean μ well, but jumps around a lot with individual observations; note the blue line tracks the gray (data) line more closely than it tracks the actual mean μ (black line). The second and third rows show kernels using longer periods of the past. This gives a much smoother estimate, but is slow to adjust to changes in μ. The bottom row shows the output of a Bayesian learning algorithm that includes an additional level of processing in order to detect change points. Note how, unlike the kernel-based algorithms, its estimate is stable during periods of stability and changes rapidly in response to change in the underlying distribution.
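The exponentially weighted kernel estimator described in this caption can be sketched in a few lines of Python. This is a minimal illustration; the function name and the test data are my own choices, not taken from the paper.

```python
import math

def exp_kernel_estimate(xs, n):
    """Weighted mean of past observations, with weight w(j) = exp(-j/n)
    assigned to the observation j steps in the past (j = 0 is the newest)."""
    weights = [math.exp(-j / n) for j in range(len(xs))]
    # pair each weight with observations from newest to oldest
    num = sum(w * x for w, x in zip(weights, reversed(xs)))
    return num / sum(weights)

# A short time constant tracks an abrupt change quickly; a long one lags.
data = [0.0] * 50 + [10.0] * 5          # mean jumps from 0 to 10
fast = exp_kernel_estimate(data, n=2)   # ~9.2: adapts quickly, but noisy
slow = exp_kernel_estimate(data, n=50)  # ~1.4: smooth, but slow to adapt
```

The trade-off shown in the top three rows of the figure is exactly this choice of n: small n tracks the data point-by-point, large n smooths but lags behind change points.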
Figure 2
Relationship between the concepts of Expected Uncertainty and Likelihood. Plot of values of some observed variable x against their probability, given two Gaussian distributions with the same mean. The red distribution has a lower variance, and hence lower expected uncertainty, than the blue distribution. Points a and b represent possible observed values of x. For the red and blue distributions, the distance from the mean (a − μ) is the same, but at a, the red distribution has higher likelihood (because point a has a higher probability under the red distribution than the blue distribution) whilst at point b, the blue distribution has a higher likelihood. Consider an algorithm assessing evidence that the environment has changed. If a datapoint x = b is observed, whether the algorithm infers that there has been a change will depend on the variance or expected uncertainty of the putative pre-change distribution. If the algorithm “thinks” that the red distribution is in force, an observation x = b is relatively strong evidence for a change in the environment (as b is unlikely under the red distribution) but if the algorithm “thinks” the blue distribution is in force, the evidence for change is much weaker, since point b is not so unlikely under the blue distribution as it is under the red distribution.
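The likelihood comparison in this caption is easy to verify numerically. A minimal sketch, in which the specific variances and test points are illustrative assumptions rather than values from the figure:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Probability density of x under a Gaussian N(mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu = 0.0
narrow, wide = 1.0, 3.0   # low vs. high expected uncertainty
a, b = 0.5, 4.0           # a point near the mean, a point far from it

# Near the mean, the narrow ("red") distribution assigns higher likelihood...
assert gauss_pdf(a, mu, narrow) > gauss_pdf(a, mu, wide)
# ...but far from the mean, the wide ("blue") distribution does, so the
# same observation is weaker evidence for change under high uncertainty.
assert gauss_pdf(b, mu, narrow) < gauss_pdf(b, mu, wide)
```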
Figure 3
Illustration of estimation uncertainty. These plots show the output of a numerical Bayesian estimation of the parameters of a Gaussian distribution. If x ~ N(μ, σ²), and some values of x are observed, the likelihood of different values for μ, σ² can be calculated jointly using Bayes' rule. The colored plots (left) show the joint likelihood for different pairs of values μ, σ², where each point on the colored image is a possible pair of values μ, σ², and the color represents the likelihood of that pair of values. The line plots (right panel) show the distribution across x implied by different values of μ, σ². The dashed black line is the true distribution from which data were drawn. The blue line is the maximum a-posteriori distribution: a Gaussian distribution with values of μ, σ² taken from the peak of the joint distribution over μ, σ² shown on the left. The red line represents a weighted sum (W.S.) of the Gaussian distributions represented by all possible values of μ, σ², weighted by their joint likelihood as shown in the figure to the left. The top row represents an estimate of the environment based on fewer data points than the bottom row. With relatively few data points, there is a lot of uncertainty about the values of μ, σ², i.e., estimation uncertainty, illustrated by the broader distribution of likelihood over different possible values of μ, σ² (left panel) in the top row than the bottom row. Whilst the maximum a-posteriori distribution is a good fit to the "true" distribution from which data were drawn in both cases, if we look at the weighted sum of all distributions, there is much more uncertainty in the top row, based on fewer data points. Hence, if the observer uses a weighted sum over all possible values of μ, σ² to calculate a probability distribution over x, the variance of that distribution depends on the level of estimation uncertainty.
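A grid-based version of this joint estimate can be sketched as follows. This is a toy illustration assuming a flat prior; the grid values and data are my own choices, not those used to generate the figure.

```python
import math

def grid_posterior(xs, mus, sigmas):
    """Joint posterior over (mu, sigma) pairs on a grid, assuming a flat
    prior and Gaussian observations; log-space sums avoid underflow."""
    loglik = {}
    for mu in mus:
        for s in sigmas:
            loglik[(mu, s)] = sum(
                -0.5 * ((x - mu) / s) ** 2 - math.log(s) for x in xs)
    m = max(loglik.values())
    unnorm = {k: math.exp(v - m) for k, v in loglik.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

mus = [-1.0, -0.5, 0.0, 0.5, 1.0]
sigmas = [0.5, 1.0, 1.5, 2.0]
few = [0.1, -0.2, 0.05]
p_few = grid_posterior(few, mus, sigmas)
p_many = grid_posterior(few * 10, mus, sigmas)
# More data concentrates the posterior: estimation uncertainty shrinks,
# as in the top (few data) vs. bottom (more data) rows of the figure.
assert max(p_many.values()) > max(p_few.values())
```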
Figure 4
Two considerations for evaluating whether a change has occurred. Plots show the probability of observing some value of x, given that x ~ N(μ, σ²) and the values of μ, σ² can jump to new, unpredicted values as defined in Equation 2. When an observation of the environment is made, an algorithm that aims to determine whether a change has occurred should consider both the likelihood of the previous model of the environment given the new data, and the prior probability of change as determined in part by the transition function. Top panel: the probability of an observation taking a value x is shown in terms of two distributions. A Gaussian shown in blue represents the probability density across x if the most likely state of the environment (the most likely values of μ, σ²), given past data, were still in force. The uniform distribution in red represents the probability density across x arising from all the possible new states of the environment, if a change occurred. The possible new states are represented by a uniform function (red line in the figure) because, if we consider the probability of each value of x under an infinite number of possible states at once (i.e., the value of x given each of infinitely many other possible values of μ and σ²), the outcome is a uniform distribution over x. A change should be inferred if an observation occurs in the gray shaded regions, where the probability of x under the uniform (representing change) is higher than the probability under the prior Gaussian distribution. Hence the red data point should cause the system to infer a change has occurred, whereas the blue data point should not. Bottom panel: as above, the probability distribution over x is a combination of a Gaussian and a uniform distribution (representing the most likely parameters of the environment if there has been no change, and the possible new states of the environment if there has been a change, respectively).
In this panel, the Gaussian and Uniform components are summed to give a single line representing the distribution over x. The different colored lines represent different prior probabilities of change, and hence different relative weightings of the Gaussian and uniform components. Increasing the prior probability of change results in a wider distribution of probability density across all possible values of x.
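The decision rule in this figure, comparing a Gaussian "no change" hypothesis against a uniform "change" hypothesis weighted by a prior probability of change, can be sketched like so. The hazard rate and the range of the uniform are illustrative assumptions, not values from the figure.

```python
import math

def infer_change(x, mu, sigma, hazard, lo, hi):
    """True if x is better explained by a jump to an unknown new state
    (uniform density over [lo, hi], prior weight 'hazard') than by the
    current Gaussian model (prior weight 1 - hazard)."""
    gauss = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    uniform = 1.0 / (hi - lo)
    return hazard * uniform > (1 - hazard) * gauss

# An observation near the current mean does not trigger a change...
assert not infer_change(0.1, mu=0.0, sigma=1.0, hazard=0.05, lo=-10.0, hi=10.0)
# ...but one in the tails (the gray shaded region) does.
assert infer_change(5.0, mu=0.0, sigma=1.0, hazard=0.05, lo=-10.0, hi=10.0)
```

Raising the hazard rate moves the crossing points inward, so weaker evidence suffices to infer change, matching the wider summed distributions in the bottom panel.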
Figure 5
Bayesian learner estimates the mean and variance of a Gaussian distribution. (A) Data and maximum likelihood estimates for 200 trials. The actual mean and variance of the distribution from which the data were drawn (generative distribution) are shown in gray: the gray line is the mean and the shaded area is mean ± standard deviation. The model's estimates of these parameters are shown superposed on this, in blue. The actual data points on which the model was trained are shown as black dots. The scale on the y-axis is arbitrary. (B) The probability density function across parameter space (for plotting conventions, see Figure 3) for the first 100 trials. Each parameter-space map represents one trial; trials are shown in rows, with the first trial number in each row indicated to the left of the row. Possible values of μi are plotted on the y-axis; possible values of σi are plotted on the x-axis. Colors indicate the joint posterior probability for each pair (μi, σi) after observing data point xi. Increasing values of σi are plotted from left to right; increasing values of μi are plotted from top to bottom. Hence, for example, on trial 10 (top right) the model thinks μi is low and σi is high. Some interesting sequences of trials are highlighted in Figures 6, 7.
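A trial-by-trial learner of this kind can be sketched by combining a grid likelihood with a constant-hazard transition step. This is a toy sketch: the grid values, hazard rate, and data are my own illustrative choices, not the model actually used for the figure.

```python
import math

def sequential_update(xs, mus, sigmas, hazard):
    """Grid posterior over (mu, sigma), updated one observation at a time.
    After each update, probability mass 'hazard' is redistributed uniformly,
    modeling the possibility that the parameters jumped to new values."""
    grid = [(m, s) for m in mus for s in sigmas]
    post = {k: 1.0 / len(grid) for k in grid}  # flat prior
    for x in xs:
        for (m, s) in grid:
            # Gaussian likelihood of x (constant factor omitted: it cancels)
            post[(m, s)] *= math.exp(-((x - m) ** 2) / (2 * s ** 2)) / s
        z = sum(post.values())
        post = {k: v / z for k, v in post.items()}
        # transition function: with probability 'hazard', parameters jump
        post = {k: (1 - hazard) * v + hazard / len(grid) for k, v in post.items()}
    return post

mus, sigmas = [0.0, 2.5, 5.0], [0.5, 1.0]
stable = [0.0, 0.1, -0.1] * 5    # mean near 0
changed = [5.0, 5.1, 4.9] * 5    # mean jumps to 5
post = sequential_update(stable + changed, mus, sigmas, hazard=0.1)
# After the change, the posterior mode has moved to the new mean.
assert max(post, key=post.get)[0] == 5.0
```

The hazard mixing keeps a floor of probability on every parameter pair, which is what lets the posterior relocate quickly after a change point instead of being locked in by accumulated evidence.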
Figure 6
Learning is faster when expected uncertainty is low. Panels (A) and (B) show two sets of trials which include changes of similar magnitude in the mean of the generative distribution (distribution from which data were in fact drawn). In panel (A), the estimate of σi is high (high expected uncertainty) but in panel (B), the estimate of σi is lower—this is indicated by the distribution of probability density from left to right in the colored parameter-space maps, and also the width of the shaded area μ ± σ on the lower plot. The red boxes indicate the set of trials shown in the parameter space maps; the red arrow shows which parameter space map corresponds to the first trial after the change point. Note that the distribution of probability in parameter space changes more slowly when expected uncertainty is high (panel A), indicating that learning is slower in this case.
Figure 7
Change in the environment increases estimation uncertainty. Here we see a set of trials during which a change point occurs (change point indicated by red arrow). Before the change point, the model has low estimation uncertainty (probability density is very concentrated in a small part of parameter space, as seen from the first three parameter space maps). When the change point is detected, estimation uncertainty increases as the model initially has only one data point on which to base its estimate of the new parameters of the distribution. Over the next few trials, estimation uncertainty decreases (probability density becomes concentrated in a smaller part of parameter space again).
