Active Learning for Discrete Latent Variable Models

Aditi Jha et al. Neural Computation. 2024 Feb 16;36(3):437-474.
doi: 10.1162/neco_a_01646

Abstract

Active learning seeks to reduce the amount of data required to fit the parameters of a model, thus forming an important class of techniques in modern machine learning. However, past work on active learning has largely overlooked latent variable models, which play a vital role in neuroscience, psychology, and a variety of other engineering and scientific disciplines. Here we address this gap by proposing a novel framework for maximum-mutual-information input selection for discrete latent variable regression models. We first apply our method to a class of models known as mixtures of linear regressions (MLR). While it is well known that active learning confers no advantage for linear-gaussian regression models, we use Fisher information to show analytically that active learning can nevertheless achieve large gains for mixtures of such models, and we validate this improvement using both simulations and real-world data. We then consider a powerful class of temporally structured latent variable models given by a hidden Markov model (HMM) with generalized linear model (GLM) observations, which has recently been used to identify discrete states from animal decision-making data. We show that our method substantially reduces the amount of data needed to fit GLM-HMMs and outperforms a variety of approximate methods based on variational and amortized inference. Infomax learning for latent variable models thus offers a powerful approach for characterizing temporally structured latent states, with a wide variety of applications in neuroscience and beyond.


Figures

Figure 8.
Infomax learning for GLM-HMMs. (Left) Posterior entropy of model parameters over the course of 1000 trials when performing infomax learning using our Laplace-based Gibbs sampling approach with a single long chain (red), using parallel chains of our Laplace-based Gibbs sampler (violet), using Polya-Gamma augmented Gibbs sampling (peach), and using random sampling (blue). (Middle, right) Error in recovering the transition matrix and the weights of the GLMs using the same set of methods.
Figure 9.
Gibbs sampling-based infomax learning for GLM-HMMs, with varying lengths of a single Gibbs chain. Left panel shows the posterior entropy of model parameters over the course of 1000 trials using Gibbs sampling with a single long chain. Each trace shows posterior entropy when using a different number of samples obtained from Gibbs sampling; the same samples are used to select the next input. Error bars correspond to the 95% confidence interval of the mean over 5 experiments. Right panel shows error in recovering model parameters while varying the number of samples in the Gibbs chain.
Figure 1.
Discrete latent variable regression models and infomax learning. (A) Schematic of a discrete latent variable model for regression settings. The response y of the model given a stimulus x and a latent z is produced by generalized linear models. Here the discrete latent variable z determines which of the three generalized linear models at the bottom determines the input-output mapping on any trial. (B) Infomax learning for discrete latent variable models. First, on trial t, present an input x_t to the system of interest (e.g., a mouse performing a decision-making task) and record its response y_t. We assume this response depends on the stimulus (input) as well as an internal or latent state z_t, as specified by the model P(y_t | x_t, z_t, θ). Second, update the posterior distribution over model parameters θ given the data collected so far in the experiment, 𝒟_t = {x_{1:t}, y_{1:t}}, using either MCMC sampling or variational inference. Third, select the input for the next trial that maximizes the information gain, i.e., the mutual information between the next response y_{t+1} and the model parameters θ.
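The three-step loop in panel B can be summarized with a sample-based mutual-information estimate: I(y; θ | x, 𝒟) = H(y | x, 𝒟) − E_θ[H(y | x, θ)]. The sketch below is illustrative rather than the authors' implementation: it assumes a plain Bernoulli-GLM observation model (no latent state) and posterior samples of the weights obtained from some MCMC routine; `infomax_select` and all settings are hypothetical.

```python
import numpy as np

def bernoulli_entropy(p):
    # Entropy (in nats) of a Bernoulli(p) response.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def infomax_select(candidates, theta_samples):
    """Pick the input maximizing I(y; theta | x, D) for a Bernoulli-GLM
    observation model, using posterior samples of the weights."""
    best_x, best_mi = None, -np.inf
    for x in candidates:
        # p(y = 1 | x, theta^(m)) for each posterior sample theta^(m)
        probs = 1.0 / (1.0 + np.exp(-theta_samples @ x))
        marginal = bernoulli_entropy(probs.mean())     # H(y | x, D)
        conditional = bernoulli_entropy(probs).mean()  # E_theta[H(y | x, theta)]
        mi = marginal - conditional
        if mi > best_mi:
            best_x, best_mi = x, mi
    return best_x, best_mi
```

Intuitively, the selected input is one where the posterior samples disagree most about the response: the marginal predictive entropy is high while each individual sample is confident.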
Figure 2.
Infomax learning for mixture of linear regressions (MLR) models. (A) Model schematic. At time step t, the system is in state z_t = k with probability π_k. The system generates output y_t using state-dependent weights w_k and independent additive gaussian noise (see equation 6.2). (B) Example two-state model with two-dimensional weights w_1 = (1, 0) and w_2 = (-1, 0). We consider possible inputs on the unit circle, which are the information-maximizing inputs for a linear gaussian model under an L2 norm constraint. (C) Fisher information as a function of the angle between w_1 and the input presented to the system, for different noise variances σ². (D) Comparison between infomax active learning (using MCMC sampling and VI methods), DAD, and random sampling for the 2D MLR model shown above with mixing probabilities π = [0.6, 0.4] and noise variance σ² = 0.1. Error bars reflect the 95% confidence interval (standard error) of the mean across 20 experiments. (E) Performance comparison for the same two-state model but with 10-dimensional weight vectors and inputs. The possible inputs to the system were uniform samples from the 10D unit hypersphere.
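As a concrete illustration of the generative process in panel A, here is a minimal simulator. The settings (two states, unit-circle inputs, π = [0.6, 0.4], σ = 0.1) mirror the example described above but are assumptions for this sketch; the function name is hypothetical.

```python
import numpy as np

def simulate_mlr(X, weights, pi, sigma, rng):
    """Mixture of linear regressions: on each trial draw a state
    z_t ~ Categorical(pi), then y_t = w_{z_t} . x_t + N(0, sigma^2)."""
    T = len(X)
    z = rng.choice(len(pi), size=T, p=pi)
    noise = sigma * rng.standard_normal(T)
    y = np.sum(X * weights[z], axis=1) + noise
    return y, z

# Hypothetical settings mirroring the two-state example in panel B.
rng = np.random.default_rng(0)
weights = np.array([[1.0, 0.0], [-1.0, 0.0]])
pi = np.array([0.6, 0.4])
angles = rng.uniform(0.0, 2.0 * np.pi, size=500)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # inputs on the unit circle
y, z = simulate_mlr(X, weights, pi, sigma=0.1, rng=rng)
```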
Figure 3.
Left: Histogram of the inputs selected by our MCMC-infomax active learning method for 200 trials using a mixture of linear regressions (MLR) model with inputs on a 2D circle. (Mutual information for MLRs is symmetric along the vertical axis in Figure 2; hence, we show only inputs in the range of 0 to 180°.) The histogram shows a drop in probability at 90°, which is predicted by our analysis of Fisher information (see Figure 2C). Right: Equivalent histogram for the DAD method, which did not show the same tendency to avoid inputs at 90°. Instead, the selected inputs covered the unit circle with modes appearing at multiples of approximately 30°; we are unsure why this is the case. (Note that DAD requires a continuous range of inputs; hence, it selected inputs on the entire unit circle as opposed to a discrete set.)
Figure 4.
Application of infomax learning to the California housing data set (Kelley Pace & Barry, 1997). (A) Best-fitting mixing weights for a three-state MLR fit to 5000 samples of the data set. (B) Best-fitting state weights for a three-state MLR fit to 5000 samples of the California housing data set. Orange, green, and blue represent states 1, 2, and 3, respectively; black represents the linear regression fit. (C) BIC as the number of MLR states is varied from 1 (standard linear regression) to 5. We select the three-state model, as the BIC begins to level off beyond three states. (D) Posterior entropy between the three-state MLR parameters obtained using 5000 samples (parameters shown in panels A and B) and the recovered parameters, as a function of the number of samples, for random sampling (blue) and MCMC-infomax sampling (red). Error bars reflect the 95% confidence interval of the mean across 10 experiments. (E) The same as in panel D but for the RMSE (root mean squared error). (F) Visualization of the standard deviation of 500 inputs selected by both infomax (red) and random sampling (blue). Each dot corresponds to a different experiment. An examination of panel B makes clear that the three states differ most in the weights placed on the AveOccup, Latitude, and Longitude covariates. All 10 infomax experiments select inputs with greater variance for the latitude and longitude covariates than are selected by the random sampling experiments.
Figure 5.
Infomax for GLM-HMMs. (A) Data generation process for the GLM-HMM. At time step t, the system generates output y_t based on its input x_t and the latent state z_t. The system then either remains in the same state or transitions into a new state at trial t+1, with the probabilities given by the entries of the transition matrix A. (B) Example settings for the transition matrix and state GLMs for a three-state GLM-HMM. These are the settings we use to generate output data for the analyses shown in panels C and D. (C) Left: Posterior entropy over the course of 1000 trials for random sampling (blue), infomax with a single GLM (gray), and infomax for the full GLM-HMM using variational inference (VI) and MCMC sampling (magenta and red, respectively). Middle: Root mean squared error for the recovered transition matrix for each of the input-selection schemes (random / infomax with GLM / infomax with GLM-HMM (MCMC) / infomax with GLM-HMM (VI)). Right: Root mean squared error for the weight vectors of the GLM-HMM for each of the input-selection schemes. (D) Selected inputs for random sampling (blue), active learning under model mismatch where the model used for infomax is a single GLM (gray), and active learning with infomax (using MCMC sampling) and the full GLM-HMM (red). Selected inputs over the course of 1000 trials are plotted on top of the generative GLM curves.
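The generative process in panel A can be sketched in a few lines. The following is an illustration rather than the paper's code: the three-state "sticky" transition matrix, the one-dimensional GLM weights, and the Bernoulli (logistic) observation model are all assumed for concreteness.

```python
import numpy as np

def simulate_glmhmm(X, A, W, rng):
    """GLM-HMM: y_t ~ Bernoulli(sigmoid(w_{z_t} . x_t)), with the state
    evolving as z_{t+1} ~ A[z_t, :] (each row of A sums to 1)."""
    T, K = len(X), A.shape[0]
    z = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    z[0] = rng.integers(K)  # assume a uniform initial-state distribution
    for t in range(T):
        p = 1.0 / (1.0 + np.exp(-W[z[t]] @ X[t]))
        y[t] = int(rng.random() < p)
        if t + 1 < T:
            z[t + 1] = rng.choice(K, p=A[z[t]])
    return y, z

# Hypothetical three-state settings with self-persistent ("sticky") states.
rng = np.random.default_rng(1)
A = np.where(np.eye(3, dtype=bool), 0.95, 0.025)  # each row sums to 1
W = np.array([[4.0], [1.0], [-2.0]])              # one GLM weight per state
X = rng.uniform(-1.0, 1.0, size=(1000, 1))
y, z = simulate_glmhmm(X, A, W, rng)
```

Because the diagonal of A dominates, the latent state persists over many consecutive trials, which is the temporal structure that distinguishes the GLM-HMM from a memoryless mixture of GLMs.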
Figure 6.
Inferring latent states. (Top) The true latent states of the data-generating GLM-HMM for 100 trials. (Middle) The posterior probabilities of states using a GLM-HMM trained with infomax learning on 400 trials from the data-generating GLM-HMM. (Bottom) The same for a GLM-HMM trained using random sampling on 400 trials from the data-generating GLM-HMM.
Figure 7.
Infomax learning for mixtures of GLMs (MGLMs). (A) Data generation model. Example settings for a two-state MGLM along with the mixing weights for the two states. (B) Posterior entropy of model parameters over the course of 2000 trials for random sampling (blue) and MCMC-sampling-based infomax learning for the MGLM (red). (C) Root mean squared error for the recovered GLM weights and mixing weights for each of the two input-selection schemes.

References

    1. Anderson B, & Moore A (2005). Active learning for hidden Markov models: Objective functions and algorithms. In Proceedings of the 22nd International Conference on Machine Learning (pp. 9–16).
    2. Ashwood ZC, Roy NA, Stone IR, Urai AE, Churchland AK, Pouget A, & Pillow JW (2022). Mice alternate between discrete strategies during perceptual decision-making. Nature Neuroscience, 25(2), 201–212. 10.1038/s41593-021-01007-z - DOI - PMC - PubMed
    3. Bak JH, Choi J, Witten I, Akrami A, & Pillow JW (2016). Adaptive optimal training of animal behavior. In Lee D, Sugiyama M, Luxburg U, Guyon I, & Garnett R (Eds.), Advances in neural information processing systems, 29 (pp. 1939–1947). Curran.
    4. Bak JH, & Pillow JW (2018). Adaptive stimulus selection for multialternative psychometric functions with lapses. Journal of Vision, 18(12), 4. 10.1167/18.12.4 - DOI - PMC - PubMed
    5. Behboodian J (1972). Information matrix for a mixture of two normal distributions. Journal of Statistical Computation and Simulation, 1(4), 295–314. 10.1080/00949657208810024 - DOI
