Review

Generative models, linguistic communication and active inference

Karl J Friston et al. Neurosci Biobehav Rev. 2020 Nov;118:42-64. doi: 10.1016/j.neubiorev.2020.07.005. Epub 2020 Jul 17.

Abstract

This paper presents a biologically plausible generative model and inference scheme that is capable of simulating communication between synthetic subjects who talk to each other. Building on active inference formulations of dyadic interactions, we simulate linguistic exchange to explore generative models that support dialogues. These models employ high-order interactions among abstract (discrete) states in deep (hierarchical) models. The sequential nature of language processing mandates generative models with a particular factorial structure, necessary to accommodate the rich combinatorics of language. We illustrate linguistic communication by simulating a synthetic subject who can play the 'Twenty Questions' game. In this game, synthetic subjects take the role of the questioner or answerer, using the same generative model. This simulation setup is used to illustrate some key architectural points and to demonstrate that many behavioural and neurophysiological correlates of linguistic communication emerge under variational (marginal) message passing, given the right kind of generative model. For example, we show that theta-gamma coupling is an emergent property of belief updating, when listening to another.

Keywords: Bayesian; Connectivity; Free energy; Hierarchical; Inference; Language; Message passing; Neuronal.

Conflict of interest statement

None.

Figures

Fig. 1
Active inference and Markov blankets. This figure illustrates the conditional dependencies among the various states that constitute (active) inference about external states of affairs in the world. Active inference rests upon a four-way partition into external states (s) and internal states (the sufficient statistics s, π of posterior beliefs), separated by Markov blanket states (o, u). Technically, the Markov blanket of the internal states comprises their parents, their children, and the parents of their children. In this figure, blanket states correspond to the pale blue circles. Blanket states comprise observations or outcomes (o) and action (u). The upper panel illustrates the standard way in which conditional dependencies are mediated: internal states are treated as encoding representations of external states. These representations prescribe action on external states, which generates outcomes. In this construction, internal states play the role of sufficient statistics or parameters of a posterior belief (Q) about external states and the plans or policies that are realised by action. These beliefs are optimised by minimising a free energy functional of posterior beliefs, given outcomes. Posterior beliefs about policies provide a probability distribution from which the next action is sampled. This action changes external states, which generate outcomes, and so the (perception-action) cycle continues. The lower panel shows the simplified scheme used in this paper, labelled ‘Diachronic inference’. In this setting, actions (u) and outcomes (o) are assumed to be isomorphic. In other words, I act by generating an outcome that minimises free energy. This is equivalent to generating or selecting the outcomes that are most likely under my beliefs about the causes of those outcomes. Because these outcomes are shared between two (or more) agents, they constitute the Markov blanket that separates the internal states of every agent in the exchange. This means the internal states of one agent now constitute the external states of another (and vice versa). Crucially, this rests upon diachronic switching, in which only one agent generates outcomes at any one time. Heuristically, this means that I can either listen or speak, but not both at once. With this particular constraint on conditional dependencies, the shared outcomes (e.g., spoken words) constitute the blanket states that are shared by all agents. The superscripts in the lower panel denote two agents (i and j). The equations express the sampling of various states, or their minimisation with respect to variational free energy. An interesting aspect of the diachronic setup is that everything minimises a free energy, effectively resolving uncertainty, such that the beliefs of one agent are installed in another via an exchange of outcomes.
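The equations themselves are not reproduced in this caption. A minimal sketch of their standard active inference form (an assumption based on the usual notation, not a transcription of the figure; F is variational free energy, Q approximate posterior beliefs, superscripts index agents) is:

```latex
% A minimal sketch in assumed standard notation: perception minimises
% variational free energy, and the speaking agent acts by emitting the
% outcome that is most plausible under its own posterior beliefs.
\begin{aligned}
F^{(i)} &= \mathbb{E}_{Q^{(i)}(s)}\!\left[\ln Q^{(i)}(s) - \ln P(o, s)\right]
  && \text{(variational free energy of agent } i\text{)} \\
Q^{(i)} &= \operatorname*{arg\,min}_{Q} F^{(i)}
  && \text{(perception: belief updating)} \\
o_t &= \operatorname*{arg\,max}_{o}\; \mathbb{E}_{Q^{(i)}(s_t)}\!\left[\ln P(o \mid s_t)\right]
  && \text{(action: generating outcomes, when agent } i \text{ speaks)}
\end{aligned}
```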
Fig. 2
A generative model for Twenty Questions: This figure provides a schematic illustration of the generative model. The schematic displays the architecture involved in generating a sequence of words that could constitute a language-like narrative. In brief, this is a hierarchical (i.e., deep) generative model formulated in terms of discrete hidden states and outcomes (here, outcomes from the lower level are single words). The architecture is deep because there are two levels of hidden states, where the higher (deeper) level unfolds slowly in time, furnishing contextual constraints on the lower level that generates a sequence of words. The higher level contains hidden factors that generate the syntax and semantic content of a sentence, which are passed to the lower level. Each panel uses a coloured shape to describe the different states of each factor. At the higher level, transitions among narrative states (B(2)) generate a sequence of phrases that cycle in a particular order through “Prompts”, “Questions” and “Answers”, where their form depends upon interactions with other hidden states in the generative model. The form of questions has been factorised into the type of question (“Shape”, “Location”, or “Colour”) and its semiotic content. The semiotic content has three factors (noun, adjective and adverb), each with two states (noun: “square” or “triangle”; adjective: “green” or “red”; adverb: “below” or “above”). Similarly, the four scenic factors correspond to beliefs about the attributes of the upper and lower objects in the world; namely, their colours (green or red) and shapes (square or triangle). In this generative model, choices about the type of question and its semiotic content are policy-dependent, as illustrated by the red arrows. In other words, policies determine transitions (encoded by the B(2) matrices) among controllable states, so that question and semiotic states are selected intentionally. For example, the question generated by the thick red arrows in the figure would be: “Is a red triangle above?” The combination of these states completely determines the syntax and semantic content of a sentence at the lower level (encoded in the matrix D(1)). The hidden syntax states at the lower level comprise specific words, such as “Ready” and “Is”, grammar, such as “?” or “!”, and abstract representations, such as noun, adverb, and adjective. The words denoted by the abstract representations are determined by the semantic factor, which is isomorphic with the semiotic factor of the higher level. The first word of the phrase corresponds to the initial syntactic state at the lower level, which is determined by the interactions among states at the higher level, encoded by the mapping D. For example, if the narrative state is a Question, then the initial syntax state is the word “Is”, no matter which of the three question states is selected at the higher level. The B(1) matrices then determine subsequent words (illustrated by the black arrows), by specifying transitions among syntax states that do depend upon the question states at the higher level. However, if the narrative state is Answer, then the initial syntax state can be “Yes” or “No”, depending upon high-order interactions among the remaining high-level states: a “Yes” will be generated when, and only when, scenic and semiotic states are congruent (e.g., if the question “Is a red triangle above?” admitted a positive response, because a red triangle is in the upper location).
For clarity, some syntax states have been omitted; for example, a “Not sure” answer. In addition, this figure omits embellishments that generate synonymous phrases (e.g., “not sure”, “can't say”, and so on). The final stage is to map states at the lower level to outcomes at each time step of the generative process. This is denoted by the likelihood mapping A(1). In the example highlighted here, the articulated word “triangle” depends upon the current syntax state being a noun and the associated content being a “triangle”. States without arrows are absorbing states; in other words, the state only transitions to itself.
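To make the factorial structure concrete, here is a minimal sketch of the higher-level factors and the congruence rule that generates a “Yes” answer. All names are hypothetical; the actual model is specified as A, B, C and D arrays in SPM (MATLAB) and differs in detail:

```python
# A minimal sketch (hypothetical names) of the factorial hidden-state
# structure described above, not the paper's actual specification.
higher_level_factors = {
    "narrative": ["Prompt", "Question", "Answer"],       # cycles in this order
    "question_type": ["Shape", "Location", "Colour"],    # policy-dependent
    "noun": ["square", "triangle"],                      # semiotic content
    "adjective": ["green", "red"],
    "adverb": ["below", "above"],
    # Four scenic factors: attributes of the upper and lower objects.
    "upper_colour": ["green", "red"],
    "upper_shape": ["square", "triangle"],
    "lower_colour": ["green", "red"],
    "lower_shape": ["square", "triangle"],
}

def answer(noun, adjective, adverb, scene):
    """'Yes' if, and only if, semiotic states are congruent with scenic states."""
    location = "upper" if adverb == "above" else "lower"
    congruent = (scene[f"{location}_colour"] == adjective
                 and scene[f"{location}_shape"] == noun)
    return "Yes" if congruent else "No"

scene = {"upper_colour": "red", "upper_shape": "triangle",
         "lower_colour": "green", "lower_shape": "square"}
print(answer("triangle", "red", "above", scene))  # -> "Yes"
```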
Fig. 3
Generative models for discrete states and outcomes. Upper left panel: These equations specify the generative model. A generative model is the joint probability of outcomes and their (latent or hidden) causes (see the first equation). Usually, the model is expressed in terms of a likelihood (the probability of consequences given causes) and priors over causes. When a prior depends upon a random variable, it is called an empirical prior. Here, the likelihood is specified by a matrix A, whose elements are the probability of an outcome under every combination of hidden states. The empirical priors pertain to probabilistic transitions (in the B matrix) among hidden states that can depend upon action, which is determined probabilistically by policies (sequences of actions encoded by π). The key aspect of this generative model is that policies are more probable a priori if they minimise the expected free energy G, which depends upon prior preferences about outcomes or costs encoded by C. Finally, the vector D specifies the initial state. This completes the specification of the model in terms of its parameters; namely, A, B, C and D. Bayesian model inversion refers to the inverse mapping from outcomes to causes; i.e., estimating the hidden states that cause outcomes. In approximate Bayesian inference, one specifies the form of an approximate posterior distribution. The particular form used in this paper is a mean-field approximation, in which posterior beliefs are approximated by the product of marginal distributions over time points. Subscripts index time (or policy). See Section 2 and Table 1 for a detailed explanation of the variables (italic variables represent hidden states, while bold variables indicate expectations about those states). Upper right panel: This Bayesian network represents the conditional dependencies among hidden states and how they cause outcomes. Open circles are random variables (hidden states and policies), while filled circles denote observable outcomes. Squares indicate fixed or known quantities, such as the model parameters. Lower left panel: These equalities are the belief updates mediating approximate Bayesian inference and outcome selection. When the agent is responsible for generating outcomes (e.g., when speaking), outcomes are selected to minimise free energy or, in other words, to maximise accuracy under posterior beliefs about the next state of the world. Lower right panel: This is an equivalent representation of the Bayesian network in terms of a Forney or normal style factor graph. Here, the nodes (square boxes) correspond to factors and the edges are associated with unknown variables. Filled squares denote observable outcomes. The edges are labelled in terms of the sufficient statistics of their marginal posterior. Factors have been labelled in terms of the parameters encoding the associated probability distributions (on the upper left). The circled numbers correspond to the messages that are passed from nodes to edges (the labels are placed on the edge that carries the message from each node). The key aspect of this graph is that it discloses the messages that contribute to the posterior marginal over hidden states; here, conditioned on each policy. These constitute [forward: ❷] messages from representations of the past, [backward: ❸] messages from the future and [likelihood: ❹] messages from the outcome.
Crucially, the past and future are represented at all times, so that as new outcomes become available with the passage of time, more likelihood messages participate in the message passing, thereby providing more informed (approximate) posteriors. This effectively performs online data assimilation (mediated by forward messages) that is informed by prior beliefs concerning future outcomes (mediated by backward messages). Please see Table 1 for a definition of the variables in this figure. Adapted with permission from (Friston et al., 2017c).
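The caption enumerates forward (❷), backward (❸) and likelihood (❹) messages without writing out how they combine. A standard form from the marginal message passing literature (an assumption consistent with the caption's notation, not a transcription of the figure; σ is a softmax and a dagger denotes a normalised transpose) is:

```latex
% Assumed standard form of the marginal message passing update: the posterior
% expectation over hidden states at time \tau, under policy \pi, combines
% messages from the past (2), the future (3) and the current outcome (4).
\mathbf{s}_{\tau}^{\pi} \;=\; \sigma\Big(
      \tfrac{1}{2}\ln\!\big(\mathbf{B}_{\tau-1}^{\pi}\,\mathbf{s}_{\tau-1}^{\pi}\big)
    + \tfrac{1}{2}\ln\!\big(\mathbf{B}_{\tau}^{\pi\dagger}\,\mathbf{s}_{\tau+1}^{\pi}\big)
    + \ln\!\big(\mathbf{A}^{\top} o_{\tau}\big)
\Big)
```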
Fig. 4
Deep temporal models. Left panel: This figure provides the Bayesian network and associated Forney factor graph for deep temporal models, described in terms of factors and belief updates on the left. The graphs adopt the same format as Fig. 3; however, here the model has been extended hierarchically, where (bracketed) superscripts index the hierarchical level. The key aspect of this model is its hierarchical structure, which represents sequences of hidden states over time or epochs. In this model, hidden states at higher levels generate the initial states for lower levels, which unfold to generate a sequence of outcomes: cf. associative chaining (Page and Norris, 1998). Crucially, lower levels cycle over a sequence for each transition of the level above. This is indicated by the subgraphs enclosed in dashed boxes, which are ‘reused’ as higher levels unfold. It is this scheduling that endows the model with deep temporal structure. The probability distribution over initial states is now conditioned on the state (at the current time) of the level above. Practically, this means that D becomes a tensor, as opposed to a vector. The messages passed from the corresponding factor node rest on Bayesian model averages that require the expected policies [message ❶] and expected states under each policy. The resulting averages are then used to compose descending [message ❷] and ascending [message ❻] messages that mediate the exchange of empirical priors and posteriors between levels, respectively. Adapted with permission from (Friston et al., 2017c).
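The point that D becomes a tensor in a deep model can be made concrete with a short sketch (hypothetical sizes; the actual models are specified in SPM/MATLAB):

```python
import numpy as np

# A minimal sketch (hypothetical sizes): at a single level, D is a vector of
# probabilities over initial states; in a deep model, the initial-state
# distribution at level 1 is conditioned on the current state of level 2,
# so D becomes a matrix (more generally, a tensor) indexed by that state.
n_states_level1, n_states_level2 = 8, 3

D_shallow = np.ones(n_states_level1) / n_states_level1      # P(s_1)
D_deep = np.random.dirichlet(np.ones(n_states_level1),
                             size=n_states_level2).T         # P(s_1 | s_2)

s2 = 1                          # current higher-level state (index)
p_initial = D_deep[:, s2]       # empirical prior over level-1 initial states
assert np.isclose(p_initial.sum(), 1.0)
```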
Fig. 5
Factor graph for 20 questions: this schematic illustrates the message passing using a Forney style factor graph for the generative model in Fig. 2, using the format of Fig. 4. In this schematic, we have unpacked the hidden state factors, labelling those with multiple (policy-dependent) probability transition matrices in red. This graphic was produced automatically using the SPM software (please see software note).
Fig. 6
Behavioural responses: Each panel shows the posterior expectations of a synthetic subject after its question had been answered. The agent’s beliefs about shape (square versus triangle) and colour (green versus red) for the upper and lower locations are depicted with large icons. Where the agent has no particular (i.e., uniform) beliefs, the two shapes are displayed overlaid and/or in grey (e.g., upper locations in panels A and B); where the agent’s beliefs tend toward a particular colour, the shape is shaded slightly red or green. The true scene (with veridical colours and shapes) is shown with small icons to the right. The question is shown in black text (above each set of expectations), while the answer is shown below. All of the answers in this simulation are correct, so they are displayed in green text. The human icons and purple callouts are positioned next to the agent’s vocalisations, to illustrate whether the subject was asking questions (first four exchanges) or answering them (last two exchanges).
Fig. 7
Electrophysiological responses: This figure shows the simulated electrophysiological responses associated with the belief updating reported in Fig. 6. In this figure, we focus on beliefs about the colour of the lower object, which is represented at the higher level of the generative model; thus, these plots show simulated responses following each phrase (i.e., prompt, question, and answer) rather than following each word. The horizontal axes show time over the entire exchange, assuming each phrase lasts for 250 ms. Expectations about the hidden state encoding the colour of the lower object are presented in raster format in panel A, where black corresponds to maximum firing rates. Panel B shows the same data in a different format: here, pooled expectations (filtered between 4 and 32 Hz) are shown as a white line. This simulated local field potential is superimposed upon a time-frequency heat map to illustrate bursts of frequency-specific energy (white) during periods of belief updating. The underlying fluctuations in simulated neuronal activity, after band-pass filtering between 4 Hz and 32 Hz, are shown in panel C. Each of the coloured lines on this plot represents belief updating for a given unit (i.e., a row of panel A). Panel D shows simulated dopamine responses after each answer: these attenuate as uncertainty is resolved.
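The 4–32 Hz band-pass filtering used to derive local-field-potential-like traces from simulated expectations can be sketched as follows (a minimal illustration with an assumed sampling rate and a surrogate signal; the paper's simulations use SPM routines):

```python
import numpy as np
from scipy.signal import butter, filtfilt

# A minimal sketch (assumed parameters) of 4-32 Hz band-pass filtering.
fs = 1000.0                                   # assumed sampling rate (Hz)
b, a = butter(N=4, Wn=[4.0, 32.0], btype="bandpass", fs=fs)

t = np.arange(0, 2.0, 1.0 / fs)
raw = np.cumsum(np.random.randn(t.size))      # stand-in for pooled expectations
lfp = filtfilt(b, a, raw)                     # zero-phase band-passed "LFP"
```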
Fig. 8
Spectral responses and nested oscillations. This figure shows the spectral responses associated with the simulated electrophysiological responses in Fig. 7. Panel A is a reproduction of Fig. 7B. Panel B reports the spectral density of the six units (i.e., ‘red’ or ‘green’ for epochs 1, 2, and 3). Only three lines are visible because pairs of responses overlap perfectly. Note that the scale is expressed in terms of log power. The matrix in panel C shows the correlation between the magnitudes of responses over frequencies ranging from 4 to 32 Hz. These correlations are based on the time frequency response in panel A.
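The frequency-by-frequency correlation matrix in panel C can be reproduced in outline by correlating the magnitude of a time-frequency decomposition across frequencies. A minimal, self-contained sketch with assumed parameters and a surrogate signal:

```python
import numpy as np
from scipy.signal import butter, filtfilt, spectrogram

# A minimal sketch (assumed parameters) of correlating time-frequency
# magnitudes between frequencies, as in panel C.
fs = 1000.0
b, a = butter(4, [4.0, 32.0], btype="bandpass", fs=fs)
lfp = filtfilt(b, a, np.cumsum(np.random.randn(4000)))   # surrogate signal

f, t_spec, Sxx = spectrogram(lfp, fs=fs, nperseg=256, noverlap=192)
band = (f >= 4.0) & (f <= 32.0)
corr = np.corrcoef(Sxx[band])     # correlation of magnitudes over time,
                                  # one row/column per frequency in 4-32 Hz
```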
Fig. 9
Hierarchical message passing and nested oscillations. The upper panel illustrates responses at the second level, using the format of the upper panel of Fig. 7. Here, we focus on representations of the colour of the upper object, following each phrase, for the last three exchanges. At this point, the agent is fairly sure the upper object is green (as indicated by the darker shading for the ‘green’ unit in the upper panel). The middle panel shows the equivalent results for representations at the lower level, encoding the semantic adjective factor, which switches between green and red for the last three questions. The lower panel shows the band-pass filtered responses (between 4 and 32 Hz) to illustrate the nesting of simulated electrophysiological responses (solid lines: higher-level scenic responses; broken lines: lower-level semantic responses). Two responses have been highlighted for illustration, in red (higher level) and cyan (lower level). The nesting of (simulated) neuronal fluctuations is evident at a number of levels. First, bursts of activity are organised around periods of belief updating, when sensory evidence becomes available: periods of activity are evoked by auditory outcomes (words) at the lower level and, at the higher level, by evidence that speaks to the posterior expectations or representations. Second, there are transients at the onset of each word, which recur at a theta frequency. Each transient carries fast (e.g., gamma) frequency components. This means there is theta-gamma coupling, in the sense that the amplitude of gamma responses fluctuates at a theta frequency. Finally, note that the transients at the lower level (cyan line) are ‘sharper’ than the transients at the higher level (red line).
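Theta-gamma coupling of the kind described, where the gamma envelope fluctuates at theta frequency, can be quantified roughly as follows. This is a minimal sketch on a synthetic signal; the band edges and the phase-binning measure are our assumptions, not the paper's analysis:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0
t = np.arange(0, 4.0, 1.0 / fs)

# Synthetic signal with genuine phase-amplitude coupling, as a stand-in for
# the simulated responses: gamma amplitude tracks the theta cycle.
theta = np.sin(2 * np.pi * 6 * t)                          # ~6 Hz theta
gamma = 0.5 * (1 + theta) * np.sin(2 * np.pi * 50 * t)     # ~50 Hz gamma
signal = theta + gamma + 0.1 * np.random.randn(t.size)

def bandpass(x, lo, hi):
    b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)

phase = np.angle(hilbert(bandpass(signal, 4, 8)))          # theta phase
amplitude = np.abs(hilbert(bandpass(signal, 40, 60)))      # gamma envelope

# Simple coupling profile: mean gamma amplitude binned by theta phase.
bins = np.linspace(-np.pi, np.pi, 19)
which = np.digitize(phase, bins) - 1
profile = np.array([amplitude[which == k].mean() for k in range(18)])
```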
Fig. 10
Violation responses: This figure illustrates the neurophysiological simulation of a violation response, of the sort seen in response to a semantic violation or an unexpected sentence closure. We reproduced this paradigm by rerunning the fifth narrative but supplying the wrong answer at the end (see panel H). The left box (A–D) shows the standard responses when the correct answer is supplied (see panel D), using a similar format to Fig. 7. Here, the simulated unit firing of neurons that respond to the colour of the lower object (i.e., the scenic representation at the higher level) is shown in raster format (panel C). The population average or expected firing rate is used to simulate unit activity by sampling from a binomial distribution at each 16 ms time window. The average response magnitude and time-frequency response are shown in panel A for the three epochs (prompt, question, answer) of the fifth exchange. The simulated event-related potentials (i.e., expectations about the colour of the lower object, red or green, at the three epochs, band-pass filtered at 4–32 Hz) are shown in panel B. The right box (E–H) reproduces the same results after supplying the wrong answer (i.e., “No” instead of “Yes”), which induces protracted belief updating over a longer latency, as indicated by the blue arrow.
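The spike-sampling procedure described above (binomial draws in 16 ms windows, with the expected firing rate as the success probability) might look like this in outline; the single-draw (Bernoulli) choice is our assumption:

```python
import numpy as np

# A minimal sketch (assumed parameters): simulate unit activity by sampling
# spikes from a binomial distribution in 16 ms windows, with the expected
# firing rate (a probability) taken from the belief-updating simulation.
rng = np.random.default_rng(0)
window_ms = 16
n_windows = 250 // window_ms                       # one 250 ms epoch

expected_rate = np.linspace(0.2, 0.9, n_windows)   # stand-in for expectations
spikes = rng.binomial(n=1, p=expected_rate)        # one draw per window
```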
Fig. 11
Playing ‘Twenty Questions’ with a partner: These simulations use a similar format to Fig. 6; however, here there are two synthetic subjects. Their beliefs are displayed in separate columns within each panel, and the text is placed next to the agent who spoke the phrase. The second subject (purple icon, right column) has precise (i.e., confident) beliefs about the scene at hand: it believes there is a green square above a red square. In contrast, the first agent (green icon, left column) begins with imprecise beliefs and effectively inherits the beliefs of the confident subject by listening to the answers to the questions it asks. It is then able to answer the two questions asked by the other agent in the fifth and sixth narratives. The lower panels replicate the simulation, but here the less confident agent answers the questions.
Fig. 12
Storytelling: The result of an exchange between two synthetic agents, when the second agent (purple icon, right column) answered its own questions for the first four exchanges (panels A–D). For the fifth and sixth exchanges (panels E–F), the second agent asked the questions and the first agent (green icon, left column) answered. Here, the first agent had to rely upon the questions selected by the second agent to update its beliefs about the scene. This resulted in some residual ambiguity about the lower object (i.e., it is most likely a red triangle; it could be a red square; but it is probably not a green square). Nevertheless, the first subject was still able to answer the questions correctly.
Fig. 13
Folie à deux: The result of an exchange between two interlocutors (green and purple), who are both unsure about the scene they are discussing. The format of this figure follows that of the previous figures. The upper panels (A–F) show the questions and answers, which confess a lack of knowledge or certainty. Each agent’s posterior expectations about the scene are indicated by the coloured shapes. In this simulation, neither agent informs the other about the objects present in the scene, and so both remain in a state of mutually consistent ignorance. The lower panels (G–L) show the same simulation when the likelihood of an “I’m not sure” response was set to zero. This produces the folie à deux described in the main text. In brief, the ensuing belief updating starts from an unstable fixed point of uncertainty and converges onto a shared fantasy about what both agents (are confident they) believe.

References

    1. Adams R.A., Shipp S., Friston K.J. Predictions not commands: active inference in the motor system. Brain Struct. Funct. 2013;218:611–643. - PMC - PubMed
    2. Allwood J., Nivre J., Ahlsén E. On the semantics and pragmatics of linguistic feedback. J. Semant. 1992;9:1–26.
    3. Al-Muhaideb S., Menai M.E. Evolutionary computation approaches to the curriculum sequencing problem. Nat. Comput. 2011;10:891–920.
    4. Altmann G., Steedman M. Interaction with context during human sentence processing. 1988. - PubMed
    5. Arnal L.H., Giraud A.L. Cortical oscillations and sensory predictions. Trends Cogn. Sci. 2012;16:390–398. - PubMed
