Brain Inform. 2022 Apr 2;9(1):8. doi: 10.1186/s40708-022-00156-6.

Hierarchical intrinsically motivated agent planning behavior with dreaming in grid environments

Evgenii Dzhivelikian et al.

Abstract

Biologically plausible models of learning may provide crucial insights for building autonomous intelligent agents capable of performing a wide range of tasks. In this work, we propose a hierarchical model of an agent operating in an unfamiliar environment driven by a reinforcement signal. We use temporal memory to learn sparse distributed representations of state-action pairs and a basal ganglia model to learn an effective action policy at different levels of abstraction. The learned model of the environment is used to generate an intrinsic motivation signal, which drives the agent in the absence of an extrinsic signal, and to act in imagination, which we call dreaming. We demonstrate that the proposed architecture enables an agent to effectively reach goals in grid environments.
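To make the dreaming mechanism concrete, the following minimal sketch shows the general pattern of acting in imagination: value updates are driven by rollouts of a learned transition model instead of the environment. The names (`model`, `q_update`, `dream`) and the tabular Q-learner are illustrative stand-ins; the paper's actual components are a temporal memory and a basal ganglia model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a one-step transition/reward model filled from
# real experience, and a tabular Q-learner instead of the paper's BG model.
N_STATES, N_ACTIONS = 25, 4
Q = np.zeros((N_STATES, N_ACTIONS))
model = {}  # (s, a) -> (s_next, r)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def dream(start_state, n_rollouts=5, horizon=10):
    """Act in imagination: learn from rollouts of the model, not the env."""
    for _ in range(n_rollouts):
        s = start_state
        for _ in range(horizon):
            a = int(rng.integers(N_ACTIONS))
            if (s, a) not in model:    # no prediction for this pair yet
                break
            s_next, r = model[(s, a)]
            q_update(s, a, r, s_next)  # same update as in the waking phase
            s = s_next

model[(0, 1)] = (1, 0.0)  # e.g., a transition recorded in the real env
dream(start_state=0)
```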

Keywords: Hierarchical temporal memory; Intrinsic motivation; Model-based reinforcement learning; Sparse distributed representations.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Hierarchical temporal memory framework. A An HTM neuron. B A group of neurons organized into a minicolumn; neurons within a minicolumn share the same receptive field. C A group of minicolumns organized into a layer; minicolumns within a layer share the same feedforward input but may have different receptive fields
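The structure in Fig. 1 can be sketched in a few lines. The toy layer below uses illustrative sizes and a simple k-winners-take-all rule rather than the HTM Spatial Pooler's actual learning algorithm; it only shows how minicolumns share one feedforward input while each samples its own receptive field.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: every minicolumn sees the same feedforward input vector,
# but each samples its own receptive field over the input bits.
INPUT_BITS, N_COLUMNS = 128, 32
receptive_fields = [rng.choice(INPUT_BITS, size=16, replace=False)
                    for _ in range(N_COLUMNS)]

def active_columns(sdr, k=4):
    """k-winners-take-all over the columns' feedforward overlaps."""
    overlaps = np.array([sdr[rf].sum() for rf in receptive_fields])
    return np.sort(np.argsort(overlaps)[-k:])

x = np.zeros(INPUT_BITS, dtype=int)
x[rng.choice(INPUT_BITS, size=20, replace=False)] = 1  # a sparse input SDR
print(active_columns(x))  # indices of the k winning minicolumns
```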
Fig. 2
The scheme of the selection circuit. Blocks represent the corresponding biological structures: GPi, the internal segment of the globus pallidus; GPe, the external segment of the globus pallidus; D1 and D2, the dopamine receptors of striatal projection neurons. Triangle arrows denote excitatory connections; circle arrows, inhibitory connections; the double-triangle arrow, dopaminergic connections
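A toy rate model conveys the gist of the circuit in Fig. 2: the D1 (direct) pathway facilitates a channel, the D2 (indirect) pathway suppresses it via GPe, and the channel with the least GPi output is disinhibited at the thalamus and selected. The weights and dopamine term below are illustrative assumptions, not the paper's equations.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 4
salience = rng.random(N_ACTIONS)  # cortico-striatal input, one channel per action
dopamine = 0.5                    # tonic dopamine level (illustrative)

d1 = salience * (1 + dopamine)    # direct pathway, facilitated by dopamine
d2 = salience * (1 - dopamine)    # indirect pathway, suppressed by dopamine
gpe = -d2                         # GPe activity falls under D2 striatal inhibition
gpi = -d1 - gpe                   # GPi: inhibited by D1, disinhibited as GPe falls
action = int(np.argmin(gpi))      # least GPi output -> disinhibited channel wins
```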
Fig. 3
HIMA with a two-level hierarchy and Block 2 as the output block
Fig. 4
The scheme of visit statistics evaluation. After n TM prediction steps, the vector of column visits ν_n is obtained. Masking this vector with the cluster representations gives ν̂_n, which holds the number of visits to each cluster within n steps of s_t
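A minimal sketch of the masking step in Fig. 4, assuming a cluster's visit count simply aggregates the visits of its member columns (the paper's exact aggregation statistic may differ, and all values here are made up):

```python
import numpy as np

N_COLUMNS = 12
nu_n = np.array([3, 0, 1, 2, 0, 0, 4, 1, 0, 2, 0, 1])  # column visits, n steps

# Hypothetical cluster representations as boolean masks over columns.
clusters = np.array([
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # cluster 0
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # cluster 1
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],  # cluster 2
], dtype=bool)

# Masking nu_n with each cluster's representation yields nu_hat_n.
nu_hat_n = np.array([nu_n[c].sum() for c in clusters])
print(nu_hat_n)  # per-cluster visit estimates within n steps of s_t
```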
Fig. 5
The scheme of the Pattern Memory update process. The Spatial Pooler (SP) encodes raw input data into a state representation s_t of dimension k_in. Clusters is the set of c cluster-representation SDRs f together with their densities χ_f. The main idea is to compare s_t with each f and associate s_t with the best-matching cluster
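The following sketch illustrates the Pattern Memory update in Fig. 5 under simple assumptions: overlap similarity between s_t and a binarized cluster density χ_f, a hypothetical match threshold, and a running-average density update. It is illustrative, not the paper's exact procedure.

```python
import numpy as np

K_IN = 64        # dimension of the SP output s_t
THRESHOLD = 0.5  # hypothetical overlap threshold for a match

clusters = []    # list of (chi_f, count): density vector and member count

def update_pattern_memory(s_t):
    """Associate the state SDR s_t with the closest cluster, or start one."""
    best, best_sim = None, 0.0
    for i, (chi_f, n) in enumerate(clusters):
        f = chi_f > 0.5                            # binarize density to an SDR
        sim = (s_t & f).sum() / max(s_t.sum(), 1)  # overlap similarity
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= THRESHOLD:
        chi_f, n = clusters[best]                  # running average of members
        clusters[best] = ((chi_f * n + s_t) / (n + 1), n + 1)
    else:
        clusters.append((s_t.astype(float), 1))

rng = np.random.default_rng(0)
for _ in range(5):
    s = np.zeros(K_IN, dtype=bool)
    s[rng.choice(K_IN, size=8, replace=False)] = True
    update_pattern_memory(s)
print(len(clusters))
```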
Fig. 6
An example of an observation and its binary representation. An observation has several channels; each channel is a binary mask marking the positions of the corresponding objects
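Encoding an observation as per-channel binary masks (Fig. 6) is straightforward; the channel names and grid layout below are illustrative:

```python
import numpy as np

H, W = 5, 5
CHANNELS = ("agent", "goal", "obstacle")  # illustrative channel names

def encode(grid):
    """grid: HxW array of object ids (-1 = empty) -> flat binary vector."""
    masks = np.stack([(grid == c) for c in range(len(CHANNELS))])
    return masks.astype(np.int8).ravel()  # one binary mask per channel

grid = np.full((H, W), -1)
grid[0, 0], grid[4, 4], grid[2, 2] = 0, 1, 2  # place agent, goal, obstacle
print(encode(grid).sum())  # 3 active bits, one per placed object
```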
Fig. 7
Comparison of agents with abstract and elementary actions on the crossing corridors maze
Fig. 8
Examples of environments. Yellow: the set of initial agent positions. Green: the initial goal position. Dark blue: obstacles. Shades of light blue: floor colors
Fig. 9
Comparison of agents with abstract and elementary actions on the four-room maze with a restricted set of initial agent positions
Fig. 10
Comparison of agents with abstract and elementary actions on the four-room maze without restrictions on the agent’s initial state set
Fig. 11
Ideal empowerment fields. All values lie within the same range and can be compared with each other. Darker colors indicate lower values; lighter colors, higher values. Walls are not shown
Fig. 12
Clusters in the four-room maze. Left: the mapping between cluster indices and the corresponding states in the environment. Right: the similarity matrix, whose rows and columns are cluster indices; the similarity value is shown by color
Fig. 13
Ideal and estimated empowerment fields on the same value scale. Ideal: uses the true transition model. Restricted ideal: the same, but transitions by different actions into the same state are counted as a single path. TM mode: TM with the mode statistic for visit estimation. TM median: the same with the median statistic. Walls are not shown
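For deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states reachable by n-step action sequences, which is the quantity the ideal fields visualize. A brute-force sketch over a hypothetical transition table T:

```python
import numpy as np
from itertools import product

def empowerment(T, s, n):
    """log2 of the number of distinct states reachable by n-step sequences."""
    n_actions = len(T[s])
    reachable = set()
    for seq in product(range(n_actions), repeat=n):
        cur = s
        for a in seq:
            cur = T[cur][a]
        reachable.add(cur)
    return np.log2(len(reachable))

# A 4-state ring world: action 0 stays, action 1 steps forward.
T = {s: {0: s, 1: (s + 1) % 4} for s in range(4)}
print(empowerment(T, 0, 2))  # log2(3): states {0, 1, 2} reachable in two steps
```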
Fig. 14
Comparison of different dreaming switching strategies in the four-room experiments with fixed positions. Left: the TD-error-based switching strategy (green) does not improve on the baseline without dreaming (red). Right: anomaly-based dreaming (red) shows a significant improvement over the baseline without dreaming (green). It performs similarly to the baseline with a 50% increased learning rate (light blue) and converges twice as fast as the baseline with a 25% reduced learning rate (blue; results are compressed 2x along the x-axis)
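The anomaly-based switching strategy can be sketched as follows: dream only when the TM's prediction anomaly for the current state is low, i.e., where the learned model can be trusted. The threshold and probability schedule below are illustrative assumptions:

```python
import numpy as np

ANOMALY_MAX = 0.2  # hypothetical threshold; above it the model is untrusted

def maybe_dream(anomaly, rng):
    """anomaly: fraction of active columns the TM failed to predict."""
    p = max(0.0, 1.0 - anomaly / ANOMALY_MAX)  # low anomaly -> high chance
    return rng.random() < p

rng = np.random.default_rng(0)
print(maybe_dream(0.05, rng), maybe_dream(0.5, rng))  # likely True, then False
```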
Fig. 15
Examples of tasks at different levels. Yellow: the initial agent position. Green: the initial goal position. Dark blue: obstacles. Shades of light blue: floor colors
Fig. 16
Comparison of agents with abstract and elementary actions in the exhaustible resource experiment
Fig. 17
Comparison of agents with abstract and elementary actions in the exhaustible resource experiment on tasks from the hard set
Fig. 18
Examples of four options used during the exhaustible resource experiment. The heat map visualizes the number of times a transition to a state was predicted during the execution of the corresponding option. Each option has two small heat maps: I, the probability of initiating the option in the corresponding state, and β, the probability of terminating it
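The options in Fig. 18 follow the standard options framework: an initiation probability I(s), an internal policy, and a termination probability β(s). A minimal sketch with random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 25, 4

class Option:
    """An option: initiation probability I(s), a policy, termination beta(s)."""
    def __init__(self):
        self.I = rng.random(N_STATES)                       # P(can start in s)
        self.beta = rng.random(N_STATES)                    # P(terminate in s)
        self.policy = rng.integers(N_ACTIONS, size=N_STATES)

    def run(self, env_step, s):
        """Execute the internal policy until the option terminates."""
        while True:
            s = env_step(s, int(self.policy[s]))
            if rng.random() < self.beta[s]:
                return s

opt = Option()
print(opt.run(lambda s, a: (s + a + 1) % N_STATES, 0))  # toy dynamics
```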
Fig. 19
Comparison of agents with different intrinsic modulation signals on the exhaustible resource task. Baseline: the agent without intrinsic modulation. Agents with the prefix "positive" receive an intrinsic signal from the [0, 1] interval; agents with the prefix "negative", from the [-1, 0] interval. Anomaly is the simple TM prediction error. Random is a value drawn from a uniform distribution. Empowerment is the ideal four-step empowerment
Fig. 20
Comparison of agents with different intrinsic modulation settings in the exhaustible resource experiment. Baseline: the agent without intrinsic modulation. Negative-empowerment: the agent whose intrinsic reward is the ideal four-step empowerment with values shifted to [-1, 0]. Positive-empowerment: the same, but with the intrinsic reward shifted to [0, 1]. Zero-const: the same, but with the intrinsic reward equal to zero
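The "positive" and "negative" settings differ only in where the normalized intrinsic signal is shifted. A sketch, assuming the modulated reward simply adds the shifted intrinsic term to the extrinsic one (the paper's combination rule may differ):

```python
import numpy as np

def shift(values, mode):
    """Normalize to [0, 1]; for the 'negative' setting shift to [-1, 0]."""
    v = np.asarray(values, dtype=float)
    v = (v - v.min()) / (v.max() - v.min())
    return v if mode == "positive" else v - 1.0

emp = np.array([1.58, 2.0, 0.0, 1.0])  # e.g., ideal 4-step empowerment values
r_int_pos, r_int_neg = shift(emp, "positive"), shift(emp, "negative")
# Assumed additive modulation of the extrinsic reward (illustrative):
r_total = 1.0 + r_int_neg
print(r_int_pos, r_int_neg, r_total)
```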
Fig. 21
Comparison of an agent with dreaming enabled (dreamer) against the baseline without dreaming in the exhaustible resource experiment
Fig. 22
Comparison of the full-featured HIMA with a BGT-only baseline
Fig. 23
Comparison of the full-featured HIMA with a BGT-only baseline and with HIMA variants each lacking one of the components, in the exhaustible resource experiment
Fig. 24
Comparison of the baseline (yellow) and full-featured (red) HIMA with DeepRL baselines: DQN (light blue) and Option-Critic (blue)


