Brain Inform. 2022 Apr 2;9(1):8. doi: 10.1186/s40708-022-00156-6.

Hierarchical intrinsically motivated agent planning behavior with dreaming in grid environments

Evgenii Dzhivelikian et al.

Abstract

Biologically plausible models of learning may provide crucial insights for building autonomous intelligent agents capable of performing a wide range of tasks. In this work, we propose a hierarchical model of an agent operating in an unfamiliar environment driven by a reinforcement signal. We use temporal memory to learn sparse distributed representations of state-action pairs and a basal ganglia model to learn an effective action policy at different levels of abstraction. The learned model of the environment is used to generate an intrinsic motivation signal, which drives the agent in the absence of an extrinsic signal, and to act in imagination, which we call dreaming. We demonstrate that the proposed architecture enables an agent to effectively reach goals in grid environments.
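To make the dreaming mechanism concrete, the following minimal sketch shows the general pattern of acting in imagination: value updates are driven by rollouts of a learned transition model instead of the environment. The names (`model`, `q_update`, `dream`) and the tabular Q-learner are illustrative stand-ins; the paper's actual components are a temporal memory and a basal ganglia model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a one-step transition/reward model filled from
# real experience, and a tabular Q-learner instead of the paper's BG model.
N_STATES, N_ACTIONS = 25, 4
Q = np.zeros((N_STATES, N_ACTIONS))
model = {}  # (s, a) -> (s_next, r)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def dream(start_state, n_rollouts=5, horizon=10):
    """Act in imagination: learn from rollouts of the model, not the env."""
    for _ in range(n_rollouts):
        s = start_state
        for _ in range(horizon):
            a = int(rng.integers(N_ACTIONS))
            if (s, a) not in model:    # no prediction for this pair yet
                break
            s_next, r = model[(s, a)]
            q_update(s, a, r, s_next)  # same update as in the waking phase
            s = s_next

model[(0, 1)] = (1, 0.0)  # e.g., a transition recorded in the real env
dream(start_state=0)
```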

Keywords: Hierarchical temporal memory; Intrinsic motivation; Model-based reinforcement learning; Sparse distributed representations.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Hierarchical temporal memory framework. A An HTM neuron. B A group of neurons organized into a minicolumn; neurons within a minicolumn share the same receptive field. C A group of minicolumns organized into a layer; minicolumns within a layer share the same feedforward input but may have different receptive fields
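The structure in Fig. 1 can be sketched in a few lines. The toy layer below uses illustrative sizes and a simple k-winners-take-all rule rather than the HTM Spatial Pooler's actual learning algorithm; it only shows how minicolumns share one feedforward input while each samples its own receptive field.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: every minicolumn sees the same feedforward input vector,
# but each samples its own receptive field over the input bits.
INPUT_BITS, N_COLUMNS = 128, 32
receptive_fields = [rng.choice(INPUT_BITS, size=16, replace=False)
                    for _ in range(N_COLUMNS)]

def active_columns(sdr, k=4):
    """k-winners-take-all over the columns' feedforward overlaps."""
    overlaps = np.array([sdr[rf].sum() for rf in receptive_fields])
    return np.sort(np.argsort(overlaps)[-k:])

x = np.zeros(INPUT_BITS, dtype=int)
x[rng.choice(INPUT_BITS, size=20, replace=False)] = 1  # a sparse input SDR
print(active_columns(x))  # indices of the k winning minicolumns
```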
Fig. 2
The scheme of the selection circuit. Blocks represent the corresponding biological structures: GPi, the internal segment of the globus pallidus; GPe, the external segment of the globus pallidus; D1 and D2, the dopamine receptors of striatal projection neurons. Triangle arrows denote excitatory connections; circle arrows, inhibitory connections; the double-triangle arrow, dopaminergic connections
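A toy rate model conveys the gist of the circuit in Fig. 2: the D1 (direct) pathway facilitates a channel, the D2 (indirect) pathway suppresses it via GPe, and the channel with the least GPi output is disinhibited at the thalamus and selected. The weights and dopamine term below are illustrative assumptions, not the paper's equations.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 4
salience = rng.random(N_ACTIONS)  # cortico-striatal input, one channel per action
dopamine = 0.5                    # tonic dopamine level (illustrative)

d1 = salience * (1 + dopamine)    # direct pathway, facilitated by dopamine
d2 = salience * (1 - dopamine)    # indirect pathway, suppressed by dopamine
gpe = -d2                         # GPe activity falls under D2 striatal inhibition
gpi = -d1 - gpe                   # GPi: inhibited by D1, disinhibited as GPe falls
action = int(np.argmin(gpi))      # least GPi output -> disinhibited channel wins
```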
Fig. 3
HIMA with a two-level hierarchy and Block 2 as the output block
Fig. 4
The scheme of visit statistics evaluation. After n TM prediction steps, the vector of column visits ν_n is obtained. Masking this vector with the cluster representations gives ν̂_n, which holds the number of visits to each cluster within n steps of s_t
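A minimal sketch of the masking step in Fig. 4, assuming a cluster's visit count simply aggregates the visits of its member columns (the paper's exact aggregation statistic may differ, and all values here are made up):

```python
import numpy as np

N_COLUMNS = 12
nu_n = np.array([3, 0, 1, 2, 0, 0, 4, 1, 0, 2, 0, 1])  # column visits, n steps

# Hypothetical cluster representations as boolean masks over columns.
clusters = np.array([
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # cluster 0
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # cluster 1
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],  # cluster 2
], dtype=bool)

# Masking nu_n with each cluster's representation yields nu_hat_n.
nu_hat_n = np.array([nu_n[c].sum() for c in clusters])
print(nu_hat_n)  # per-cluster visit estimates within n steps of s_t
```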
Fig. 5
The scheme of the Pattern Memory update process. The Spatial Pooler (SP) encodes raw input data into a state representation s_t of dimension k_in. Clusters is the set of c cluster-representation SDRs f together with their densities χ_f. The main idea is to compare s_t with each f and associate s_t with the best-matching cluster
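The following sketch illustrates the Pattern Memory update in Fig. 5 under simple assumptions: overlap similarity between s_t and a binarized cluster density χ_f, a hypothetical match threshold, and a running-average density update. It is illustrative, not the paper's exact procedure.

```python
import numpy as np

K_IN = 64        # dimension of the SP output s_t
THRESHOLD = 0.5  # hypothetical overlap threshold for a match

clusters = []    # list of (chi_f, count): density vector and member count

def update_pattern_memory(s_t):
    """Associate the state SDR s_t with the closest cluster, or start one."""
    best, best_sim = None, 0.0
    for i, (chi_f, n) in enumerate(clusters):
        f = chi_f > 0.5                            # binarize density to an SDR
        sim = (s_t & f).sum() / max(s_t.sum(), 1)  # overlap similarity
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= THRESHOLD:
        chi_f, n = clusters[best]                  # running average of members
        clusters[best] = ((chi_f * n + s_t) / (n + 1), n + 1)
    else:
        clusters.append((s_t.astype(float), 1))

rng = np.random.default_rng(0)
for _ in range(5):
    s = np.zeros(K_IN, dtype=bool)
    s[rng.choice(K_IN, size=8, replace=False)] = True
    update_pattern_memory(s)
print(len(clusters))
```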
Fig. 6
An example of an observation and its binary representation. An observation has several channels; each channel is a binary mask marking the positions of the corresponding objects
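Encoding an observation as per-channel binary masks (Fig. 6) is straightforward; the channel names and grid layout below are illustrative:

```python
import numpy as np

H, W = 5, 5
CHANNELS = ("agent", "goal", "obstacle")  # illustrative channel names

def encode(grid):
    """grid: HxW array of object ids (-1 = empty) -> flat binary vector."""
    masks = np.stack([(grid == c) for c in range(len(CHANNELS))])
    return masks.astype(np.int8).ravel()  # one binary mask per channel

grid = np.full((H, W), -1)
grid[0, 0], grid[4, 4], grid[2, 2] = 0, 1, 2  # place agent, goal, obstacle
print(encode(grid).sum())  # 3 active bits, one per placed object
```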
Fig. 7
Comparison of agents with abstract and elementary actions on the crossing corridors maze
Fig. 8
Examples of environments. Yellow: the set of initial agent positions. Green: the initial goal position. Dark blue: obstacles. Shades of light blue: floor colors
Fig. 9
Comparison of agents with abstract and elementary actions on the four-room maze with a restricted set of initial agent positions
Fig. 10
Comparison of agents with abstract and elementary actions on the four-room maze without restrictions on the agent’s initial state set
Fig. 11
Ideal empowerment fields. All values lie within the same range and can be compared with each other. Darker colors indicate lower values; lighter colors, higher values. Walls are not shown
Fig. 12
Clusters in the four-room maze. Left: the mapping between cluster indices and the corresponding states in the environment. Right: the similarity matrix, whose rows and columns are cluster indices; the similarity value is shown by color
Fig. 13
Ideal and estimated empowerment fields on the same value scale. Ideal: uses the true transition model. Restricted ideal: the same, but transitions by different actions into the same state are counted as a single path. TM mode: TM with the mode statistic for visit estimation. TM median: the same with the median statistic. Walls are not shown
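For deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states reachable by n-step action sequences, which is the quantity the ideal fields visualize. A brute-force sketch over a hypothetical transition table T:

```python
import numpy as np
from itertools import product

def empowerment(T, s, n):
    """log2 of the number of distinct states reachable by n-step sequences."""
    n_actions = len(T[s])
    reachable = set()
    for seq in product(range(n_actions), repeat=n):
        cur = s
        for a in seq:
            cur = T[cur][a]
        reachable.add(cur)
    return np.log2(len(reachable))

# A 4-state ring world: action 0 stays, action 1 steps forward.
T = {s: {0: s, 1: (s + 1) % 4} for s in range(4)}
print(empowerment(T, 0, 2))  # log2(3): states {0, 1, 2} reachable in two steps
```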
Fig. 14
Comparison of different dreaming switching strategies in the four-room experiments with fixed positions. Left: the TD-error-based switching strategy (green) does not improve on the baseline without dreaming (red). Right: anomaly-based dreaming (red) shows a significant improvement over the baseline without dreaming (green). It performs similarly to the baseline with a 50% increased learning rate (light blue) and converges twice as fast as the baseline with a 25% reduced learning rate (blue; results are compressed 2x along the x-axis)
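The anomaly-based switching strategy can be sketched as follows: dream only when the TM's prediction anomaly for the current state is low, i.e., where the learned model can be trusted. The threshold and probability schedule below are illustrative assumptions:

```python
import numpy as np

ANOMALY_MAX = 0.2  # hypothetical threshold; above it the model is untrusted

def maybe_dream(anomaly, rng):
    """anomaly: fraction of active columns the TM failed to predict."""
    p = max(0.0, 1.0 - anomaly / ANOMALY_MAX)  # low anomaly -> high chance
    return rng.random() < p

rng = np.random.default_rng(0)
print(maybe_dream(0.05, rng), maybe_dream(0.5, rng))  # likely True, then False
```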
Fig. 15
Examples of tasks at different levels. Yellow: the initial agent position. Green: the initial goal position. Dark blue: obstacles. Shades of light blue: floor colors
Fig. 16
Comparison of agents with abstract and elementary actions in the exhaustible resource experiment
Fig. 17
Comparison of agents with abstract and elementary actions in the exhaustible resource experiment on tasks from the hard set
Fig. 18
Examples of four options used during the exhaustible resource experiment. The heat map visualizes the number of times a transition to a state was predicted during the execution of the corresponding option. Each option has two small heat maps: I, the probability of initiating the option in the corresponding state, and β, the probability of terminating it
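The options in Fig. 18 follow the standard options framework: an initiation probability I(s), an internal policy, and a termination probability β(s). A minimal sketch with random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 25, 4

class Option:
    """An option: initiation probability I(s), a policy, termination beta(s)."""
    def __init__(self):
        self.I = rng.random(N_STATES)                       # P(can start in s)
        self.beta = rng.random(N_STATES)                    # P(terminate in s)
        self.policy = rng.integers(N_ACTIONS, size=N_STATES)

    def run(self, env_step, s):
        """Execute the internal policy until the option terminates."""
        while True:
            s = env_step(s, int(self.policy[s]))
            if rng.random() < self.beta[s]:
                return s

opt = Option()
print(opt.run(lambda s, a: (s + a + 1) % N_STATES, 0))  # toy dynamics
```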
Fig. 19
Comparison of agents with different intrinsic modulation signals on the exhaustible resource task. Baseline: the agent without intrinsic modulation. Agents with the prefix "positive" receive an intrinsic signal from the [0, 1] interval; agents with the prefix "negative", from the [-1, 0] interval. Anomaly is the simple TM prediction error. Random is a value drawn from a uniform distribution. Empowerment is the ideal four-step empowerment
Fig. 20
Comparison of agents with different intrinsic modulation settings in the exhaustible resource experiment. Baseline: the agent without intrinsic modulation. Negative-empowerment: the agent whose intrinsic reward is the ideal four-step empowerment with values shifted to [-1, 0]. Positive-empowerment: the same, but with the intrinsic reward shifted to [0, 1]. Zero-const: the same, but with the intrinsic reward equal to zero
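The "positive" and "negative" settings differ only in where the normalized intrinsic signal is shifted. A sketch, assuming the modulated reward simply adds the shifted intrinsic term to the extrinsic one (the paper's combination rule may differ):

```python
import numpy as np

def shift(values, mode):
    """Normalize to [0, 1]; for the 'negative' setting shift to [-1, 0]."""
    v = np.asarray(values, dtype=float)
    v = (v - v.min()) / (v.max() - v.min())
    return v if mode == "positive" else v - 1.0

emp = np.array([1.58, 2.0, 0.0, 1.0])  # e.g., ideal 4-step empowerment values
r_int_pos, r_int_neg = shift(emp, "positive"), shift(emp, "negative")
# Assumed additive modulation of the extrinsic reward (illustrative):
r_total = 1.0 + r_int_neg
print(r_int_pos, r_int_neg, r_total)
```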
Fig. 21
Comparison of an agent with dreaming enabled (dreamer) against the baseline without dreaming in the exhaustible resource experiment
Fig. 22
Comparison of the full-featured HIMA with a BGT-only baseline
Fig. 23
Comparison of the full-featured HIMA with a BGT-only baseline and with HIMA variants each lacking one of the components, in the exhaustible resource experiment
Fig. 24
Comparison of the baseline (yellow) and full-featured (red) HIMA with DeepRL baselines: DQN (light blue) and Option-Critic (blue)


