Efficient Reinforcement Learning from Demonstration via Bayesian Network-Based Knowledge Extraction

Yichuan Zhang et al. Comput Intell Neurosci. 2021 Sep 24;2021:7588221. doi: 10.1155/2021/7588221. eCollection 2021.

Abstract

Reinforcement learning from demonstration (RLfD) is considered a promising approach to improving reinforcement learning (RL) by leveraging expert demonstrations as additional decision-making guidance. However, most existing RLfD methods regard demonstrations only as low-level knowledge instances for a particular task: demonstrations are generally used either to provide additional rewards or to pretrain the neural network-based RL policy in a supervised manner, which usually results in poor generalization capability and weak robustness. Considering that human knowledge is not only interpretable but also well suited to generalization, we propose to exploit the potential of demonstrations by extracting knowledge from them via Bayesian networks, and we develop a novel RLfD method called Reinforcement Learning from demonstration via Bayesian Network-based Knowledge (RLBNK). The proposed RLBNK method uses the node influence with the Wasserstein distance metric (NIW) algorithm to obtain abstract concepts from demonstrations; a Bayesian network then performs knowledge learning and inference on the abstracted data set, yielding a coarse policy with a corresponding confidence. Whenever the coarse policy's confidence is low, an RL-based refine module further optimizes and fine-tunes the policy to form a (near-)optimal hybrid policy. Experimental results show that the proposed RLBNK method improves the learning efficiency of the corresponding baseline RL algorithms under both normal and sparse reward settings. Furthermore, we demonstrate that RLBNK delivers better generalization capability and robustness than the baseline methods.
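
The core decision rule of the hybrid policy described above can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: bn_policy, rl_policy, and the confidence threshold of 0.8 are hypothetical placeholders standing in for the Bayesian-network knowledge module, the neural refine module, and the paper's switching criterion.

```python
# Minimal sketch of the confidence-based switch (not the authors' code).
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # assumed hyperparameter


def bn_policy(state):
    """Coarse policy: return (action, confidence) from the Bayesian network.

    Placeholder: a real implementation would query the learned network
    (see the extraction sketch after the Figure 4 caption below)."""
    action_probs = np.array([0.5, 0.5])         # dummy posterior over actions
    action = int(np.argmax(action_probs))
    return action, float(action_probs[action])  # confidence = posterior prob.


def rl_policy(state):
    """Refine module: e.g., a PPO or DQN policy network (placeholder)."""
    return np.random.randint(2)


def hybrid_policy(state):
    """RLBNK-switch style rule: trust the knowledge module only when its
    confidence is high; otherwise fall back to the RL-based refine policy."""
    action, confidence = bn_policy(state)
    if confidence >= CONFIDENCE_THRESHOLD:
        return action
    return rl_policy(state)
```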


Conflict of interest statement

The authors declare that they have no conflicts of interest.

Figures

Figure 1. The standard reinforcement learning setup.
Figure 2. Structure of the Bayesian network defined for the CartPole task.
Figure 3. An example of the NIW value calculation.
Figure 4. The probabilistic knowledge extraction process. The parameters of the Bayesian network are estimated from 𝒟_E^abstract, the data set abstracted from the original demonstration data set 𝒟_E via the NIW algorithm.
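
As a rough illustration of this extraction step, the sketch below substitutes plain equal-width binning for the NIW abstraction and assumes the pgmpy library for parameter estimation and inference. The feature names, bin count, and network structure (every binned state feature as a parent of a single action node) are assumptions made for illustration, not necessarily the structure shown in Figure 2.

```python
# Hedged sketch of knowledge extraction and inference (not the authors' code).
import numpy as np
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

FEATURES = ["cart_pos", "cart_vel", "pole_angle", "pole_vel"]  # assumed names


def abstract_demonstrations(states, actions, n_bins=3):
    """Map raw demonstration states to coarse bins.

    Equal-width binning is used here purely as a stand-in for the NIW
    abstraction described in the paper."""
    edges = np.linspace(states.min(axis=0), states.max(axis=0), n_bins + 1)
    cols = {name: np.digitize(states[:, j], edges[1:-1, j])
            for j, name in enumerate(FEATURES)}
    cols["action"] = actions
    return pd.DataFrame(cols)


# D_E: raw expert demonstrations (random placeholders, for illustration only).
states = np.random.uniform(-1.0, 1.0, size=(1000, 4))
actions = np.random.randint(0, 2, size=1000)
d_abstract = abstract_demonstrations(states, actions)   # plays the role of D_E^abstract

# Assumed structure: each binned state feature is a parent of the action node.
model = BayesianNetwork([(f, "action") for f in FEATURES])
model.fit(d_abstract, estimator=MaximumLikelihoodEstimator)

# Inference: posterior over actions for a binned state; the maximum posterior
# probability can serve as the decision confidence used by the hybrid policy.
infer = VariableElimination(model)
posterior = infer.query(variables=["action"],
                        evidence={name: 1 for name in FEATURES})
action = int(np.argmax(posterior.values))
confidence = float(posterior.values.max())
```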
Figure 5. Architecture of the RLBNK-switch method. The knowledge module, represented by the Bayesian network, is combined with the neural network-based refine module according to the decision confidence p over the current state.
Figure 6. The benchmark tasks used in this paper. The CartPole task is from the OpenAI Gym environment, and the Catcher and FlappyBird tasks are from the PLE environment. (a) CartPole. (b) Catcher. (c) FlappyBird.
Figure 7. Comparison of RLBNK-switch and RLBNK-concat with the baseline PPO, DQfD, the expert policy, and pure imitation learning under the normal reward setting. Plots show the training performance over the number of episodes. (a) CartPole. (b) Catcher. (c) FlappyBird.
Figure 8. Experimental results for the CartPole task under different sparse reward settings, where T denotes the interval at which the agent receives rewards. Plots show the training performance over the number of episodes. (a) T = 25. (b) T = 50. (c) T = 100.
Figure 9. Comparison of RLBNK-switch and RLBNK-concat with PPO-finetune, the baseline PPO, DQfD, and imitation learning in two generalization settings. Plots show the training performance over the number of episodes. (a) Pole length generalization. (b) Cart mass generalization.
Figure 10. The cumulative reward (mean ± standard deviation over 500 rollouts) of the RLBNK-switch and RLBNK-concat trained policies versus the trained baseline PPO policy when tested on the disturbed CartPole task. Plots show the performance of each policy over the disturbance strength Φ.
Algorithm 1. Probabilistic knowledge extraction via Bayesian networks.
Algorithm 2. Pseudocode of the RLBNK method.
