Evolving Robust Policy Coverage Sets in Multi-Objective Markov Decision Processes Through Intrinsically Motivated Self-Play
- PMID: 30356836
- PMCID: PMC6189603
- DOI: 10.3389/fnbot.2018.00065
Abstract
Many real-world decision-making problems involve multiple conflicting objectives that cannot be optimized simultaneously without compromise. Such problems, known as multi-objective Markov decision processes, pose a significant challenge for conventional single-objective reinforcement learning methods, especially when an optimal compromise cannot be determined beforehand. Multi-objective reinforcement learning methods address this challenge by finding an optimal coverage set of non-dominated policies that can satisfy any user preference for solving the problem. However, this comes at the cost of increased computational complexity and time consumption, as well as a lack of adaptability to non-stationary environment dynamics. Addressing these limitations requires adaptive methods that can solve the problem in an online and robust manner. In this paper, we propose a novel developmental method that exploits adversarial self-play between an intrinsically motivated preference exploration component and a policy coverage set optimization component, which robustly evolves a convex coverage set of policies that solve the problem using the preferences proposed by the former component. We experimentally demonstrate the effectiveness of the proposed method in comparison to state-of-the-art multi-objective reinforcement learning methods in stationary and non-stationary environments.
Keywords: Markov process; adversarial; decision making; intrinsic motivation; multi-objective optimization; reinforcement learning; self-play.
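The abstract centers on the notion of a convex coverage set (CCS): the subset of policies, each with a multi-objective value vector, that are optimal for at least one linear preference weighting. As a rough illustration of that concept only (not the paper's algorithm), the sketch below approximates CCS membership by sampling preference weights from the simplex and keeping every policy that wins under some sampled weighting; the function name, sampling scheme, and toy values are our own assumptions.

```python
import numpy as np

def approximate_ccs(value_vectors, n_weights=1000, seed=0):
    """Approximate the convex coverage set (CCS) of a set of policies.

    Each row of `value_vectors` is the multi-objective value vector of
    one candidate policy. A policy belongs to the CCS if it maximizes
    the linearly scalarized return w . V for some preference weight
    vector w on the probability simplex; here membership is estimated
    by random sampling of w rather than an exact method.
    """
    rng = np.random.default_rng(seed)
    vectors = np.asarray(value_vectors, dtype=float)
    # Sample preference weights uniformly from the simplex.
    weights = rng.dirichlet(np.ones(vectors.shape[1]), size=n_weights)
    # For each sampled preference, find the policy with the highest
    # scalarized value; the set of winners approximates the CCS.
    winners = np.unique(np.argmax(weights @ vectors.T, axis=1))
    return winners

# Example: three policies over two conflicting objectives.
values = [[1.0, 0.0],   # best on objective 1
          [0.0, 1.0],   # best on objective 2
          [0.4, 0.4]]   # never optimal under any linear preference
print(approximate_ccs(values))  # -> [0 1]; the third policy is excluded
```

In this toy case the mixed policy scores 0.4 under every weighting, while one of the two extreme policies always scores at least 0.5, so only the extremes enter the approximate CCS; exact constructions would instead test each vector with a small linear program.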
