Evolving Robust Policy Coverage Sets in Multi-Objective Markov Decision Processes Through Intrinsically Motivated Self-Play

Sherif Abdelfattah et al.

Front Neurorobot. 2018 Oct 9;12:65. doi: 10.3389/fnbot.2018.00065. eCollection 2018.
Abstract

Many real-world decision-making problems involve multiple conflicting objectives that cannot be optimized simultaneously without a compromise. Such problems are known as multi-objective Markov decision processes (MOMDPs), and they pose a significant challenge for conventional single-objective reinforcement learning methods, especially when an optimal compromise cannot be determined beforehand. Multi-objective reinforcement learning methods address this challenge by finding an optimal coverage set of non-dominated policies that can satisfy any user preference for the problem. However, this comes at the cost of computational complexity, long running times, and a lack of adaptability to non-stationary environment dynamics. Addressing these limitations requires adaptive methods that can solve the problem online and robustly. In this paper, we propose a novel developmental method that uses adversarial self-play between an intrinsically motivated preference exploration component and a policy coverage set optimization component, which robustly evolves a convex coverage set of policies for the problem using the preferences proposed by the former component. We experimentally demonstrate the effectiveness of the proposed method in comparison with state-of-the-art multi-objective reinforcement learning methods in stationary and non-stationary environments.

Keywords: Markov process; adversarial; decision making; intrinsic motivation; multi-objective optimization; reinforcement learning; self-play.
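
For readers scanning the abstract, a minimal Python sketch of the linear-scalarization view that convex-coverage-set methods rely on may help: a stored set of policies answers an arbitrary user preference by returning the policy whose value vector scores highest under that preference. The policy names and value vectors below are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Illustrative value vectors V^pi for a few candidate policies over two
# objectives (e.g., time cost and treasure value); the numbers are made up.
policy_values = {
    "pi_shallow": np.array([-1.0, 1.0]),
    "pi_medium": np.array([-7.0, 24.0]),
    "pi_deep": np.array([-19.0, 124.0]),
}

def scalarize(value_vec, weights):
    """Linear scalarization: collapse a multi-objective value vector into a
    single scalar using a user preference weight vector."""
    return float(np.dot(weights, value_vec))

def best_for_preference(policy_values, weights):
    """A convex coverage set serves any linear preference by returning the
    stored policy with the highest scalarized value."""
    return max(policy_values, key=lambda p: scalarize(policy_values[p], weights))

# A user who cares mostly about the second objective.
w = np.array([0.1, 0.9])
print(best_for_preference(policy_values, w))  # -> 'pi_deep'
```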


Figures

Figure 1. A block diagram for a multiple policy MORL approach for solving MOMDPs.
Figure 2. The solution space of a two-objective problem. The red circles represent the set of non-dominated solutions known as the Pareto front.
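
As a rough illustration of the non-dominated filtering this caption refers to, the sketch below keeps only the Pareto-optimal points of a two-objective maximization problem; the solution values are invented.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated points, assuming both objectives are maximized.
    A point is dominated if some other point is at least as good in every
    objective and strictly better in at least one."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(p)
    return np.array(keep)

# Invented two-objective solution values.
solutions = [(1, 9), (3, 7), (4, 4), (2, 6), (5, 5), (6, 1)]
print(pareto_front(solutions))  # keeps (1, 9), (3, 7), (5, 5), (6, 1)
```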
Figure 3. Graphical representation of the convex hull concept in comparison to the Pareto front using a two-objective example. (A) Pareto front surface represented by solid and dotted lines vs. convex hull surface represented only by solid lines. (B) Convex hull surface in the weight space represented by the bold lines.
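
A hedged sketch of the distinction drawn in this figure: a convex coverage set keeps only those Pareto-front points that are optimal under some linear weighting of the objectives, so points lying in concave regions of the front (the dotted segments) are never selected. The values and the weight-sweep resolution below are illustrative assumptions.

```python
import numpy as np

def convex_coverage_set(points, n_weights=101):
    """Approximate the convex-hull subset of a two-objective Pareto front by
    keeping only the points that maximize w * f1 + (1 - w) * f2 for some
    weight w in [0, 1]."""
    points = np.asarray(points, dtype=float)
    selected = set()
    for w in np.linspace(0.0, 1.0, n_weights):
        weights = np.array([w, 1.0 - w])
        selected.add(int(np.argmax(points @ weights)))
    return points[sorted(selected)]

# Invented non-dominated values; (2.5, 7.0) lies in a concave region of the
# front, so no linear weighting ever selects it.
front = [(1.0, 9.0), (2.5, 7.0), (5.0, 5.0), (6.0, 1.0)]
print(convex_coverage_set(front))  # keeps (1, 9), (5, 5), (6, 1)
```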
Figure 4. Markov decision process (MDP) in comparison to multi-objective Markov decision process (MOMDP). (A) Markov decision process (MDP). (B) Multi-objective Markov decision process (MOMDP).
Figure 5. Intrinsically motivated multi-objective reinforcement learning (IM-MORL) design scenarios. (A) The conventional MORL approach. (B) IM-MORL design approach guided by the user's preference. (C) IM-MORL design approach for learning the feasible preferences over multiple extrinsic rewards. (D) IM-MORL design approach for learning both internal goals and preferences.
Figure 6. The division of the linearly scalarized preference space into a finite set of regions based on the combination of fuzzy membership values of the weight components.
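
The sketch below shows one way such a fuzzy partition of the weight space could be realized, using assumed triangular membership functions ('low', 'medium', 'high') per weight component; the membership functions actually used by the RFPB algorithm are not given in this caption, so treat these as placeholders.

```python
def memberships(w):
    """Assumed triangular fuzzy sets over a single weight component in [0, 1]."""
    return {
        "low": max(0.0, 1.0 - 2.0 * w),                # 1 at w=0, fades out by w=0.5
        "medium": max(0.0, 1.0 - 4.0 * abs(w - 0.5)),  # peaks at w=0.5
        "high": max(0.0, 2.0 * w - 1.0),               # 0 up to w=0.5, 1 at w=1
    }

def preference_region(weights):
    """Map a preference weight vector to a discrete region label by taking the
    strongest fuzzy set of each component (the partitioning idea of Figure 6)."""
    labels = []
    for w in weights:
        m = memberships(w)
        labels.append(max(m, key=m.get))
    return tuple(labels)

# A preference that strongly favors the second of two objectives.
print(preference_region([0.2, 0.8]))  # -> ('low', 'high')
```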
Figure 7. A flowchart describing the RFPB algorithm workflow.
Figure 8. A block diagram for the working mechanism of the proposed method.
Figure 9. Layouts of the experimental environments. (A) The search and rescue (SAR) environment. (B) The deep sea treasure (DST) environment. (C) The resource gathering (RG) environment.
Figure 10. Comparing the IM-MORL agent with the RM-MORL agent in terms of reward prediction error averaged over 15 runs, to assess the impact of intrinsically motivated preference exploration. (A) The search and rescue (SAR) environment. (B) The deep sea treasure (DST) environment. (C) The resource gathering (RG) environment.
Figure 11. Comparing the median reward values for each user preference averaged over 15 runs with standard deviation bars in the stationary environments. (A) The search and rescue (SAR) environment. (B) The deep sea treasure (DST) environment. (C) The resource gathering (RG) environment.
Figure 12. A bar chart comparing the normalized average hypervolume values with standard deviation for the OLS, TLO, and IM-MORL agents, grouped by stationary environment.
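
For context, the hypervolume indicator reported in this figure measures, in the two-objective case, the area dominated by a solution set relative to a reference point. A minimal maximization-case sketch with invented values:

```python
def hypervolume_2d(front, reference):
    """Hypervolume of a two-objective maximization front: the area dominated by
    the front and bounded by the reference point. Assumes the points are
    mutually non-dominated and all dominate the reference point."""
    pts = sorted(front, key=lambda p: p[0], reverse=True)  # sort by f1, descending
    area, prev_y = 0.0, reference[1]
    for x, y in pts:  # f2 increases along this sweep, stacking rectangles
        area += (x - reference[0]) * (y - prev_y)
        prev_y = y
    return area

# Invented non-dominated reward vectors and reference point.
front = [(1.0, 9.0), (5.0, 5.0), (6.0, 1.0)]
print(hypervolume_2d(front, reference=(0.0, 0.0)))  # -> 30.0
```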
Figure 13. Comparing the median reward values for each user preference averaged over 15 runs with standard deviation bars in the non-stationary environments. (A) The search and rescue (SAR) environment. (B) The deep sea treasure (DST) environment. (C) The resource gathering (RG) environment.
Figure 14. A bar chart comparing the normalized average hypervolume values with standard deviation for the OLS, TLO, and IM-MORL agents, grouped by non-stationary environment.
