Evolving Robust Policy Coverage Sets in Multi-Objective Markov Decision Processes Through Intrinsically Motivated Self-Play
- PMID: 30356836
- PMCID: PMC6189603
- DOI: 10.3389/fnbot.2018.00065
Abstract
Many real-world decision-making problems involve multiple conflicting objectives that cannot be optimized simultaneously without compromise. Such problems, known as multi-objective Markov decision processes, pose a significant challenge for conventional single-objective reinforcement learning methods, especially when an optimal compromise cannot be determined beforehand. Multi-objective reinforcement learning methods address this challenge by finding an optimal coverage set of non-dominated policies that can satisfy any user preference for solving the problem. However, this comes at the cost of increased computational complexity and time consumption, as well as a lack of adaptability to non-stationary environment dynamics. Addressing these limitations requires adaptive methods that can solve the problem in an online and robust manner. In this paper, we propose a novel developmental method that exploits adversarial self-play between an intrinsically motivated preference exploration component and a policy coverage set optimization component, which robustly evolves a convex coverage set of policies that solve the problem using the preferences proposed by the former component. We experimentally demonstrate the effectiveness of the proposed method in comparison to state-of-the-art multi-objective reinforcement learning methods in stationary and non-stationary environments.
Keywords: Markov process; adversarial; decision making; intrinsic motivation; multi-objective optimization; reinforcement learning; self-play.
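The abstract centers on the notion of a convex coverage set (CCS): the subset of policies, each with a multi-objective value vector, that are optimal for at least one linear preference weighting. As a rough illustration of that concept only (not the paper's algorithm), the sketch below approximates CCS membership by sampling preference weights from the simplex and keeping every policy that wins under some sampled weighting; the function name, sampling scheme, and toy values are our own assumptions.

```python
import numpy as np

def approximate_ccs(value_vectors, n_weights=1000, seed=0):
    """Approximate the convex coverage set (CCS) of a set of policies.

    Each row of `value_vectors` is the multi-objective value vector of
    one candidate policy. A policy belongs to the CCS if it maximizes
    the linearly scalarized return w . V for some preference weight
    vector w on the probability simplex; here membership is estimated
    by random sampling of w rather than an exact method.
    """
    rng = np.random.default_rng(seed)
    vectors = np.asarray(value_vectors, dtype=float)
    # Sample preference weights uniformly from the simplex.
    weights = rng.dirichlet(np.ones(vectors.shape[1]), size=n_weights)
    # For each sampled preference, find the policy with the highest
    # scalarized value; the set of winners approximates the CCS.
    winners = np.unique(np.argmax(weights @ vectors.T, axis=1))
    return winners

# Example: three policies over two conflicting objectives.
values = [[1.0, 0.0],   # best on objective 1
          [0.0, 1.0],   # best on objective 2
          [0.4, 0.4]]   # never optimal under any linear preference
print(approximate_ccs(values))  # -> [0 1]; the third policy is excluded
```

In this toy case the mixed policy scores 0.4 under every weighting, while one of the two extreme policies always scores at least 0.5, so only the extremes enter the approximate CCS; exact constructions would instead test each vector with a small linear program.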
