PLoS One. 2019 Sep 11;14(9):e0222215. doi: 10.1371/journal.pone.0222215. eCollection 2019.

Multi-agent reinforcement learning with approximate model learning for competitive games


Young Joon Park et al. PLoS One. 2019.

Abstract

We propose a method for learning multi-agent policies to compete against multiple opponents. The method consists of recurrent neural network-based actor-critic networks and deterministic policy gradients that promote cooperation between agents through communication. The learning process does not require access to opponents' parameters or observations because the agents are trained separately from the opponents. The actor networks enable the agents to communicate via forward and backward paths, while the critic network helps train the actors by delivering gradient signals based on each agent's contribution to the global reward. Moreover, to address the nonstationarity caused by the evolving policies of the other agents, we propose approximate model learning that uses auxiliary prediction networks to model the state transitions, the reward function, and opponent behavior. In the test phase, we use competitive multi-agent environments to demonstrate, by comparison, the usefulness and superiority of the proposed method in terms of learning efficiency and goal achievement. The comparison results show that the proposed method outperforms the alternatives.
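To make the abstract's architecture concrete, the following is a minimal sketch of one agent's recurrent actor-critic with auxiliary prediction heads for approximate model learning. It is written in PyTorch (the abstract does not name a framework), and the class name, layer sizes, and head layout (next observation, reward, and opponent action) are illustrative assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class RecurrentActorCritic(nn.Module):
        """Sketch of a recurrent actor-critic agent with auxiliary
        prediction heads for approximate model learning (illustrative,
        not the paper's code)."""

        def __init__(self, obs_dim, act_dim, hidden=64):
            super().__init__()
            # Recurrent encoder over the observation sequence.
            self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
            # Deterministic policy head: recurrent state -> action.
            self.actor = nn.Linear(hidden, act_dim)
            # Critic head: scores the state-action pair.
            self.critic = nn.Linear(hidden + act_dim, 1)
            # Auxiliary prediction heads (approximate model learning):
            # next observation, reward, and opponent action.
            self.next_obs_head = nn.Linear(hidden + act_dim, obs_dim)
            self.reward_head = nn.Linear(hidden + act_dim, 1)
            self.opponent_head = nn.Linear(hidden, act_dim)

        def forward(self, obs_seq):
            # obs_seq: (batch, time, obs_dim)
            h, _ = self.rnn(obs_seq)
            h_last = h[:, -1]                              # last recurrent state
            action = torch.tanh(self.actor(h_last))        # deterministic action
            ha = torch.cat([h_last, action], dim=-1)
            q_value = self.critic(ha)                      # state-action value
            aux = {
                "next_obs": self.next_obs_head(ha),        # predicted next observation
                "reward": self.reward_head(ha),            # predicted reward
                "opp_action": torch.tanh(self.opponent_head(h_last)),  # predicted opponent action
            }
            return action, q_value, aux

    # Example with hypothetical dimensions:
    # net = RecurrentActorCritic(obs_dim=8, act_dim=2)
    # action, q, aux = net(torch.randn(4, 10, 8))

In training, the auxiliary outputs would be fit with, for example, mean-squared-error losses against observed transitions, rewards, and opponent actions, added to the usual deterministic-policy-gradient and temporal-difference objectives; this extra supervision is one way to read the abstract's approximate model learning for coping with nonstationarity.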


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Overview of the proposed CTRL when two adversarial teams exist.
Each team is trained independently.
Fig 2
Fig 2. Architectures of the CTRL with an AMLAPN.
In the training phase, the observations are sequentially processed by the actor and critic along the arrows. The gradient signals are propagated in the reverse direction of the arrows. The shaded regions represent the auxiliary prediction networks for the approximate model learning. The number of units is shown in parentheses.
Fig 3
Fig 3. Illustrations of the experimental environment for four scenarios: physical deception (top left), keep-away (top right), predator-prey (bottom left), and complex predator-prey (bottom right).
Fig 4
Fig 4. Learning curves for the four competitive scenarios: episode rewards in the physical deception (top left), keep-away (top right), predator-prey (bottom left), and complex predator-prey (bottom right) scenarios. Each bar cluster represents the converged episode reward at the end of training. The shaded region is a 95% confidence interval across the different random seeds.
Fig 5
Fig 5. Relative performances in round-robin tournament evaluations: the performances of team A trained by the four methods (a), and the performances of team B trained by the four methods (b). Each bar cluster shows the score for a set of competing policies; a higher score is better for the agent.
Fig 6
Fig 6. Learning curves in the partial-observation environments: episode rewards in the predator-prey (left) and complex predator-prey (right) scenarios. Each bar cluster represents the converged episode reward at the end of training. The shaded region is a 95% confidence interval across the different random seeds.
Fig 7
Fig 7. Average relative performances with partial observation in round-robin tournament evaluations: the performances of team A trained by the four methods (a), and the performances of team B trained by the four methods (b). Each bar cluster shows the score for a set of competing policies; a higher score is better for the agent.


