Learning multi-agent cooperation

Corban Rivera et al.

Front Neurorobot. 2022 Oct 14;16:932671. doi: 10.3389/fnbot.2022.932671. eCollection 2022.
Abstract

Advances in reinforcement learning (RL) have resulted in recent breakthroughs in the application of artificial intelligence (AI) across many different domains. An emerging landscape of development environments is making powerful RL techniques more accessible for a growing community of researchers. However, most existing frameworks do not directly address the problem of learning in complex operating environments, such as dense urban settings or defense-related scenarios, that incorporate distributed, heterogeneous teams of agents. To help enable AI research for this important class of applications, we introduce the AI Arena: a scalable framework with flexible abstractions for associating agents with policies and policies with learning algorithms. Our results highlight the strengths of our approach, illustrate the importance of curriculum design, and measure the impact of multi-agent learning paradigms on the emergence of cooperation.

Keywords: artificial intelligence; learned cooperation; multi-agent; policy learning; reinforcement learning.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Simultaneous multi-agent multi-strategy policy learning. The figure illustrates an example multi-agent setting where the policies of individual agents are learned simultaneously with different learning strategies.
Figure 2
Possible worker configurations. Policy workers may be attached to environment entities in any desired combination. (A) Three entities are assigned to three independent policies that are learning separately and may have workers attached to other environments. (B) Entities are each attached to workers of the same policy, such that some of the agents contributing to the policy coexist in the same environment. (C) Entities are all attached to the same policy worker, which takes all of their data into account in a multi-agent manner, possibly as one of many workers for a distributed multi-agent algorithm.
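As a concrete illustration of these three configurations, the following minimal Python sketch expresses them as entity-to-policy assignments. The dictionaries and names are hypothetical and do not reflect the actual AI Arena API; they only mirror the assignments described in (A)-(C).

```python
# Hypothetical sketch of the three worker configurations in Figure 2.
# The names below are illustrative only; they are not the AI Arena API.

# (A) Three entities assigned to three independent policies learning separately.
config_a = {
    "entity_0": {"policy": "policy_0", "worker": 0},
    "entity_1": {"policy": "policy_1", "worker": 0},
    "entity_2": {"policy": "policy_2", "worker": 0},
}

# (B) Each entity drives its own worker of a single shared policy, so several
#     agents contributing to that policy coexist in the same environment.
config_b = {
    "entity_0": {"policy": "shared_policy", "worker": 0},
    "entity_1": {"policy": "shared_policy", "worker": 1},
    "entity_2": {"policy": "shared_policy", "worker": 2},
}

# (C) All entities feed one multi-agent worker, which may be one of many
#     workers for a distributed multi-agent algorithm.
config_c = {
    "entity_0": {"policy": "multiagent_policy", "worker": 0},
    "entity_1": {"policy": "multiagent_policy", "worker": 0},
    "entity_2": {"policy": "multiagent_policy", "worker": 0},
}
```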
Figure 3
TanksWorld Multi-Agent Environment for AI Safety Research. These images illustrate different views of the TanksWorld environment: (Left) a bird's-eye rendering of the environment, (Center) an agent's-view rendering, and (Right) the state representation actually provided as observations to the RL algorithm.
Figure 4
Pseudo-code and process diagram for the TanksWorld training scheme. The training scheme was used by both the static and dynamic curriculum. (Left) Pseudo-code of the training scheme against multiple policies. “ppo” refers to a PPO (Schulman et al., 2017) policy that is being trained, while the other policies refer to custom frozen policies. (Right) The resulting processes and their organization. Four environments are created, each housing 10 entities. The agents in the competition are depicted as nodes and colored based on the policies that they follow. All blue tanks are contributing to a single policy.
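The sketch below illustrates how such an assignment could be laid out in Python: one trainable PPO policy for all blue tanks and frozen opponent policies for the remaining entities, replicated across four environments. The function and policy names, and the 5-vs-5 split of the 10 entities, are assumptions for illustration; this is not the paper's actual pseudo-code or the AI Arena API.

```python
# Illustrative sketch of the process organization described in Figure 4.
# One PPO policy is trained for all blue tanks; the other tanks follow
# custom frozen policies (names are hypothetical).

NUM_ENVIRONMENTS = 4
TANKS_PER_TEAM = 5  # 10 entities per environment: 5 blue vs. 5 red (assumed split)

def make_assignment(env_id, frozen_policies):
    """Map each entity in one environment to the policy it follows."""
    assignment = {}
    for i in range(TANKS_PER_TEAM):
        # All blue tanks contribute experience to the single trainable PPO policy.
        assignment[f"env{env_id}/blue_{i}"] = "ppo"
        # Opposing tanks cycle through the custom frozen opponent policies.
        assignment[f"env{env_id}/red_{i}"] = frozen_policies[i % len(frozen_policies)]
    return assignment

frozen = ["frozen_policy_a", "frozen_policy_b", "frozen_policy_c"]  # hypothetical names
assignments = [make_assignment(e, frozen) for e in range(NUM_ENVIRONMENTS)]
```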
Figure 5
Pseudo-code for the dynamic vs. static training curriculum. The dynamic curriculum increases penalties for safety violations from 0 to 0.3 in increments of 0.05, distributed evenly over four million steps. The static curriculum keeps the penalty for safety violations at 0.3 throughout training. The safety violation parameter sets the penalty for damage or death caused by an ally to another ally or neutral entity.
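A minimal sketch of the two schedules, assuming the step-based form described above (the exact placement of the increments is an assumption):

```python
def safety_penalty(step, curriculum="dynamic",
                   max_penalty=0.3, increment=0.05, total_steps=4_000_000):
    """Penalty weight for damage or death caused to an ally or neutral entity.

    Sketch of the Figure 5 schedules (assumed step-based form): the dynamic
    curriculum raises the penalty from 0 to 0.3 in 0.05 increments spread
    evenly over four million steps; the static curriculum holds 0.3 throughout.
    """
    if curriculum == "static":
        return max_penalty
    num_increments = round(max_penalty / increment)        # 6 increases of 0.05
    steps_per_increment = total_steps / num_increments     # ~667k steps per increase
    level = min(int(step // steps_per_increment), num_increments)
    return level * increment
```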
Figure 6
Round-based curriculum training organized with the AI Arena (brown). Successive rounds of training increased the difficulty by slowly introducing safety penalties over three rounds of training with penalty weights (0, 0.05, 0.1, 0.15, 0.20, 0.25, 0.3). The baseline (green) starts with a penalty threshold of 0.3. The result illustrates the value of successive rounds of curriculum training for teams of tanks in the AI Safety Challenge domain.
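A minimal sketch of the round-based organization, assuming each round resumes from the previous round's policy; `train_round` is a hypothetical stand-in for one round of TanksWorld training, not an AI Arena function:

```python
# Sketch of the round-based curriculum in Figure 6 (illustrative only).

PENALTY_WEIGHTS = [0, 0.05, 0.1, 0.15, 0.20, 0.25, 0.3]

def curriculum_training(train_round, initial_policy):
    """Successive rounds of training with increasing safety penalties."""
    policy = initial_policy
    for penalty in PENALTY_WEIGHTS:
        # Each round resumes from the previous round's policy with a harder penalty.
        policy = train_round(policy, safety_penalty=penalty)
    return policy

def baseline_training(train_round, initial_policy):
    """Baseline: train with the full 0.3 penalty from the start."""
    return train_round(initial_policy, safety_penalty=0.3)
```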
Figure 7
Cooperative navigation environment and behavior. The Cooperative Navigation environment has three targets (black squares) and three agents (circles). The agents must coordinate to cover all targets. (Left) Illustration of the environment and a potential solution. (Right) Traces of the testing behavior from the learned MASAC policies. The actions were reduced in magnitude to create slow paths to the targets. Some targets appear as outlines, showing that the agent happened to start on or near that target. The traces are interesting in that they show clear coordination but occasionally sub-optimal pairings of entities and targets. If the actions were not reduced, such that the entities reached the targets in only a handful of steps, these sub-optimalities would have little consequence for the score.
Figure 8
Training curves for MASAC and comparisons. (Left) Entity assignments for the three approaches: Truly multiagent policy (MASAC), treating each entity as an SAC worker, and grouping all entities into a single agent. In all cases, the assignment was duplicated over several environments for distributed training. (Right) The corresponding training curves for each approach. MASAC was the only successful algorithm, making slow and deliberate progress for roughly 18 million steps before leveling off.
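To make the three entity assignments concrete, the following sketch contrasts how experience could be structured in each case for the three-agent cooperative navigation task. The observation and action dimensions are placeholders, and the internals of MASAC are not shown; this is an illustration of the assignments, not the paper's implementation.

```python
import numpy as np

# Illustrative comparison (assumed shapes) of the three entity assignments in
# Figure 8 for the three-agent cooperative navigation task.

OBS_DIM, ACT_DIM, N_ENTITIES = 18, 2, 3  # placeholder dimensions

# (1) Truly multi-agent (MASAC): each entity keeps its own observation and
#     action, and one multi-agent worker receives all of them jointly.
masac_batch = {f"agent_{i}": {"obs": np.zeros(OBS_DIM), "act": np.zeros(ACT_DIM)}
               for i in range(N_ENTITIES)}

# (2) Independent SAC workers: each entity is treated as a separate
#     single-agent problem contributing to its own SAC learner.
sac_batches = [{"obs": np.zeros(OBS_DIM), "act": np.zeros(ACT_DIM)}
               for _ in range(N_ENTITIES)]

# (3) Single grouped agent: observations and actions of all entities are
#     concatenated so one policy controls the whole team at once.
grouped_batch = {"obs": np.zeros(OBS_DIM * N_ENTITIES),
                 "act": np.zeros(ACT_DIM * N_ENTITIES)}
```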

References

    1. Abel D. (2019). Simple_rl: reproducible reinforcement learning in Python, in RML@ICLR (New Orleans, LA).
    2. Bengio Y., Louradour J., Collobert R., Weston J. (2009). Curriculum learning, in Proceedings of the 26th Annual International Conference on Machine Learning (New York, NY), 41–48. doi: 10.1145/1553374.1553380
    3. Brown N., Sandholm T. (2019). Superhuman AI for multiplayer poker. Science 365, 885–890. doi: 10.1126/science.aay2400
    4. Busoniu L., Babuska R., De Schutter B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C 38, 156–172. doi: 10.1109/TSMCC.2007.913919
    5. Cai Y., Yang S. X., Xu X. (2013). A combined hierarchical reinforcement learning based approach for multi-robot cooperative target searching in complex unknown environments, in 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) (Singapore), 52–59. doi: 10.1109/ADPRL.2013.6614989
