PLoS One. 2022 Mar 18;17(3):e0265456. doi: 10.1371/journal.pone.0265456. eCollection 2022.

Action-driven contrastive representation for reinforcement learning

Minbeom Kim et al. PLoS One. 2022.

Abstract

In reinforcement learning, reward-driven feature learning directly from high-dimensional images faces two challenges: sample efficiency for solving control tasks and generalization to unseen observations. Prior works have addressed these issues by learning representations from pixel inputs. However, such representations are either vulnerable to the high visual diversity inherent in environments or fail to capture the features that matter for solving control tasks. To mitigate these problems, we propose a novel contrastive representation method, the Action-Driven Auxiliary Task (ADAT), which forces the representation to concentrate on the features essential for deciding actions and to ignore control-irrelevant details. Using an augmented state-action dictionary, ADAT trains the agent to maximize agreement between observations that share the same action. The proposed method significantly outperforms model-free and model-based algorithms on Atari and OpenAI ProcGen, widely used benchmarks for sample efficiency and generalization.
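As a rough illustration of the contrastive objective described above, the sketch below treats keys that share a query's action as positive pairs under an InfoNCE-style loss. This is a minimal sketch assuming PyTorch and batched query/key embeddings; the function name, temperature, and tensor shapes are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): observations assigned the same
# action serve as positive pairs in an InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

def action_driven_contrastive_loss(query_z, key_z, query_actions, key_actions,
                                    temperature=0.1):
    """query_z, key_z: (N, D) embeddings from the query/key encoders.
    query_actions, key_actions: (N,) integer action labels.
    Keys sharing the query's action are treated as positives."""
    query_z = F.normalize(query_z, dim=1)
    key_z = F.normalize(key_z, dim=1)
    logits = query_z @ key_z.t() / temperature                    # (N, N) similarities
    positives = (query_actions.unsqueeze(1) == key_actions.unsqueeze(0)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-likelihood over all positive keys for each query.
    loss = -(log_prob * positives).sum(1) / positives.sum(1).clamp(min=1)
    return loss.mean()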

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Reconstruction from an autoencoder-based representation.
The Bigfish environment of ProcGen, whose background contains a variety of wallpaper patterns. Under the auxiliary task of reconstructing the original inputs, the left images are the pixel inputs and the right images are reconstructions from their representations. When the background pixels contain many control-irrelevant details, the encoder first learns an abstraction of those irrelevant details instead of the pixels essential for gameplay (i.e., the fish in this example). Our proposal is motivated by this issue.
Fig 2
Fig 2. The overall framework of ADAT.
State-action pairs are sampled from a history of interactions. The dictionary groups states by action type, and keys are labeled according to the query's action. After the 'Random Translate' augmentation [22], the query is encoded for reinforcement learning and the query-key pairs are encoded for contrastive learning [20]. Only the query encoder learns from the contrastive loss; the key encoder is trained via momentum update [19].
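A minimal sketch of the momentum update mentioned in the caption, assuming PyTorch modules with identical architectures for the query and key encoders; the momentum value is an illustrative assumption following the MoCo-style scheme cited as [19].

# Slowly move the key encoder's weights toward the query encoder's.
# Only the query encoder receives gradients from the contrastive loss.
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, momentum=0.999):
    for q_param, k_param in zip(query_encoder.parameters(),
                                key_encoder.parameters()):
        k_param.data.mul_(momentum).add_(q_param.data, alpha=1.0 - momentum)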
Fig 3
Fig 3. Sampling efficacy of the two methods.
By extracting twice the required number of samples and sorting them, all elements are used equally for learning. Unbiased sampling prevents unhelpful early records from being leveraged much more than well-populated labels.
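One possible way to realize the balanced sampling described above is sketched below: draw twice the required number of state-action pairs, group them by action, and keep an equal share per action. The buffer layout and helper names are hypothetical assumptions; this illustrates the idea rather than the authors' code.

# Hypothetical sketch of action-balanced ("unbiased") sampling.
import random
from collections import defaultdict

def unbiased_sample(replay_buffer, batch_size):
    """replay_buffer: list of (state, action) pairs; returns a balanced batch."""
    candidates = random.sample(replay_buffer, min(2 * batch_size, len(replay_buffer)))
    by_action = defaultdict(list)
    for state, action in candidates:
        by_action[action].append((state, action))
    per_action = max(1, batch_size // max(1, len(by_action)))
    batch = []
    for pairs in by_action.values():
        batch.extend(pairs[:per_action])          # equal share per action label
    return batch[:batch_size]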
Fig 4
Fig 4. The frame-stacked inputs.
For Atari agents, four sequentially stacked frames are translated from 84x84 to 92x92 pixels with zero padding. Likewise, for ProcGen agents, two stacked frames are translated from 64x64 to 72x72 pixels [22].
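A minimal sketch of the 'Random Translate' augmentation described above, assuming NumPy arrays shaped (channels, height, width); the sizes follow the caption (e.g., an 84x84 Atari input padded to 92x92 with zeros at a random offset), while the function name is an illustrative assumption.

# Pad the frame-stacked input with zeros and place it at a random offset.
import numpy as np

def random_translate(obs, out_size=92):
    """obs: (C, H, W) stacked frames, e.g. (4, 84, 84) for Atari.
    Returns a zero-padded (C, out_size, out_size) array with obs at a random position."""
    c, h, w = obs.shape
    canvas = np.zeros((c, out_size, out_size), dtype=obs.dtype)
    top = np.random.randint(0, out_size - h + 1)
    left = np.random.randint(0, out_size - w + 1)
    canvas[:, top:top + h, left:left + w] = obs
    return canvas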
Fig 5
Fig 5. Comparison of generalization capability.
Bigfish and Plunder games in OpenAI ProcGen. Agents trained on 200 levels were evaluated on 100,000 levels. On-policy ADAT leveraged the samples rolled out by the PPO policy; the only difference between the two frameworks is contrastive representation learning through ADAT. The figure reports the average score and standard deviation over five random seeds.
Fig 6
Fig 6. Saliency maps for the Bigfish and Plunder games in OpenAI ProcGen.
From left to right: the original rendering, the saliency map of PPO, and the saliency map of PPO with ADAT.

References

    1. Such F, Madhavan V, Liu R, Wang R, Castro P, Li Y, et al. An Atari Model Zoo for Analyzing, Visualizing, and Comparing Deep Reinforcement Learning Agents. In: International Joint Conference on Artificial Intelligence; 2019. pp. 3260–3267.
    2. Anand A, Racah E, Ozair S, Bengio Y, Côté MA, Hjelm RD. Unsupervised state representation learning in Atari. In: Advances in Neural Information Processing Systems; 2019. pp. 8769–8782.
    3. Yarats D, Zhang A, Kostrikov I, Amos B, Pineau J, Fergus R. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741. 2019.
    4. Gregor K, Besse F. Temporal Difference Variational Auto-Encoder. arXiv preprint arXiv:1806.03107. 2019.
    5. Higgins I, Pal A, Rusu AA, Matthey L, Burgess CP, Pritzel A, et al. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv preprint arXiv:1707.08475. 2017.
