Research
Second-person and top-down views of a BYOL-Explore agent solving the Throw-Across level of DM-HARD-8, while pure RL and other baseline exploration methods fail to make progress on Throw-Across.
Curiosity-driven exploration is the active process of seeking new information to enhance the agent's understanding of its environment. Suppose the agent has learned a model of the world that can predict future events given the history of past events. The curiosity-driven agent can then use the prediction mismatch of the world model as an intrinsic reward, directing its exploration policy toward seeking new information. The agent in turn uses this new information to improve the world model itself so that it makes better predictions. This iterative process can allow the agent to eventually explore all the novelties in the world and use this information to build an accurate world model.
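The loop above can be sketched numerically. This is a minimal illustration, not the agent's actual architecture: the "world model" here is a fixed linear map standing in for a trained network, and the dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for a learned world model: a fixed linear map
# that "predicts" the next observation from the current one.
W = rng.normal(size=(8, 8)) * 0.1

def world_model(obs):
    return W @ obs

def intrinsic_reward(obs, next_obs):
    """Curiosity signal: the model's prediction error on what actually
    happened next. High error = surprising = worth exploring."""
    pred = world_model(obs)
    return float(np.sum((pred - next_obs) ** 2))

obs = rng.normal(size=8)
familiar_next = world_model(obs)   # a transition the model predicts perfectly
novel_next = rng.normal(size=8)    # an unmodeled, novel transition

# A poorly predicted transition yields a larger intrinsic reward,
# steering the exploration policy toward new information.
print(intrinsic_reward(obs, familiar_next))  # -> 0.0
print(intrinsic_reward(obs, novel_next))     # strictly positive
```

Training the world model on the newly gathered transitions then shrinks the reward in visited regions, so the agent keeps moving to wherever its predictions are still poor.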
Inspired by the successes of bootstrap your own latent (BYOL) – which has been applied in computer vision, graph representation learning, and representation learning in RL – we propose BYOL-Explore: a conceptually simple yet general curiosity-driven agent for solving hard-exploration tasks. BYOL-Explore learns a representation of the world by predicting its own future representation. It then uses the representation-level prediction error as an intrinsic reward to train a curiosity-driven policy. Therefore, BYOL-Explore learns a world representation, the dynamics of the world, and a curiosity-driven exploration policy all together, simply by optimizing the prediction error at the representation level.
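The core mechanism can be sketched as follows. This is a simplified sketch under stated assumptions, not the published architecture: the encoders and predictor are linear maps rather than deep networks, and the action-conditioned recurrent predictor of BYOL-Explore is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
D_OBS, D_REP = 8, 4  # illustrative dimensions

# Illustrative linear "networks"; in BYOL-Explore these are deep nets.
online_enc = rng.normal(size=(D_REP, D_OBS)) * 0.1
target_enc = online_enc.copy()   # target starts as a copy of the online net
predictor  = rng.normal(size=(D_REP, D_REP)) * 0.1

def l2_normalize(x):
    return x / (np.linalg.norm(x) + 1e-8)

def byol_explore_reward(obs, next_obs):
    """Representation-level prediction error: the online network predicts
    the target network's embedding of the next observation. This single
    quantity serves both as the training loss for the online network and
    as the intrinsic reward for the exploration policy."""
    pred = l2_normalize(predictor @ (online_enc @ obs))
    target = l2_normalize(target_enc @ next_obs)  # treated as a fixed target
    return float(np.sum((pred - target) ** 2))

def update_target(tau=0.99):
    """The target network trails the online network via an exponential
    moving average, providing a slowly changing, stable prediction target
    (the 'bootstrap your own latent' idea)."""
    global target_enc
    target_enc = tau * target_enc + (1 - tau) * online_enc
```

Because both the loss and the intrinsic reward are this one prediction error, minimizing it improves the representation and the dynamics model while maximizing it (through the policy) drives exploration.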
Comparison between BYOL-Explore, Random Network Distillation (RND), Intrinsic Curiosity Module (ICM), and pure RL (no intrinsic reward) on DM-HARD-8, in terms of mean capped human normalized score (CHNS).
Despite the simplicity of its design, when applied to the DM-HARD-8 suite of challenging, visually complex, 3-D hard-exploration tasks, BYOL-Explore surpasses standard curiosity-driven exploration methods such as Random Network Distillation (RND) and the Intrinsic Curiosity Module (ICM) in terms of mean capped human normalized score (CHNS), measured across all tasks. Remarkably, BYOL-Explore achieves this performance using a single network trained simultaneously on all tasks, whereas prior work was limited to the single-task setting and could only make meaningful progress on these tasks when provided with expert human demonstrations.
As further evidence of its generality, BYOL-Explore achieves superhuman performance on the ten hardest-exploration Atari games, while having a simpler design than other competitive agents such as Agent57 and Go-Explore.
Comparison between BYOL-Explore, Random Network Distillation (RND), Intrinsic Curiosity Module (ICM), and pure RL (no intrinsic reward) on the ten hardest-exploration Atari games, in terms of mean capped human normalized score (CHNS).
Moving forward, we aim to generalize BYOL-Explore to highly stochastic environments by learning a probabilistic world model that can generate trajectories of future events. This would allow the agent to model the stochasticity of the environment, avoid stochastic traps, and plan its exploration.