Dreamer
Table of Contents
- Dreamer v1 - ICLR 2020 - Dream to Control: Learning Behaviors by Latent Imagination [pdf]
- Dreamer v2 - ICLR 2021 - Mastering Atari With Discrete World Models [pdf]
- Dreamer v3
- Dreamer v4 - arXiv 2025 - Training Agents Inside of Scalable World Models [pdf]
1. Dreamer v1
Dreamer v1 outperformed prior methods in final performance, sample efficiency, and computation time on the DeepMind Control Suite continuous control tasks. However, it works well in only a few of the Atari environments.
2. Dreamer v2
Dreamer v2 outperformed other methods in Atari environments by making a few changes to Dreamer v1: using REINFORCE as the policy update algorithm, using categorical latents instead of continuous ones, and applying a KL balancing technique to the dynamics model.
It was tested on 55 Atari games with 200M steps.
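KL balancing can be illustrated with a small sketch. The idea is that the KL term between the posterior (which sees the observation) and the prior (predicted by the dynamics model) sends a larger share of its gradient to the prior than to the posterior. In an autodiff framework this is usually written with stop-gradients; the sketch below instead scales the analytic gradients directly, for a single categorical latent. The function names are hypothetical, and `alpha=0.8` is an assumed default, not a value stated in these notes.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_balanced_grads(post_logits, prior_logits, alpha=0.8):
    """Return the KL value and the balanced gradients w.r.t. both sets of logits.

    alpha scales the gradient that trains the prior (dynamics model);
    (1 - alpha) scales the gradient that regularizes the posterior.
    alpha=0.8 is an assumed default for illustration.
    """
    p = softmax(post_logits)   # posterior: conditioned on the observation
    q = softmax(prior_logits)  # prior: predicted without the observation
    kl = kl_categorical(p, q)
    # Analytic gradients of KL(p || q) for softmax-parameterized categoricals.
    grad_prior = [qi - pi for pi, qi in zip(p, q)]
    grad_post = [pi * (math.log(pi / qi) - kl) for pi, qi in zip(p, q)]
    # Balancing: the prior is pulled toward the posterior faster than
    # the posterior is pulled toward the prior.
    balanced_post = [(1.0 - alpha) * g for g in grad_post]
    balanced_prior = [alpha * g for g in grad_prior]
    return kl, balanced_post, balanced_prior

# Example: a confident posterior against a uniform prior.
kl, g_post, g_prior = kl_balanced_grads([2.0, 0.0, -1.0], [0.0, 0.0, 0.0])
```

With `alpha` above 0.5, the prior gradient is weighted more heavily, so the dynamics model does most of the work of closing the gap.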
3. Dreamer v3
Dreamer v3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula. Moreover, it works with a single hyperparameter configuration for a wide range of tasks (Atari, ProcGen, DMLab, Minecraft, Atari100k, Proprio Control, Visual Control, BSuite).
4. Dreamer v4
Dreamer v4 uses a transformer architecture to learn highly accurate world models for complex environments, and RL is done entirely inside the learned world model. It can collect diamonds in Minecraft after being trained purely on an offline dataset.
Because training is done completely offline, the approach has potential for physical robots, where a partially trained policy cannot be deployed and training must therefore rely on offline data alone.
5. Plan2Explore - ICML 2020
The objective is to explore the environment without any task-specific information and then, once a task is given, solve it zero-shot or few-shot. World models learned by other agents (like Dreamer v2) do not generalize zero-shot to different tasks, whereas the world model learned by Plan2Explore does.
6. Mastering URLB from Pixels - ICML 2023
URLB (Unsupervised RL Benchmark) is a benchmark for the problem setting that Plan2Explore presents. This paper outperforms Plan2Explore on it.
Pre-training:
URLB has a 2M step pre-training phase.
Latent Bayesian Surprise (LBS) is used as the intrinsic reward. It rewards the agent for reaching states that the world model struggles to predict.
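One way to read the LBS reward is as a KL divergence in latent space: the reward is large when the latent belief after seeing the next observation (posterior) differs a lot from what the model predicted beforehand (prior). The sketch below assumes diagonal Gaussian latents purely for illustration; the function names are hypothetical, and in the actual method both distributions come from learned networks.

```python
import math

def gaussian_kl(mu_post, std_post, mu_prior, std_prior):
    """KL(posterior || prior) for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mu_post, std_post, mu_prior, std_prior):
        total += math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5
    return total

def surprise_reward(mu_post, std_post, mu_prior, std_prior):
    """Intrinsic reward: how much the observation changed the latent belief."""
    return gaussian_kl(mu_post, std_post, mu_prior, std_prior)

# A well-predicted transition: posterior matches the prior, so no reward.
r_familiar = surprise_reward([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# A poorly predicted transition: the posterior mean has moved away.
r_novel = surprise_reward([2.0, -1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

Transitions the model already predicts well earn little reward, which pushes the agent toward parts of the environment it has not yet modeled.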
Fine-tuning:
URLB gives 100k steps for fine-tuning on a specific task.
The critic is always discarded, while the actor is kept only for dense-reward problems.
Dyna-MPC plans over thousands of imagined trajectories to select an action before each real step in the environment.
The method achieves a 93.59% score on URLB.
Beyond URLB, it also shows good performance on the Real-World RL benchmark (RWRL).
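The planning step described above can be sketched as a plain random-shooting planner: sample many action sequences, roll each one out in the model, score it by discounted reward plus a value estimate at the end, and execute only the first action of the best sequence. This is a simplification and not the paper's exact Dyna-MPC procedure, which may refine candidates iteratively and draw them from the learned actor. The 1-D toy model, reward, and value functions are hypothetical stand-ins for the learned world model.

```python
import random

def plan_action(state, model_step, reward_fn, value_fn,
                horizon=5, num_candidates=1000, gamma=0.99, seed=0):
    """Return the first action of the best imagined action sequence."""
    rng = random.Random(seed)
    best_return = float("-inf")
    best_action = 0.0
    for _ in range(num_candidates):
        # Sample one candidate action sequence with actions in [-1, 1].
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, ret, discount = state, 0.0, 1.0
        for a in actions:
            s = model_step(s, a)            # imagined transition, no real env step
            ret += discount * reward_fn(s)  # predicted reward
            discount *= gamma
        ret += discount * value_fn(s)       # bootstrap beyond the horizon
        if ret > best_return:
            best_return, best_action = ret, actions[0]
    return best_action

# Hypothetical toy problem: move a point on a line toward TARGET.
TARGET = 1.0

def toy_model_step(s, a):
    return s + 0.1 * a

def toy_reward(s):
    return -abs(s - TARGET)

def toy_value(s):
    return -abs(s - TARGET)

action = plan_action(0.0, toy_model_step, toy_reward, toy_value)
```

Only the first action is executed in the real environment; planning is repeated from the next real state, which is what makes this model-predictive control.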
7. Some papers
- Choreographer - ICLR 2023
  - https://arxiv.org/pdf/2211.13350
  - Table 1 shows that Choreographer beats Plan2Explore (P2E).
  - This paper doesn't compare with Mastering URLB from Pixels.
- Diversity is all you need (DIAYN) - ICLR 2019
- Constrained Ensemble Exploration for Unsupervised Skill Discovery - ICML 2024
- Balancing State Exploration and Skill Diversity in Unsupervised Skill Discovery
  - IEEE Transactions on Cybernetics 2025
  - https://arxiv.org/pdf/2309.17203