2026-05-08

Dreamer v4

Notes on the Dreamer v4 paper (arXiv 2025), "Training Agents Inside of Scalable World Models".

The architecture of Dreamer v3 cannot fit complex real-world distributions. Dreamer v4 uses a transformer architecture to address this.

From the abstract: "By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction. Our work provides a scalable recipe for imagination training, marking a step towards intelligent agents."

1. Results

  • Real-time interactive inference on a single H100 GPU, through a shortcut forcing objective and an efficient transformer architecture.
  • The world model learns mostly from diverse unlabeled videos and requires only a small amount of video with action labels.

2. World Model

Representation:

The world model consists of a tokenizer and a dynamics model.

The dynamics model is a diffusion model trained with the flow matching framework. It is in fact a shortcut model, so it can also model the dynamics in a single step or very few steps.

Dreamer v4 uses a transformer-based architecture, in contrast to the Recurrent State Space Models of previous Dreamer versions. The transformer is 2D, with a spatial axis and a causal temporal axis.

Training:

  1. World Model Pretraining: The tokenizer and dynamics model are trained on an offline dataset of videos and actions.
  2. Agent Finetuning (Behaviour Cloning):

    Task embedding tokens (called agent tokens) are added to the input sequence, and a reward head and a policy head are added to the transformer. The transformer now outputs dynamics, reward, and policy predictions.

    The agent tokens may attend to all other tokens, and the reward and policy are read out from them. The other tokens are not allowed to attend to the agent tokens, so the dynamics prediction is not influenced by the task-specific agent tokens.

    The model is finetuned on a mixture of 50% original data and 50% successful task-specific data.

  3. Imagination Training (Reinforcement Learning)

    RL is done with the PMPO algorithm. PMPO only considers the sign of the advantage, not its magnitude. This avoids having to normalize returns or advantages, and ensures that the model focuses equally on all tasks even if they have different reward scales.
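The one-way attention rule from step 2 can be illustrated with a boolean mask. This is a minimal sketch for a single time step, not the paper's implementation: the function name `build_agent_mask` and the token layout (world tokens first, agent tokens last) are assumptions made for illustration, and the causal temporal axis is left out.

```python
def build_agent_mask(num_world: int, num_agent: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout (hypothetical): indices 0..num_world-1 are world tokens,
    the remaining num_agent indices are agent tokens.
    """
    n = num_world + num_agent
    mask = []
    for q in range(n):
        q_is_agent = q >= num_world
        row = []
        for k in range(n):
            k_is_agent = k >= num_world
            # Agent queries see every token; world queries see only world keys.
            row.append(q_is_agent or not k_is_agent)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in build_agent_mask(num_world=3, num_agent=1):
        print(["x" if allowed else "." for allowed in row])
```

Because no world-token row has an entry pointing at an agent column, the dynamics output is identical whether or not agent tokens are present, which is the property the notes describe.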

3. Differences

Compared to previous Dreamer iterations, this version differs as follows:

  1. Uses a transformer instead of a Recurrent State Space Model.
  2. Uses PMPO instead of REINFORCE or value-estimate gradients for policy training.
  3. The model is much larger: 2B parameters in total (400M tokenizer, 1.6B dynamics model), trained on 256 to 1024 TPU-v5p chips. In contrast, previous Dreamer models were trained on a single GPU.
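To make item 2 concrete, here is a rough sketch of a sign-based policy loss in the spirit of PMPO. It assumes the signal is a per-sample advantage estimate, uses a hypothetical weighting parameter `alpha`, and omits the regularization toward a prior policy that a full PMPO objective would include. The function name `pmpo_loss` is illustrative only.

```python
def pmpo_loss(log_probs: list[float], advantages: list[float],
              alpha: float = 0.5) -> float:
    """Sign-based policy loss (sketch, KL regularizer omitted).

    log_probs[i]  : log pi(a_i | s_i) under the current policy
    advantages[i] : advantage estimate; only its sign is used
    alpha         : assumed weight on the positive set
    """
    pos = [lp for lp, a in zip(log_probs, advantages) if a >= 0]
    neg = [lp for lp, a in zip(log_probs, advantages) if a < 0]
    loss = 0.0
    if pos:
        # Raise the likelihood of actions with non-negative advantage.
        loss -= alpha * sum(pos) / len(pos)
    if neg:
        # Lower the likelihood of actions with negative advantage.
        loss += (1.0 - alpha) * sum(neg) / len(neg)
    return loss


if __name__ == "__main__":
    lp = [-1.0, -2.0, -0.5]
    adv = [3.0, -0.5, 0.1]
    # Rescaling the advantages leaves the loss unchanged.
    print(pmpo_loss(lp, adv), pmpo_loss(lp, [100.0 * a for a in adv]))
```

Since the advantage magnitude never enters the loss, tasks with very different reward scales contribute equally, and no return normalization is needed.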

4. Details

4.1. Flow matching for diffusion

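Here, with \(x_0\) the noise sample, \(x_1\) the data sample, and the linear interpolation \(x_\tau = (1-\tau)x_0 + \tau x_1\), the network \(f_{\theta}(x_\tau,\tau)\) learns to predict the velocity vector \(v = x_1 - x_0\). Sampling is done in \(K\) steps with step size \(d = 1/K\) as: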

\begin{align*} x_{\tau+d} = x_{\tau}+ f_{\theta}(x_{\tau},\tau) d \end{align*}

It might seem that sampling would work better if the network target were \(x_1 - x_\tau\), the remaining displacement to the data. In fact

\begin{align*} x_1 - x_{\tau} = (1 - \tau)(x_1-x_0) \end{align*}

so it is just a reparameterization by a factor of \((1 - \tau)\). But it is easier for the network to learn \(x_1 - x_0\), because the target has no dependence on \(\tau\). Also, the update equation using \(x_1 - x_\tau\) needs a division by \(1-\tau\), which is singular as \(\tau \rightarrow 1\).
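The sampling rule above is a plain Euler integration of the learned velocity field. A minimal sketch, assuming vectors are Python lists and that `f` is any callable standing in for \(f_\theta\) (the name `euler_sample` is hypothetical):

```python
from typing import Callable

Vector = list[float]


def euler_sample(f: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate x_{tau+d} = x_tau + f(x_tau, tau) * d from tau=0 to tau=1."""
    d = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        tau = k * d
        v = f(x, tau)  # predicted velocity at the current point
        x = [xi + vi * d for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    x0, x1 = [0.0, 0.0], [1.0, -2.0]
    # Stand-in for a perfectly trained network: the constant velocity x1 - x0.
    ideal = lambda x, tau: [b - a for a, b in zip(x0, x1)]
    print(euler_sample(ideal, x0, num_steps=4))
```

With a perfectly constant velocity the result does not depend on the number of steps. A real learned field is only approximately straight, which is why fewer steps normally cost quality.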

4.2. Shortcut Model

If we also condition the network on the step size, i.e. \(f_\theta(x_{\tau}, \tau, d)\), then at inference we can sample with fewer steps (2-4 steps). For training, \(d\) is sampled from the set \(\{1, 1/2, 1/4, \dots, 1/K_{\max}\}\). For \(d = 1/K_{\max}\), the flow matching loss is used. For every other \(d\), the target is the average of two half steps, \(f_\theta(x_\tau, \tau, d/2)\) and \(f_\theta(x'_{\tau+d/2}, \tau + d/2, d/2)\), where \(x'_{\tau+d/2} = x_\tau + f_\theta(x_\tau, \tau, d/2)\, d/2\) is the point reached after the first half step.
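The self-consistency target for the larger step sizes can be sketched as below. This is an illustration under assumptions, not the paper's code: `f` is any callable standing in for \(f_\theta\), vectors are Python lists, and the name `shortcut_target` is hypothetical. In actual training the target would be treated as a constant (no gradient flows through it).

```python
from typing import Callable

Vector = list[float]


def shortcut_target(f: Callable[[Vector, float, float], Vector],
                    x: Vector, tau: float, d: float) -> Vector:
    """Bootstrap target for step size d: the average of two half steps."""
    half = d / 2.0
    v1 = f(x, tau, half)                                 # first half step
    x_mid = [xi + vi * half for xi, vi in zip(x, v1)]    # point after it
    v2 = f(x_mid, tau + half, half)                      # second half step
    return [(a + b) / 2.0 for a, b in zip(v1, v2)]


if __name__ == "__main__":
    # Toy field whose velocity equals tau, independent of x and d.
    toy = lambda x, tau, d: [tau for _ in x]
    print(shortcut_target(toy, [0.0], tau=0.0, d=0.5))
```

Taking one step of size \(d\) with the averaged velocity lands at the same point as taking the two half steps in sequence, which is what lets the network learn to jump in a few large steps.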
