Dreamer v3

1. Problem
2. Representations
3. Results
4. World Model
5. Critic Model
6. Actor Model
7. Comparison
8. See also

arXiv 2023 - Mastering Diverse Domains through World Models [pdf]
Nature 2025 - Mastering diverse control tasks through world models [pdf]

Dreamer is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.

Dreamer v3 is a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration.

1. Problem

How can all the components (world model, critic, and actor) be made to work robustly with the same hyperparameters across diverse environments?
The challenges are:
- environments have different reward scales and observation scales,
- rewards can be dense or sparse,
- action spaces can be continuous or discrete.

2. Representations

Uses the same Recurrent State-Space Model as in Dreamer v2.

Latent dynamics model (World Model):

Transition Dynamics:
- Sequence Model (Recurrent Model): \(h_t = f_{\phi}(h_{t-1}, z_{t-1}, a_{t-1})\)
- Dynamics predictor (Transition Predictor): \(p_{\phi}(\hat z_t|h_t)\)
Latent representation:
- Encoder (Representation model): \(q_{\phi}(z_t| h_t, x_t)\)
- Decoder (Image predictor): \(p_{\phi}(\hat x_t| h_t, z_t)\)
Others:
- Reward model: \(p_{\phi}(r_t|h_t,z_t)\)
- Continuation Flag: \(p_{\phi}(\hat c_t|h_t, z_t)\). This is a Bernoulli model.
Representations are sampled from a vector of softmax distributions and straight-through gradient estimates are taken.
Behaviour model
- Actor: \(\pi_\theta (a_t | \hat z_t, h_t)\)
- Critic: \(v_\psi(R_t| \hat z_t, h_t)\)
The critic loss function uses the \(\lambda\)-return target, as in Dreamer v1.

The actor and critic also take \(h_t\) as input, rather than only \(z_t\) as in Dreamer v2.

The critic learns a distribution over returns instead of only the expected value as in Dreamer v2.

3. Results

Works well on diverse environments (Atari, ProcGen, DMLab, Minecraft, Atari100k, Proprio Control, Visual Control, BSuite) with a single set of hyperparameters.

Each technique (such as symexp twohot regression or return normalization) is critical on a subset of tasks but may not affect performance on the others. All of the techniques are therefore important for robustness across a variety of tasks.

4. World Model

Loss function for world model is a sum of the following:

Image reconstruction: \(-\ln p_{\phi}(x_t|h_t, z_t)\)

The target \(x_t\) is the symlog of the original image target, so the network does not need to predict large values. The input to the encoder/representation model \(q_{\phi}(z_t | h_t, x_t)\) is likewise the symlog of the original observation.
Reward: \(-\ln p_{\phi}(r_t| h_t, z_t)\)

Implemented as the cross entropy between the twohot¹ encoding of the reward \(r_t\) and the network's softmax output over those buckets.

\(-\mathrm{twohot}(r_t)^T \ln \mathrm{softmax}(f(h_t, z_t))\)
Continuation Flag: \(-\ln p_{\phi}(c_t|h_t, z_t)\)
Encoder and Dynamics predictor:
- Dynamics model: \(\max\left(1, \mathrm{KL} \left[ \mathrm{sg}(q_{\phi}(z_t| h_t, x_t))\ ||\ p_{\phi}(z_t | h_t) \right]\right)\)
- Representation model: \(0.1 \max\left(1, \mathrm{KL} \left[ q_{\phi}(z_t| h_t, x_t)\ ||\ \mathrm{sg}(p_{\phi}(z_t | h_t)) \right]\right)\)
Using different weights for the prior and posterior terms of the KL divergence is the KL balancing technique from Dreamer v2.

The \(\max(1, \cdot)\) clipping (free bits) disables a KL term once it falls below 1 nat ≈ 1.44 bits. When the dynamics are easy to learn, the dynamics model and representation model then do not need to match exactly, which leaves the representation model free to encode information useful for other objectives such as reconstruction and reward prediction. [See Page 5]
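The clipping and weighting of the two KL terms can be sketched in plain Python. This is a minimal illustration, not the reference implementation: it assumes a single categorical latent given as probability lists (the actual latent is a vector of categoricals), and the function names `kl_categorical` and `kl_losses` are made up for this example. Plain Python has no stop-gradient, so the sketch only shows the loss values; in an autodiff framework the two terms would differ in where \(sg(\cdot)\) is placed.

```python
import math

def kl_categorical(q, p):
    """KL[q || p] in nats for two categorical distributions (probability lists)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def kl_losses(posterior, prior, free_nats=1.0, dyn_weight=1.0, rep_weight=0.1):
    """Return (dynamics_loss, representation_loss) values.

    Both terms share the same KL value; they differ only in which argument
    would be wrapped in a stop-gradient, which plain Python cannot express.
    """
    kl = kl_categorical(posterior, prior)
    clipped = max(free_nats, kl)  # max(1, KL): constant below the threshold
    return dyn_weight * clipped, rep_weight * clipped

# Posterior close to the prior: KL is far below 1 nat, so the loss is clipped.
easy = kl_losses([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
# Posterior far from the prior: KL exceeds 1 nat and passes through unchanged.
hard = kl_losses([0.98, 0.01, 0.01], [0.01, 0.01, 0.98])
print(easy, hard)
print(1.0 / math.log(2.0))  # 1 nat expressed in bits
```

In the clipped case the loss is a constant, so it would contribute no gradient, which is the sense in which the term is "disabled".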

Tricks:

The dynamics predictor and encoder distributions are a mixture of 1% uniform and 99% neural network output, so the KL divergence loss is well behaved.
The output of the reward model is initialized to zero.
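A small sketch of the uniform mixing, under the assumption that it is applied to the softmax probabilities of each categorical; the helper names `softmax`, `unimix`, and `kl_categorical` are illustrative. It shows why the KL stays bounded: every class keeps at least `mix / k` probability, so the KL cannot exceed \(\ln(k / \text{mix})\).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unimix(probs, mix=0.01):
    """Blend categorical probabilities with a uniform distribution."""
    k = len(probs)
    return [(1.0 - mix) * p + mix / k for p in probs]

def kl_categorical(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

# Two near-deterministic distributions that disagree on the active class.
post = unimix(softmax([50.0, 0.0, 0.0, 0.0]))
prior = unimix(softmax([0.0, 50.0, 0.0, 0.0]))
print(min(prior))                   # at least mix / k = 0.0025
print(kl_categorical(post, prior))  # finite, below ln(k / mix) = ln(400)
```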

5. Critic Model

The critic model \(v_\psi(R_t| \hat z_t, h_t)\) gives a distribution over return values. The output of the critic network is the logits of a softmax distribution over exponentially spaced bins.

The loss function is \(-\ln p_\psi (R_t^\lambda|s_t)\), where \(s_t\) denotes the model state \((\hat z_t, h_t)\). This is the maximum likelihood loss with the \(\lambda\)-return as target. It is implemented like the reward loss, as the cross entropy between the twohot encoding of the return and the network's softmax output over the bins.
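The notes do not spell out how the \(\lambda\)-return target is computed. A minimal sketch, assuming the usual bootstrapped recursion \(R^\lambda_t = r_t + \gamma c_t\left((1-\lambda) v_{t+1} + \lambda R^\lambda_{t+1}\right)\) with \(R^\lambda_T = v_T\): the function name, the indexing convention, and the default `gamma` and `lam` values are assumptions made for illustration.

```python
def lambda_returns(rewards, continues, values, gamma=0.997, lam=0.95):
    """Compute lambda-returns backwards over a trajectory of length T.

    rewards[t], continues[t]: predicted reward and continuation flag at step t.
    values: critic estimates with T + 1 entries; values[T] bootstraps the end.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]  # R^lambda_T = v_T
    for t in reversed(range(horizon)):
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * continues[t] * blended
        returns[t] = next_return
    return returns

# With gamma = lam = 1 and zero values, the result is the plain reward-to-go.
print(lambda_returns([1.0, 2.0, 3.0], [1.0, 1.0, 1.0],
                     [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
# [6.0, 5.0, 3.0]
```

Setting `lam=0.0` reduces the target to a one-step bootstrap, and a continuation flag of 0 cuts off everything after that step.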

Other tricks used for training the critic model:

The output of the critic model is initialized to zero.
The critic is regularized towards the predictions of an exponential moving average of its own parameters.
The critic outputs a categorical distribution over exponentially spaced bins.
The critic loss is applied both to imagined trajectories and to trajectories sampled from the replay buffer.
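The EMA regularizer from the list above can be sketched as follows. This is only an illustration under stated assumptions: parameters are flat lists of floats, the regularizer is taken to be a cross entropy between the slow critic's bin distribution and the current critic's, and the names `ema_update`, `slow_critic_regularizer`, and the `decay` value are hypothetical.

```python
import math

def ema_update(slow_params, params, decay=0.98):
    """Move the slow copy a small step toward the current critic parameters."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(slow_params, params)]

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def slow_critic_regularizer(logits, slow_logits):
    """Cross entropy from the slow critic's bin distribution to the critic's."""
    slow_probs = [math.exp(lp) for lp in log_softmax(slow_logits)]
    return -sum(sp * lp for sp, lp in zip(slow_probs, log_softmax(logits)))

slow = ema_update([0.0, 0.0], [1.0, -1.0], decay=0.9)
print(slow)  # roughly [0.1, -0.1]
# The penalty is smallest when the critic agrees with its slow copy.
print(slow_critic_regularizer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(slow_critic_regularizer([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```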

6. Actor Model

Actor \(\pi_{\theta}(a_t|z_t, h_t)\) loss function:

Uses REINFORCE for both discrete and continuous actions

\(-\ln \pi_{\theta}(a_t|z_t,h_t)\, \mathrm{sg}(R_t^{\lambda} - v_{\psi}(z_t, h_t)) / \max(1, S)\)
Uses an entropy regularizer:

\(3\times 10^{-4} H {\left[ \pi_{\theta}(a_t| z_t, h_t) \right] }\)
Returns are scaled to lie approximately in [0, 1]. Because of the \(\max(1, S)\) in the denominator, returns whose range is already below 1 are not scaled up. The scale \(S\) is an exponential moving average of the range between the 5th and 95th percentiles of observed return values.

\(S = \mathrm{EMA}(\mathrm{Per}(R_t^{\lambda}, 95) - \mathrm{Per}(R_t^{\lambda}, 5),\ 0.99)\)
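A rough sketch of this return normalization. The class name `ReturnNormalizer`, the linear-interpolation percentile, and initializing the EMA from the first batch are assumptions made for the example; only the 5th/95th percentiles, the 0.99 decay, and the \(\max(1, S)\) denominator come from the formulas above.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

class ReturnNormalizer:
    def __init__(self, decay=0.99):
        self.decay = decay
        self.scale = None  # EMA of the 5th-to-95th percentile range, S

    def update(self, returns):
        spread = percentile(returns, 95) - percentile(returns, 5)
        if self.scale is None:
            self.scale = spread  # assumption: start from the first batch
        else:
            self.scale = self.decay * self.scale + (1.0 - self.decay) * spread
        return self.scale

    def normalize(self, advantage):
        # max(1, S): small return ranges are left as they are
        return advantage / max(1.0, self.scale)

large = ReturnNormalizer()
large.update([float(i) for i in range(101)])  # spread = 95 - 5 = 90
print(large.normalize(45.0))                  # 0.5

small = ReturnNormalizer()
small.update([0.0, 0.1, 0.2])                 # spread below 1
print(small.normalize(0.05))                  # 0.05, unchanged
```

The percentile range makes the scale insensitive to rare outlier returns, and the lower bound of 1 avoids amplifying noise when rewards are sparse and returns are near zero.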

7. Comparison

The changes in Dreamer v3 compared to the previous iteration are presented in the Appendix.

Robustness techniques: Observation symlog, KL balance and free bits, 1% unimix for all categoricals, percentile return normalization, symexp twohot loss for the reward head and critic.

Network Architecture: Block GRU, RMSNorm normalization, SiLU activation

Optimizer: Adaptive gradient clipping (AGC), LaProp (RMSProp followed by momentum)

Replay buffer: Larger capacity

8. See also

Footnotes:

twohot encoding: The reward is binned into exponentially spaced buckets. Twohot encoding generalizes onehot encoding to continuous values: a continuous value is encoded as a vector in which only two adjacent entries are nonzero. Those two entries are weights that sum to one, chosen so that the weighted sum of the two adjacent bin values equals the number being encoded.
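The encoding described in this footnote, together with the symlog transform and the cross entropy loss from the World Model section, can be sketched in plain Python. This is a hedged illustration: the bin count, the bin range, and the helper names (`make_bins`, `twohot`, `decode`, `twohot_loss`) are assumptions for the example, and the bins are taken to be the symexp of evenly spaced points, which is what makes them exponentially spaced.

```python
import bisect
import math

def symlog(x):
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x):
    return math.copysign(math.expm1(abs(x)), x)

def make_bins(num_bins=255, low=-20.0, high=20.0):
    """Bin values: symexp of evenly spaced points, so spacing grows exponentially."""
    step = (high - low) / (num_bins - 1)
    return [symexp(low + i * step) for i in range(num_bins)]

def twohot(value, bins):
    """Encode a scalar as weights on the two bins that bracket it."""
    encoding = [0.0] * len(bins)
    if value <= bins[0]:
        encoding[0] = 1.0
        return encoding
    if value >= bins[-1]:
        encoding[-1] = 1.0
        return encoding
    hi = bisect.bisect_right(bins, value)  # first bin strictly above value
    lo = hi - 1
    width = bins[hi] - bins[lo]
    encoding[lo] = (bins[hi] - value) / width
    encoding[hi] = (value - bins[lo]) / width
    return encoding

def decode(weights, bins):
    """Weighted sum of bin values; inverts twohot for in-range values."""
    return sum(w * b for w, b in zip(weights, bins))

def twohot_loss(logits, value, bins):
    """Cross entropy -twohot(value)^T log softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    target = twohot(value, bins)
    return -sum(t * (l - log_z) for t, l in zip(target, logits))

bins = make_bins(num_bins=5, low=-2.0, high=2.0)
code = twohot(3.0, bins)
print(code)                # two adjacent nonzero weights summing to one
print(decode(code, bins))  # recovers 3.0 up to rounding
print(twohot_loss([0.0] * 5, 3.0, bins))  # ln(5) for all-zero logits
```

With all-zero logits the softmax is uniform, and because the bins are symmetric around zero the decoded prediction is zero, which is consistent with initializing the reward and critic outputs to zero.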