Dreamer v2
Notes on paper Dreamer v2 (ICLR 2021) - Mastering Atari With Discrete World Models [pdf]
DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model.
1. Representations
Latent dynamics model (World Model):
Transition Dynamics (Using Recurrent State Space Model):
- Recurrent Model: \(h_t = f_{\phi}(h_{t-1}, z_{t-1}, a_{t-1})\)
- Transition Predictor: \(p_{\phi}(\hat z_t|h_t)\)
Latent representation:
- Representation model: \(q_{\phi}(z_t| h_t, x_t)\)
- Image predictor: \(p_{\phi}(\hat x_t| h_t, z_t)\) - Similar to reconstruction model in Dreamer v1
Others:
- Reward model: \(p_{\phi}(r_t|h_t,z_t)\)
- Discount predictor: \(p_{\phi}(\hat \gamma_t|h_t, z_t)\). This is a Bernoulli model whose target is \(\gamma = 0\) at terminal states and \(\gamma = 0.999\) at non-terminal states.
Behaviour model
- Actor: \(p_\psi (a_t | \hat z_t)\)
- Critic: \(v_\zeta(\hat z_t) \approx E_{p_{\phi}, p_{\psi}} {\left[ \sum_{\tau \ge t} \hat \gamma^{\tau-t} \hat r_{\tau} \right]}\)
The critic loss function uses the \(\lambda\)-target, the same as in Dreamer v1.
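As a rough illustration of the \(\lambda\)-target, the sketch below computes it with the usual backward recursion \(V_t^{\lambda} = \hat r_t + \hat \gamma_t \left[ (1-\lambda) v_\zeta(\hat z_{t+1}) + \lambda V_{t+1}^{\lambda} \right]\), bootstrapping from the critic at the last imagined step. The function name, the list-based interface and the default `lam=0.95` are assumptions made for this example, not code from the paper.

```python
def lambda_returns(rewards, discounts, values, lam=0.95):
    """Backward recursion for the lambda-target (illustrative sketch).

    rewards[t], discounts[t]: predicted reward and discount for t = 0..H-1
    values[t]: critic estimate for t = 0..H (one extra entry for bootstrapping)
    Returns a list of H + 1 targets; the last one equals values[H].
    """
    horizon = len(rewards)
    assert len(discounts) == horizon and len(values) == horizon + 1

    returns = [0.0] * (horizon + 1)
    returns[horizon] = values[horizon]  # bootstrap from the critic at t = H
    for t in reversed(range(horizon)):
        # Mix the one-step bootstrap with the longer lambda-return.
        mixed = (1.0 - lam) * values[t + 1] + lam * returns[t + 1]
        returns[t] = rewards[t] + discounts[t] * mixed
    return returns
```

With `lam=0` this reduces to the one-step target \(\hat r_t + \hat \gamma_t v_\zeta(\hat z_{t+1})\), and with `lam=1` to the discounted sum of rewards plus the bootstrapped final value.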
2. Changes
A summary of the modifications compared to Dreamer v1 is listed in Appendix C:
- Categorical variables for the latent state instead of the Gaussian latents used in Dreamer v1 [Section 2.1]: 32 variables with 32 classes each [Figure 2].
- REINFORCE for learning the policy. For the continuous control tasks that Dreamer v1 solved, they still use the gradients from the value estimate.
- KL Balancing
- Bigger model size
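To make the shape of the discrete latent concrete, here is a minimal sketch of sampling one latent state: 32 categorical variables with 32 classes each, one-hot encoded and concatenated into a 1024-dimensional sparse binary vector. Only the forward sampling is shown; the straight-through gradient used for training needs an autodiff framework and is left out. The helper name `sample_latent` and the plain-list representation are assumptions for this example.

```python
import random

NUM_VARIABLES = 32  # categorical variables per latent state
NUM_CLASSES = 32    # classes per variable


def sample_latent(probs, rng):
    """Sample one discrete latent from per-variable class probabilities.

    probs: NUM_VARIABLES rows, each a list of NUM_CLASSES probabilities.
    rng: a random.Random instance, passed in for reproducibility.
    Returns a flat list of NUM_VARIABLES * NUM_CLASSES zeros and ones.
    """
    latent = []
    for row in probs:
        # Draw one class index according to this variable's probabilities.
        index = rng.choices(range(len(row)), weights=row, k=1)[0]
        one_hot = [0.0] * len(row)
        one_hot[index] = 1.0
        latent.extend(one_hot)
    return latent


rng = random.Random(0)
uniform = [[1.0 / NUM_CLASSES] * NUM_CLASSES for _ in range(NUM_VARIABLES)]
z = sample_latent(uniform, rng)  # 1024 entries, exactly 32 of them equal to 1
```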
3. Ablation
[See Figure 2]
The following components play a significant role in Dreamer v2's performance. The number in brackets is the mean score when that component is removed, compared with a mean score of 0.25 for the full agent.
- (0.01) Gradients from image reconstruction for learning the latent state. Gradients from the reward model, by contrast, had little impact on learning the state.
- (0.15) The REINFORCE loss function is necessary for learning an unbiased policy. Using value estimate gradients didn't work for Atari.
- (0.16) KL Balancing
- (0.19) Discrete Latents
Limitations:
- Different policy update rules for Atari and continuous control.
4. Dynamics model
Loss function for world model is a sum of the following:
- Image reconstruction: \(-\ln p_{\phi}(x_t|h_t, z_t)\)
- Reward: \(-\ln p_{\phi}(r_t| h_t, z_t)\)
- Discount: \(-\ln p_{\phi}(\gamma_t|h_t, z_t)\)
- Transition dynamics: \(\beta KL \left[ q_{\phi}(z_t| h_t, x_t) || p_{\phi}(z_t | h_t) \right]\)
This is a KL loss in which the transition model is the prior and the representation model is the posterior; minimizing it pulls the two together. Learning the transition model is more difficult than learning the representation model, so we want to put more weight on moving the transition model towards the representation model than on the reverse. This can be done with KL balancing (\(\alpha = 0.8\)):
kl_loss = alpha * compute_kl(stop_grad(representation_posterior), transition_prior) + (1 - alpha) * compute_kl(representation_posterior, stop_grad(transition_prior))
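The effect of the two `stop_grad` calls can be checked without an autodiff library. The sketch below assumes a single categorical variable parameterized by logits and writes out the gradients by hand: the value of the balanced loss is just the ordinary KL, but the gradient reaching the prior is scaled by \(\alpha\) and the gradient reaching the posterior by \(1-\alpha\). The function names are made up for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(q, p):
    """KL[q || p] for two categorical distributions given as probabilities."""
    return sum(qi * (math.log(qi) - math.log(pi))
               for qi, pi in zip(q, p) if qi > 0.0)


def balanced_kl_grads(post_logits, prior_logits, alpha=0.8):
    """Value and hand-derived gradients of the KL-balanced loss.

    Returns (loss value, gradient w.r.t. posterior logits,
    gradient w.r.t. prior logits).
    """
    q = softmax(post_logits)   # representation model (posterior)
    p = softmax(prior_logits)  # transition predictor (prior)
    value = kl(q, p)           # alpha * KL + (1 - alpha) * KL == KL

    # Only the alpha term lets gradients reach the prior:
    # dKL/d prior_logit_j = p_j - q_j.
    grad_prior = [alpha * (pi - qi) for qi, pi in zip(q, p)]
    # Only the (1 - alpha) term lets gradients reach the posterior:
    # dKL/d post_logit_j = q_j * (log(q_j / p_j) - KL).
    grad_post = [(1.0 - alpha) * qi * (math.log(qi / pi) - value)
                 for qi, pi in zip(q, p)]
    return value, grad_post, grad_prior
```

With \(\alpha = 0.8\) the prior receives four times the gradient weight of the posterior, which matches the stated intent of training the transition model towards the representations faster than the reverse.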
5. Action model
There are three components to the loss function, where \(sg\) denotes stop-gradient:
- REINFORCE loss with the state value \(v_{\zeta}(\hat z_t)\) as baseline: \(-\rho \ln p_{\psi}(a_t|\hat z_t) \, sg(V_t^{\lambda} - v_{\zeta}(\hat z_t))\). Atari uses \(\rho=1\), which selects this term.
- Gradients of the \(\lambda\) value estimate, backpropagated through the discrete latent dynamics model and the discrete actions using straight-through gradients: \(- (1-\rho) V_t^{\lambda}\). Continuous control uses \(\rho=0\), which selects this term.
- Entropy regularizer: \(-\eta H[a_t| \hat z_t]\)
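The way \(\rho\) switches between the first two terms can be seen in a small per-step sketch of the loss value. It is only arithmetic on scalars: in a real implementation the advantage inside \(sg(\cdot)\) would be wrapped in the framework's stop-gradient, which changes gradients but not the value computed here. The function name and the numbers in the usage lines are illustrative assumptions.

```python
def actor_loss(log_prob, lambda_return, value, entropy, rho, eta):
    """Value of the per-step actor loss (illustrative sketch).

    log_prob: ln p(a_t | z_t) of the action that was taken
    lambda_return: the lambda-target V_t^lambda
    value: critic estimate v(z_t), used as the REINFORCE baseline
    entropy: entropy of the action distribution H[a_t | z_t]
    rho: 1.0 selects REINFORCE, 0.0 selects dynamics backpropagation
    eta: entropy regularization scale
    """
    # Treated as a constant (stop-gradient) when differentiating.
    advantage = lambda_return - value
    reinforce_term = -rho * log_prob * advantage
    dynamics_term = -(1.0 - rho) * lambda_return
    entropy_term = -eta * entropy
    return reinforce_term + dynamics_term + entropy_term


# rho = 1: only the REINFORCE and entropy terms contribute.
atari_loss = actor_loss(-0.5, 2.0, 1.5, 1.0, rho=1.0, eta=0.001)
# rho = 0: only the dynamics-backprop and entropy terms contribute.
control_loss = actor_loss(-0.5, 2.0, 1.5, 1.0, rho=0.0, eta=0.001)
```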