2025-07-15

DQN

Table of Contents

1. Rainbow DQN

Paper: Rainbow: Combining Improvements in Deep Reinforcement Learning [pdf]

Various extensions to DQN have been proposed and implemented. This paper combines six of them and performs an ablation study.

1.1. Extensions

  1. Double Q-Learning: Decouple selection of the best action from estimation of its Q value: the online network picks the action, the target network evaluates it (see the sketches after this list).

    \[ \left( R_{t+1} + \gamma_{t+1} q_{\bar{\theta}} (S_{t+1}, \mathop{argmax}_{a'} q_{\theta}(S_{t+1}, a')) - q_{\theta}(S_{t}, A_t) \right)^2 \]

    where,

    • \(q_{\theta}\) is the online network being optimized
    • \(q_{\bar{\theta}}\) is the target network (a periodic copy of the online network)
  2. Prioritized Replay: Sample transitions with probability proportional 1 to the training loss (the absolute TD error).
  3. Dueling Network: Express the Q value in terms of a state-value function and an advantage function.
  4. Multi-Step Learning: Train on truncated n-step returns instead of single-step targets.
  5. Distributional RL: Keep track of the distribution of returns for each action instead of only its average return (i.e. the Q value).

    The learning objective is to match the target distribution of returns rather than its expectation.

  6. Noisy Nets: Add learnable noise to the Q network's linear layers, so \(\epsilon\)-greedy exploration is not required. The noise drives state-conditional exploration, and over time the network learns to scale it down, giving a form of self-annealing.
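
A minimal sketch of the double-Q target above (PyTorch-style, not the paper's code), assuming hypothetical online_net and target_net modules that map a batch of states to per-action Q values:

  import torch

  def double_q_target(reward, gamma, next_state, online_net, target_net):
      with torch.no_grad():
          # Online network selects the greedy next action ...
          best_action = online_net(next_state).argmax(dim=1, keepdim=True)
          # ... target network evaluates that action.
          next_q = target_net(next_state).gather(1, best_action).squeeze(1)
      # The squared difference to q_theta(S_t, A_t) gives the loss above.
      return reward + gamma * next_q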
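
A sketch of priority-proportional sampling, assuming the buffer keeps a tensor of recent absolute TD errors; the sum-tree bookkeeping and importance-sampling corrections used in practice are omitted, and omega is the exponent from footnote 1:

  import torch

  def sample_indices(td_errors, batch_size, omega=0.5, eps=1e-6):
      # Priority = (|TD error| + eps) ^ omega; eps keeps every transition sampleable.
      priorities = (td_errors.abs() + eps) ** omega
      probs = priorities / priorities.sum()
      # Indices are drawn with probability proportional to priority.
      return torch.multinomial(probs, batch_size, replacement=True)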
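
A sketch of a dueling head sitting on top of an assumed shared feature extractor; subtracting the mean advantage keeps the value/advantage decomposition identifiable:

  import torch.nn as nn

  class DuelingHead(nn.Module):
      def __init__(self, feature_dim, n_actions):
          super().__init__()
          self.value = nn.Linear(feature_dim, 1)               # V(s)
          self.advantage = nn.Linear(feature_dim, n_actions)   # A(s, a)

      def forward(self, features):
          v = self.value(features)
          a = self.advantage(features)
          # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
          return v + a - a.mean(dim=1, keepdim=True)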
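
A sketch of the truncated n-step return used as the regression target, assuming the next n rewards and a bootstrap value taken from the target network at step t+n:

  def n_step_return(rewards, gamma, bootstrap_value):
      # rewards = [R_{t+1}, ..., R_{t+n}], bootstrap_value = q(S_{t+n}, a*).
      g = bootstrap_value
      for r in reversed(rewards):
          g = r + gamma * g
      # g = sum_k gamma^k R_{t+k+1} + gamma^n * bootstrap_value
      return g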
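
In the categorical formulation the paper builds on (C51), the return distribution is a set of probabilities \(p_{\theta}(S_t, A_t)\) over a fixed support of atoms \(z\), and the objective can be written as a KL divergence to the projected target distribution:

\[ D_{\mathrm{KL}} \left( \Phi_z \left( R_{t+1} + \gamma_{t+1} z,\; p_{\bar{\theta}}(S_{t+1}, a^{*}_{t+1}) \right) \,\middle\|\, p_{\theta}(S_t, A_t) \right) \]

where \(\Phi_z\) projects the shifted atoms back onto the fixed support and \(a^{*}_{t+1}\) is the greedy next action.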
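
A sketch of a noisy linear layer; the paper uses factorised Gaussian noise, whereas this simplified version resamples independent Gaussian noise on every forward pass (the initialisation constants are illustrative):

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class NoisyLinear(nn.Module):
      def __init__(self, in_dim, out_dim, sigma0=0.5):
          super().__init__()
          bound = in_dim ** -0.5
          self.mu_w = nn.Parameter(torch.empty(out_dim, in_dim).uniform_(-bound, bound))
          self.sigma_w = nn.Parameter(torch.full((out_dim, in_dim), sigma0 * bound))
          self.mu_b = nn.Parameter(torch.zeros(out_dim))
          self.sigma_b = nn.Parameter(torch.full((out_dim,), sigma0 * bound))

      def forward(self, x):
          eps_w = torch.randn_like(self.sigma_w)
          eps_b = torch.randn_like(self.sigma_b)
          # The sigmas are learned, so the network can shrink the noise over
          # training -- the self-annealing behaviour mentioned above.
          return F.linear(x, self.mu_w + self.sigma_w * eps_w,
                          self.mu_b + self.sigma_b * eps_b)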

1.2. Results

Rainbow DQN gave very good results: it learned from far fewer samples and also reached a much higher final reward.

The ablation study showed that the following techniques contributed the most (in decreasing order of importance):

  • Prioritized Replay: Removing this component caused a large drop in performance across all games.
  • Multi-Step Learning: Removing this component caused a large drop in performance across all games.
  • Distributional Q-Learning: Didn't matter much early in training, but in the later stages, once performance is near or above human level, agents lag behind without it.
  • Noisy Nets: Many games benefited from this technique, but some were negatively affected.

The other techniques didn't change the overall results much:

  • Dueling: For some games its impact was significant, but median performance across games didn't change much.
  • Double Q-Learning: The distributional value estimates are clipped to the distribution's support of \([-10, 10]\), which already limits the overestimation that double Q-learning corrects, so removing it made little difference. With a wider support range it might become important.

Footnotes:

1

Proportional to the loss value raised to a power \(\omega\) (a hyperparameter).

