Reinforcement Learning
Figure 1: A Taxonomy of RL Algorithms [from spinningup.openai.com]
1. Resources
Lectures:
- CS 285: Deep Reinforcement Learning - Sergey Levine: https://rail.eecs.berkeley.edu/deeprlcourse/
- CS 234: Reinforcement Learning - Emma Brunskill: https://www.youtube.com/playlist?list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u
Books:
- Reinforcement Learning: An Introduction by Sutton and Barto is still the introductory book that establishes the foundation for learning RL.
- Distributional Reinforcement Learning by Bellemare, Dabney & Rowland is a specialized book that looks at RL from a new perspective and advances its theory.
- Deep Reinforcement Learning Hands-On by Maxim Lapan (2024) takes a practical, implementation-focused approach.
- Multi-Agent Reinforcement Learning: Foundations and Modern Approaches by Albrecht, Christianos & Schafer (2024) gives a comprehensive introduction to MARL.
Software:
- Stable Baselines - a collection of RL algorithm implementations
Articles:
- A guide for learning Deep RL by OpenAI: spinningup.openai.com
- Debugging RL Algorithms (andyljones.com) covers:
- Probe Environments
- Probe Agents
- Log excessively
- Use really large batch sizes
- Rescale your rewards
- A collection of pseudocode for RL algorithms, with notes on which to use when: datadriveninvestor.com
- A collection of RL algorithms and their comparison: jonathan-hui.medium.com
- List of theses in RL: reddit.com
- A (Long) Peek into Reinforcement Learning: lilianweng.github.io
Podcasts:
- TalkRL: podcasts.apple.com
2. Multi-Agent Reinforcement Learning
3. Papers
- Learning Multi-agent Implicit Communication Through Actions: A Case Study in Contract Bridge, a Collaborative Imperfect-Information Game. [pdf]
- Learning Decision Trees With RL [pdf]
Trains an RNN with RL to decide which feature the decision tree should split on next. Performs better than the greedy strategy, because greedy strategies only look at immediate information gain, whereas RL can be trained for long-term payoff.
- Intention Conditioned Value Function (ICVF)
- Solving Offline Reinforcement Learning with Decision Tree Regression
- Goal-Conditioned Supervised Learning
- Reward Conditioned Policies
- The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games
4. Concepts
4.2. True Gradient Descent
Is DQN true gradient descent? No, it is only an approximation to it. CS234 - Lecture 6 (t=3253)
The update is not the derivative of any loss function: the TD target depends on the same parameters being updated, but it is held constant during differentiation, so the update follows only part of the true gradient (a "semi-gradient").
- GTD (Gradient Temporal Difference) methods are true gradient descent.
- The first section of the chapter on function approximation in Sutton & Barto makes a few points on this.
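A minimal NumPy sketch of this point, assuming linear value approximation v(s) = w·φ(s) (function and variable names are my own):

```python
import numpy as np

def semi_gradient_td0_update(w, phi_s, phi_s_next, r, gamma, alpha):
    """One semi-gradient TD(0) step with linear values v(s) = w @ phi(s).

    The target r + gamma * v(s') also depends on w, but it is treated
    as a constant during differentiation -- hence "semi"-gradient, not
    true gradient descent on any fixed loss (Sutton & Barto, Ch. 9).
    """
    td_error = r + gamma * (w @ phi_s_next) - (w @ phi_s)
    # A true gradient of 0.5 * td_error**2 would also include a
    # -gamma * phi_s_next term from the target; the semi-gradient drops it.
    return w + alpha * td_error * phi_s
```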
4.3. DQN
- Huber Loss on Bellman Error https://youtu.be/gOV8-bC1_KU?t=4756
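A hedged sketch of the Huber loss applied to the TD/Bellman error, as popularized by the DQN paper (names and the delta default are my own choices):

```python
import numpy as np

def huber_loss(td_error, delta=1.0):
    """Huber loss on the Bellman/TD error.

    Quadratic for |error| <= delta and linear beyond, which caps the
    gradient magnitude at delta and keeps large TD errors from
    destabilizing training.
    """
    abs_err = np.abs(td_error)
    quadratic = 0.5 * td_error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)
```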
4.3.1. Replay Buffer
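No notes here yet; a minimal sketch of a uniform replay buffer (all names are my own) to record the core idea: store transitions and sample them at random to break the temporal correlation between consecutive updates.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the minibatch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```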
4.3.2. Stable Target
4.3.3. Prioritized Replay Buffer
4.3.4. Double DQN
4.3.5. Dueling DQN
4.3.6. Maximization Bias
- Double DQN and a stable (periodically frozen) target network help prevent maximization bias; see the sketch below.
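A sketch contrasting the vanilla DQN target with the Double DQN target (NumPy, batched Q-value arrays of shape (batch, n_actions); names are my own). Selecting the argmax with the online network but evaluating it with the target network decouples action selection from evaluation, which is what reduces maximization bias:

```python
import numpy as np

def dqn_target(r, gamma, q_target_next, done):
    """Vanilla DQN target: the same max both selects and evaluates an
    action, so estimation noise inflates the target (maximization bias)."""
    return r + gamma * (1 - done) * q_target_next.max(axis=1)

def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    """Double DQN target: the online network picks the action, the
    stable target network scores it."""
    best_a = q_online_next.argmax(axis=1)
    q_eval = q_target_next[np.arange(len(best_a)), best_a]
    return r + gamma * (1 - done) * q_eval
```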
4.4. Actions are Infinite
From Assumptions of Decision Making Models in AGI:
It is unreasonable to assume that at any state, all possible actions are listed.
Actions at a small scale may be discrete or a finite collection of distributions, but at the level where planning happens, the set of all possible actions is infinite.
Such actions can in principle be thought of as recursively composed from a set of basic operations/actions. But decision making happens not over those basic actions, rather at the level of the composed actions.
- It is a task in itself to know which actions can be taken and which actions we should evaluate.
Thus decision making involves composing short-timestep actions into longer-term actions over which planning can be done.
i.e. decision making is often not about selection but selective composition. [Page 2]
So, one thing to explore would be Hierarchical Reinforcement Learning.
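One standard formalization of such "selective composition" is the options framework (Sutton, Precup & Singh); a minimal sketch under my own assumed names and environment interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """A temporally extended action composed of primitive actions;
    planning then selects among options rather than primitives."""
    can_start: Callable    # initiation set I(s) -> bool
    policy: Callable       # intra-option policy pi(s) -> a
    should_stop: Callable  # termination condition beta(s) -> bool

def run_option(env_step, state, option, max_steps=100):
    """Execute one option to termination (env_step is a hypothetical
    (state, action) -> (state, reward, done) interface). The result is
    a single composed action from the planner's point of view."""
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, done = env_step(state, option.policy(state))
        total_reward += reward
        if done or option.should_stop(state):
            break
    return state, total_reward
```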
4.5. RL Without Rewards
What does this even mean?
Is it even RL if there are no rewards? So this must be limited to improving performance on downstream tasks that do have rewards.
What is the utility?
- helps when the reward is sparse
- skills developed without rewards can act as primitives for hierarchical RL
Work done so far:
- ICVF
- Diversity Is All You Need (DIAYN); see the sketch below.
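DIAYN's skill-discovery pseudo-reward is log q(z|s) - log p(z): the agent is rewarded for reaching states from which a learned discriminator can tell which skill z is running. A minimal sketch, assuming the discriminator's output probabilities are given (names are my own):

```python
import numpy as np

def diayn_reward(discriminator_probs, skill, n_skills):
    """DIAYN pseudo-reward: log q(z|s) - log p(z), with a uniform
    prior p(z). No task reward is used; distinguishable skills
    emerge from maximizing diversity alone."""
    log_q = np.log(discriminator_probs[skill] + 1e-8)
    log_p = np.log(1.0 / n_skills)
    return log_q - log_p
```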