Deep Reinforcement Learning: CS 285 Fall 2021
Table of Contents
1. Lecture 1
2. L1P1: Introduction
Figure 1: How is RL different from other ML
RL is different
- because the data is not i.i.d.
- because the ground truth answer is not known
RL is not just for games and robots.
- traffic control (by Cathy Wu) 00:09:15
3. L1P2: Why Deep RL?
- because intelligent agents need to adapt 00:00:40
- deep learning helps us handle unstructured environments 00:02:19
- RL provides formalism for behavior 00:03:23
Figure 2: Deep RL allows End-to-End Learning
- the recognition part and the control part of the problem can work together 00:07:49
Why study this now? 00:12:03
- advancement in RL
- advancement in Deep Learning
- computation power
Figure 3: RL is not new
4. L1P3: Beyond learning from Reward
Maximizing rewards is not the only problem that matters for sequential decision making: 00:00:59
- Inverse RL: Learning Reward Functions from examples
- Transfer Learning, Meta-Learning: Transferring knowledge between domains
- Learning to predict and using prediction to act
00:03:13 Where do rewards come from?
As human agents, we are accustomed to operating with rewards that are so sparse that we only experience them once or twice in a lifetime, if at all.
Figure 4: Source of Rewards
Other forms of supervision:
- Learning from demonstration
- Learning from observing the world
- Learning to predict (00:07:03 model based RL)
- Unsupervised learning
- Learning from other tasks
- Transfer learning
- Meta-learning: learning to learn
5. L1P4: How do we build intelligent machines?
Ideas
- Learning as the basis of intelligence
- An algorithm for each "module" of the brain, or a single flexible algorithm? 00:03:55
- The brain is flexible: seeing with your tongue (humans can learn to see, to some extent, using the tongue 00:04:38)
Some evidence in favour of deep learning
- Deep RL learns features similar to those found in the brain for touch and vision 00:07:20
- RL is observed in the brain (basal ganglia ~= reward system) 00:08:00
- High proficiency in domains with simple rules (Go, Atari)
- Can learn simple skills with raw sensory inputs (robots)
- Can learn from imitating human-provided expert behavior (driving)
But humans are still better, and open problems remain:
- can learn incredibly quickly,
- can reuse past knowledge
- not clear what the reward function should be
- not clear what the role of prediction should be 00:10:19
6. Lecture 2
6.1. L2P1: Supervised Learning of Behaviours
- In RL we deal with sequential decision making problems.
- We can predict actions in place of classes (as in a supervised classification problem)
- \(s_t\): state (Markovian)
- \(o_t\): observation (can be incomplete)
- 00:08:36 If you are working from observations, past observations may give you extra information, because each observation is incomplete. Only the state is Markovian; observations need not be.
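As a side note, the Markov property can be written out explicitly. This restates the point above in standard notation; it is not copied from the slides:

```latex
% The state is Markovian: the future is independent of the past given the present.
\[
  p(s_{t+1} \mid s_t, a_t) \;=\; p(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t)
\]
% Observations are generally not Markovian, so in general
\[
  p(o_{t+1} \mid o_t, a_t) \;\neq\; p(o_{t+1} \mid o_1, a_1, \dots, o_t, a_t),
\]
% which is why conditioning the policy on past observations can help.
```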
00:10:18 Imitation Learning or behavioral cloning
- ALVINN (Autonomous Land Vehicle In a Neural Network) was one of the first imitation learning systems for autonomous driving 00:12:32
- In general, naive imitation learning does not work well, even though supervised learning works in other problems. The reason is that the policy can deviate slightly from the training trajectory, and in the resulting unfamiliar states it has a higher chance of making further mistakes, so errors compound. A minimal training sketch follows the figure below.
Figure 5: Problem with Imitation Learning
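For concreteness, here is a minimal behavioral-cloning training loop in PyTorch. This is my own sketch, not the course code: the network sizes, hyperparameters, and the random tensors standing in for expert data are all placeholders.

```python
# Behavioral cloning: fit pi_theta(a_t | o_t) to expert (observation, action)
# pairs with plain supervised regression. Continuous actions assumed.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

obs_dim, act_dim = 11, 3                      # placeholder dimensions

policy = nn.Sequential(                       # deterministic policy head
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-ins for real expert data: (N, obs_dim) observations, (N, act_dim) actions.
expert_obs = torch.randn(10_000, obs_dim)
expert_acts = torch.randn(10_000, act_dim)
loader = DataLoader(TensorDataset(expert_obs, expert_acts),
                    batch_size=256, shuffle=True)

for epoch in range(20):
    for obs, act in loader:
        loss = ((policy(obs) - act) ** 2).mean()   # MSE to expert actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```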
But with a lot of data it does work 00:14:50. In this particular case (NVIDIA's autonomous driving system) it worked because of techniques used to collect the training data:
00:16:59 The general principle is to modify your training data so that it illustrates mistakes and shows how to correct them; then the policy can learn to be more robust.
00:18:43 The underlying problem with imitation learning is distribution shift: the observations the policy encounters at test time are drawn from \(p_{\pi}(o_t)\), which is different from the training distribution \(p_{data}(o_t)\).
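Written out with the usual maximum-likelihood objective (standard notation, not copied from the slides): training fits the policy under the expert's observation distribution, but at test time observations come from the policy's own distribution.

```latex
\[
  \theta^{\star} \;=\; \arg\max_{\theta}\;
  \mathbb{E}_{(o_t, a_t) \sim p_{data}}\big[\log \pi_{\theta}(a_t \mid o_t)\big],
  \qquad \text{but at test time } o_t \sim p_{\pi_{\theta}}(o_t) \neq p_{data}(o_t).
\]
```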
So, can we make those distributions closer?
- Yes, if we make the policy perfect, but that is difficult.
Instead, make the training data distribution closer to the test distribution.
DAgger (Dataset Aggregation) 00:20:29: collect training data from \(p_{\pi}(o_t)\) instead (see the sketch after Figure 6):
- run the policy
- label the data
- train with aggregated data
- repeat
But data labeling by the expert is still difficult, so DAgger is also not seen much in practice.
Figure 6: DAgger: Dataset Aggregation
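A schematic of the DAgger loop, assuming hypothetical helpers: `env` (a gym-style environment), `expert_label` (a human or expert policy that labels an observation with an action), and `train_policy` (a supervised training step like the behavioral-cloning sketch above). None of these names come from the course.

```python
# DAgger sketch: grow the dataset with observations visited by the *current*
# policy, label them with expert actions, retrain, and repeat.
def dagger(env, policy, expert_label, train_policy,
           n_iters=10, steps_per_iter=1000):
    dataset_obs, dataset_acts = [], []
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(steps_per_iter):
            # 1. Run the current policy so observations come from p_pi(o_t).
            action = policy(obs)
            # 2. Ask the expert what it would have done in this observation.
            dataset_obs.append(obs)
            dataset_acts.append(expert_label(obs))
            obs, _, done, _ = env.step(action)   # gym-style step assumed
            if done:
                obs = env.reset()
        # 3. Train with supervised learning on the aggregated dataset.
        policy = train_policy(dataset_obs, dataset_acts)
        # 4. Repeat via the outer loop.
    return policy
```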