Deep Reinforcement Learning: CS 285 Fall 2021
1. Lecture 1
2. L1P1: Introduction
Figure 1: How is RL different from other ML
RL is different
- because the data is not i.i.d.
- because the ground truth answer is not known
RL is not just for games and robots.
- traffic control (by Cathy Wu) 00:09:15
3. L1P2: Why Deep RL?
- because intelligent agents need to adapt 00:00:40
- deep learning helps us handle unstructured environments 00:02:19
- RL provides a formalism for behavior 00:03:23
Figure 2: Deep RL allows End-to-End Learning
- The recognition part of the problem and the control part of the problem can work together 00:07:49
Why study this now? 00:12:03
- advances in RL
- advances in deep learning
- computation power
Figure 3: RL is not new
4. L1P3: Beyond learning from Reward
Maximizing rewards is not the only problem that matters for sequential decision making: 00:00:59
- Inverse RL: Learning Reward Functions from examples
- Transfer Learning, Meta-Learning: Transferring knowledge between domains
- Learning to predict and using prediction to act
00:03:13 Where do rewards come from?
As human agents, we are accustomed to operating with rewards that are so sparse that we only experience them once or twice in a lifetime, if at all.
Figure 4: Source of Rewards
Other forms of supervision:
- Learning from demonstration
- Learning from observing the world
- Learning to predict (00:07:03 model based RL)
- Unsupervised learning
- Learning from other tasks
- Transfer learning
- Meta-learning: learning to learn
5. L1P4: How do we build intelligent machines?
Ideas
- Learning as the basis of intelligence
- An algorithm for each "module" of the brain, or a single flexible algorithm? 00:03:55
- The brain is flexible: seeing with your tongue (humans can learn to "see" to some extent using the tongue 00:04:38)
Some evidence in favour of deep learning
- Deep RL learns features similar to those found in the brain for touch and vision 00:07:20
- RL is observed in the brain (basal ganglia ~= reward system) 00:08:00
- High proficiency in domains with simple rules (Go, Atari)
- Learn simple skills with raw sensory inputs (robots)
- Learn from imitating human-provided expert behavior (driving)
But humans are still better:
- humans can learn incredibly quickly
- humans can reuse past knowledge
- it is not clear what the reward function should be
- it is not clear what the role of prediction should be 00:10:19
6. Lecture 2
6.1. L2P1: Supervised Learning of Behaviours
- In RL we deal with sequential decision-making problems.
- We can predict actions in place of class labels (as in a supervised classification problem).
- \(s_t\): state (Markovian)
- \(o_t\): observation (can be incomplete)
- 00:08:36 If you work from observations, past observations may give you extra information, because each observation is incomplete. Only the state is Markovian; the observation need not be (see the formulas below).
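To make the distinction concrete (this formulation is mine, following the lecture's \(s_t\), \(o_t\), \(a_t\) notation): the state satisfies the Markov property, while the observation in general does not:

\[
p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t),
\qquad
p(o_{t+1} \mid o_t, a_t) \neq p(o_{t+1} \mid o_1, a_1, \ldots, o_t, a_t) \text{ in general}.
\]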
00:10:18 Imitation Learning or behavioral cloning
- ALVINN (Autonomous Land Vehicle In a Neural Network) was one of the first imitation learning systems for autonomous vehicles 00:12:32
- In general, imitation learning does not work well, even though supervised learning works in other problems. The policy can deviate slightly from the training trajectory, and in the resulting unfamiliar states it is more likely to make further mistakes, so the errors compound (see the training sketch after Figure 5).
Figure 5: Problem with Imitation Learning
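Behavioral cloning, as introduced above, is just supervised learning on expert (observation, action) pairs. Here is a minimal sketch in PyTorch; the network shape, tensors, and hyperparameters are my own placeholders, not from the lecture:

```python
import torch
import torch.nn as nn

# Hypothetical demonstration data standing in for expert trajectories:
# observations o_t drawn from p_data(o_t) and the expert's actions a_t.
obs = torch.randn(1000, 10)            # 1000 observations, 10-dimensional (made up)
expert_actions = torch.randn(1000, 2)  # matching 2-dimensional continuous actions (made up)

# A simple policy pi_theta(a_t | o_t): a small MLP regressing actions from observations.
policy = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # continuous actions; a discrete action space would use cross-entropy

# Behavioral cloning is ordinary supervised learning on (o_t, a_t) pairs.
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(policy(obs), expert_actions)
    loss.backward()
    optimizer.step()
```

Training like this only matches the expert on \(p_{data}(o_t)\); nothing in it addresses what happens once the policy drifts off that distribution, which is exactly the problem described above.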
00:14:50 But with a lot of data it can work, as in NVIDIA's autonomous driving system, because of techniques used in collecting the training data:
00:16:59 The general principle: modify your training data to illustrate mistakes and show how to correct them; then the policy can learn to be more robust.
00:18:43 The problem with imitation learning is that the training data distribution differs from the test distribution: \(p_{\pi}(o_t)\) is different from \(p_{data}(o_t)\).
So, can we make those distributions closer?
- Yes, if we make the policy perfect. But that is difficult.
- Instead, make the training data distribution closer to \(p_{\pi}(o_t)\), the distribution the policy actually sees.
DAgger: Dataset Aggregation 00:20:29: collect training data from \(p_{\pi}(o_t)\) instead (see the sketch after Figure 6):
- run the policy to collect observations
- have the expert label the collected observations with actions
- train on the aggregated dataset
- repeat
But the labeling step still requires a human (or other expert) to provide actions, which is difficult. For this reason DAgger is also not seen much in practice.
Figure 6: DAgger: Dataset Aggregation
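Below is a minimal sketch of the DAgger loop referenced above. The helper names (train_policy, run_policy, expert_label, initial_expert_data) are hypothetical placeholders, not part of any library or the lecture:

```python
# Minimal DAgger loop sketch. All helpers here are hypothetical placeholders:
# train_policy fits pi_theta on (observation, action) pairs, run_policy rolls
# out the policy and returns the observations it visits, and expert_label asks
# the expert (e.g. a human) for the correct action in a given observation.

def dagger(initial_expert_data, train_policy, run_policy, expert_label, n_iters=10):
    """initial_expert_data: list of (observation, expert_action) pairs from p_data(o_t)."""
    dataset = list(initial_expert_data)
    policy = train_policy(dataset)            # train pi_theta on the expert data first
    for _ in range(n_iters):
        observations = run_policy(policy)     # 1. run the policy: observations ~ p_pi(o_t)
        labeled = [(o, expert_label(o)) for o in observations]  # 2. expert labels the data
        dataset.extend(labeled)               # 3. aggregate datasets
        policy = train_policy(dataset)        # 4. retrain and repeat
    return policy
```

Step 2 (expert_label) is where the human effort goes, which is why the note above says DAgger is rarely seen in practice.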