2023-07-09

Deep Reinforcement Learning: CS 285 Fall 2021

by Sergey Levine

1. Lecture 1

2. L1P1: Introduction

how_is_rl_different_from_other_ml-20230709120854.png

Figure 1: How is RL different from other ML

RL is different from other ML:

  • the data is not i.i.d.
  • the ground-truth answer is not known

RL is not just for games and robots.

  • traffic control (by Cathy Wu) 00:09:15

3. L1P2: Why Deep RL?

  • because intelligent agents need to adapt 00:00:40
  • deep learning helps us handle unstructured environments 00:02:19
  • RL provides a formalism for behavior 00:03:23

deep_rl_allow_end_to_end_learning-20230709121659.png

Figure 2: Deep RL allows End-to-End Learning

  • The recognition part of the problem and the control part of the problem can work together 00:07:49

Why study this now? 00:12:03

  • advances in RL
  • advances in deep learning
  • computational power

rl_is_not_new-20230709122018.png

Figure 3: RL is not new

4. L1P3: Beyond learning from Reward

Maximizing rewards is not the only problem that matters for sequential decision making: 00:00:59

  • Inverse RL: Learning Reward Functions from examples
  • Transfer Learning, Meta-Learning: Transferring knowledge between domains
  • Learning to predict and using prediction to act

00:03:13 Where do rewards come from?

As human agents, we are accustomed to operating with rewards that are so sparse that we only experience them once or twice in a lifetime, if at all.

source_of_rewards-20230709122525.png

Figure 4: Source of Rewards

Other forms of supervision:

  • Learning from demonstration
  • Learning from observing the world
    • Learning to predict (00:07:03 model based RL)
    • Unsupervised learning
  • Learning from other tasks
    • Transfer learning
    • Meta-learning: learning to learn

5. L1P4: How do we build intelligent machines?

Ideas

Some evidence in favour of deep RL

  • Deep RL learns features similar to those found in the brain for touch and vision 00:07:20
  • RL-like learning is observed in the brain (basal ganglia ≈ reward system) 00:08:00

What can RL do well now?

  • High proficiency in domains with simple rules (Go, Atari)
  • Learn simple skills with raw sensory inputs (robotics)
  • Learn from imitating human-provided expert behavior (driving)

But humans are still better, and open problems remain:

  • humans can learn incredibly quickly
  • humans can reuse past knowledge
  • it is not clear what the reward function should be
  • it is not clear what the role of prediction should be 00:10:19

6. Lecture 2

6.1. L2P1: Supervised Learning of Behaviours

  • In RL we deal with sequential decision-making problems.
  • We can predict actions in place of class labels (as in a supervised classification problem).
  • \(s_t\): state (Markovian)
  • \(o_t\): observation (can be incomplete)
  • 00:08:36 If you are working with observations, past observations may give you extra information, because each observation is incomplete. Only the state is Markovian; observations need not be.
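A compact way to write this (standard notation consistent with the definitions above, not a formula copied from the slides): the policy is \(\pi_{\theta}(a_t \mid o_t)\) (or \(\pi_{\theta}(a_t \mid s_t)\) in the fully observed case), and the Markov property of the state says

\[ p(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \]

i.e. the current state and action are sufficient to predict the next state; no such guarantee holds for \(o_t\).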

00:10:18 Imitation Learning or behavioral cloning

  • ALVINN (Autonomous Land Vehicle In a Neural Network) was one of the first imitation learning systems for autonomous vehicles 00:12:32
  • In general, imitation learning doesn't work well, even though supervised learning works in other problems. This is because the policy can deviate slightly from the training trajectory, and in the new states it reaches it has a higher chance of making further mistakes, so errors compound. (A minimal behavioral cloning sketch is given after Figure 5.)

problem_with_imitation_learning-20230710082620.png

Figure 5: Problem with Imitation Learning
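For concreteness, here is a minimal behavioral cloning sketch in PyTorch, i.e. the plain supervised-learning approach described above that suffers from this compounding-error problem. The network sizes, dimensions, and the randomly generated "expert" data are hypothetical placeholders, not anything from the course.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

obs_dim, act_dim = 32, 4  # hypothetical dimensions

# Policy: a small MLP mapping observations o_t to (continuous) actions a_t.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, act_dim),
)

# Expert demonstrations: (o_t, a_t) pairs; random stand-ins here.
expert_obs = torch.randn(1000, obs_dim)
expert_act = torch.randn(1000, act_dim)
loader = DataLoader(TensorDataset(expert_obs, expert_act),
                    batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # cross-entropy would be used for discrete actions

for epoch in range(10):
    for obs, act in loader:
        loss = loss_fn(policy(obs), act)  # regress onto the expert's actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```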

  • but with a lot of data, it 00:14:50 can work. In this particular case (NVIDIA's autonomous driving system) it worked because of the techniques used in training, e.g. adding left- and right-facing cameras whose images are labeled with corrective steering commands.

    00:16:59 The general principle is to modify your training data so that it illustrates mistakes and shows how to correct them; then the policy can learn to be more robust.

  • 00:18:43 The problem with imitation learning is that the training data distribution is different from the test distribution: \(p_{data}(o_t)\) differs from \(p_{\pi}(o_t)\), the distribution of observations the policy itself encounters.

    So, can we make those distributions closer?

    • Yes, by making the policy perfect, but that's difficult.
    • Instead, make the training data distribution match the distribution the policy will see at test time.

      DAgger (Dataset Aggregation) 00:20:29: collect training data from \(p_{\pi}(o_t)\) itself

      • run the current policy to collect observations
      • have an expert label those observations with the correct actions
      • train on the aggregated dataset
      • repeat

      But labeling the data is still difficult (an expert must label every visited observation), so DAgger is also not seen much in practice. (A minimal DAgger loop sketch is given after Figure 6.)
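A hedged aside, not from these notes but from the DAgger paper (Ross et al., 2011), which quantifies the compounding-error argument above: if the learned policy makes a mistake with probability \(\epsilon\) at each step, naive behavioral cloning can incur an expected total cost that grows quadratically with the horizon \(T\), because early mistakes push the policy into unfamiliar states where it keeps erring, while DAgger reduces this to linear growth:

\[ \text{behavioral cloning: } \mathbb{E}\Big[\sum_{t=1}^{T} c(s_t, a_t)\Big] = O(\epsilon T^2), \qquad \text{DAgger: } O(\epsilon T) \]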

dagger_dataset_aggregation-20230710083707.png

Figure 6: DAgger: Dataset Aggregation
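A minimal sketch of the DAgger loop, matching the four steps above. Everything here is a hypothetical stand-in: `env` is assumed to expose `reset()` and a simplified `step(action) -> (next_obs, done)`, `expert_policy` is the (human or scripted) labeler, and `train_policy` is a supervised training routine like the behavioral cloning loop sketched earlier.

```python
def dagger(env, policy, expert_policy, train_policy, n_iters=10, horizon=200):
    """Hypothetical DAgger loop; the interfaces are assumptions, not a real API."""
    dataset_obs, dataset_act = [], []
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(horizon):
            # 1. Run the current policy, so observations come from p_pi(o_t).
            dataset_obs.append(obs)
            # 2. Ask the expert to label the visited observation with an action.
            dataset_act.append(expert_policy(obs))
            # Simplified env interface: step returns (next_obs, done).
            obs, done = env.step(policy(obs))
            if done:
                obs = env.reset()
        # 3. Train on the aggregated dataset (all iterations so far).
        policy = train_policy(policy, dataset_obs, dataset_act)
        # 4. Repeat with the newly trained policy.
    return policy
```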

