Goal-Conditioned Supervised Learning
References
- https://dibyaghosh.com/blog/rl/gcsl.html
- https://www.youtube.com/watch?v=-vMcPk2Uc8g
- paper: Learning to Reach Goals via Iterated Supervised Learning - 1912.06088v4.pdf
Authors: Dibya Ghosh, Benjamin Eysenbach, Sergey Levine
1. Any trajectory is optimal if the goal is the final state of the trajectory
Any trajectory is a successful demonstration for reaching the final state in that same trajectory. (pg. 1)
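The relabeling idea above can be sketched in a few lines (a minimal illustration, not the paper's code; the function name and string states are made up):

```python
# Hypothetical sketch: turn any trajectory into goal-conditioned
# supervised data by relabeling the goal as the trajectory's final state.
def relabel_trajectory(states, actions):
    """states has length T+1, actions has length T."""
    goal = states[-1]  # the final state becomes the goal
    # Each (state, goal, action) tuple is a valid demonstration for
    # reaching `goal`, because the trajectory actually reached it.
    return [(s, goal, a) for s, a in zip(states[:-1], actions)]

data = relabel_trajectory(["s0", "s1", "s2"], ["a0", "a1"])
# data → [("s0", "s2", "a0"), ("s1", "s2", "a1")]
```

Every state-action pair in the trajectory becomes a supervised training example for the goal "s2", with no reward signal needed.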
2. Comparison with HER
GCSL differs from Hindsight Experience Replay (HER). (See 00:10:33 Comparison with HER)
| | HER | GCSL |
|---|---|---|
| Is the Goal in the Trajectory? | NO | YES |
| Uses TD Learning? | YES | NO |
- Goal from Trajectory?
  - Given a transition, HER creates a fictitious transition by choosing an arbitrary goal and recomputing the reward for that goal. The goal does not have to come from the trajectory.
  - 00:10:57 GCSL only relabels the transition's goal to be the final state of the trajectory.
- TD Learning? 00:11:21
  - HER uses TD learning (to learn a value function), which can be unstable.
  - GCSL directly learns the policy via supervised learning; imitation learning is stable.
  - So even if we replaced HER's relabeled goal with the terminal state of the trajectory, learning a value function would still not be as stable as learning the policy directly.
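The two relabeling schemes can be contrasted side by side (a hedged sketch; function names, the string states, and the reward function are illustrative, not from either paper's code):

```python
import random

def her_relabel(transition, candidate_goals, reward_fn):
    # HER: pick an arbitrary goal (not necessarily from this trajectory)
    # and recompute the reward; the result feeds an off-policy TD
    # learner for a goal-conditioned value function Q(s, a, g).
    s, a, s_next = transition
    g = random.choice(candidate_goals)
    r = reward_fn(s_next, g)
    return (s, a, r, s_next, g)

def gcsl_relabel(states, actions):
    # GCSL: the relabeled goal is always the trajectory's own final
    # state, and no reward is needed: the tuples are used directly as
    # supervised (state, goal) -> action targets for imitation learning.
    goal = states[-1]
    return [((s, goal), a) for s, a in zip(states[:-1], actions)]

# HER needs a reward function and later a TD update on Q;
# GCSL only needs a classification/regression loss on the policy.
her_sample = her_relabel(("s0", "a0", "s1"), ["g"],
                         lambda s, g: 1.0 if s == g else 0.0)
gcsl_samples = gcsl_relabel(["s0", "s1", "s2"], ["a0", "a1"])
```

The key structural difference is visible in the return values: the HER tuple carries a reward for bootstrapped value learning, while the GCSL tuples are plain input-target pairs for a stable supervised update.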