Hierarchical Reinforcement Learning
Benefits of HRL:
- Sample Efficiency
- Scalability (Long horizon tasks)
- Sparse Rewards
- Generalization and Transfer Learning
- Structured Exploration
- Interpretability
Challenges:
- Performance Variability
- Training Stability: for the higher levels, the still-learning lower levels act like a non-stationary environment (see the sketch after this list)
- Option (Subtask) Discovery Problem
- Benchmarking Challenge: Lack of recognized standardized tools/benchmarks to efficiently measure progress in HRL
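Below is a minimal sketch of why this non-stationarity arises, assuming a generic two-level setup with a goal-conditioned low-level policy; the function and the environment interface are hypothetical illustrations, not any specific paper's API.

```python
# Hypothetical two-level rollout: the high level picks a subgoal every k steps,
# and the goal-conditioned low level tries to reach it. Because pi_lo keeps
# changing during training, the outcome the high level sees for the same
# (state, subgoal) pair drifts over time -- this is the non-stationarity problem.
def hierarchical_rollout(env, pi_hi, pi_lo, k=10, max_steps=1000):
    state = env.reset()                    # assumed simple env interface
    total_reward, t = 0.0, 0
    while t < max_steps:
        subgoal = pi_hi(state)             # high-level "action" is a subgoal
        for _ in range(k):                 # low level acts for k steps
            action = pi_lo(state, subgoal)
            state, reward, done = env.step(action)
            total_reward += reward
            t += 1
            if done or t >= max_steps:
                return total_reward
        # The transition (state, subgoal) -> next state is what the high level
        # learns from; it changes every time pi_lo is updated.
    return total_reward
```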
Papers:
- [pdf] HRL with Timed Subgoals
- [pdf] Hierarchical Actor Critic
See also:
- Horizon Reduction is needed to scale RL
- Action Chunking
- Recurrent World Models
- Agents are World Model (aka General agents need world model)
- Actions are infinite at the level of planning.
1. Environments
1.1. Environments in HRL with Timed Subgoals
Three new environments are introduced in the paper HRL with Timed Subgoals (detailed descriptions of the problems are on page 7); a sketch of the timed-subgoal idea follows the environment lists below:
- Platforms: agent has to trigger the movement of the platform at correct time
- Drawbridge: boat has to unfurl its sail at the correct time
- Tennis2D: robot arm has to return the ball to varying goal region
Four old environments:
- UR5Reacher
- Ant Four Rooms
- Pendulum
- Ball in cup
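A minimal sketch of the timed-subgoal idea that motivates these timing-critical environments, as I read the paper: the high-level action pairs a subgoal with the time by which it should be reached. The field names and the sparse reward below are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TimedSubgoal:
    goal: np.ndarray   # desired (partial) state, e.g. platform or sail configuration
    delta_t: int       # time budget in environment steps (assumed representation)

def low_level_reward(achieved_goal, ts: TimedSubgoal, steps_elapsed, tol=0.05):
    """Sparse reward: success only if the goal is hit at (roughly) the specified time."""
    on_time = steps_elapsed == ts.delta_t
    reached = np.linalg.norm(achieved_goal - ts.goal) < tol
    return 0.0 if (on_time and reached) else -1.0
```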
1.2. MuJoCo Suite
Challenges: Partial Observability, Sparse Reward, Continuous Control
Environments:
- Ant Four Rooms
- Ant Maze
- Key-Lock
Performance Comparison:
- Quadruped (Ant): HRL learns about twice as fast as flat RL; LIDOSS (2 levels), HAC (2 levels), and AdInfoHRL showed superior performance.
- Bipedal (HalfCheetah, Hopper): comparable performance; PPO outperformed the other methods, and AdInfoHRL was comparable to TD3.
So the benefit of HRL is greater where the morphology is more complex.
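A quick way to see the morphology gap, assuming the gymnasium MuJoCo suite with v4 environment IDs: Ant has 8 actuated joints versus HalfCheetah's 6 and Hopper's 3, plus a larger observation vector.

```python
import gymnasium as gym

# Print observation/action dimensionality for the environments compared above.
# Requires `pip install "gymnasium[mujoco]"`; the v4 IDs are an assumption
# about the installed version.
for env_id in ["Ant-v4", "HalfCheetah-v4", "Hopper-v4"]:
    env = gym.make(env_id)
    print(env_id,
          "obs dim:", env.observation_space.shape[0],
          "action dim:", env.action_space.shape[0])
    env.close()
```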
Papers:
- LIDOSS: End-to-End Hierarchical Reinforcement Learning With Integrated Subgoal Discovery [researchgate.net - IEEE]
- AdInfoHRL: Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization [arXiv - ICLR]
1.3. Atari
Challenges: Sparse reward, long-horizon planning
Environments:
- Montezuma's Revenge
- Ms. Pac-Man
- Space Invaders
Variations:
- Atari 100k [See Leaderboard on paperswithcode.com] (a step-budget sketch follows this list)
- Atari 10k
- The AXIOM paper claims to learn Atari games within 10k interactions and just 2 h of training
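A minimal sketch of what the 100k budget means in practice, assuming the ALE environments bundled via ale-py/gymnasium: the limit counts agent steps, and with the default frame-skip of 4 in the v5 environments that corresponds to 400k raw frames, roughly two hours of game time.

```python
import gymnasium as gym

# Requires `pip install "gymnasium[atari]"`; the ALE/...-v5 naming and its
# default frame-skip of 4 are assumptions about the installed version.
env = gym.make("ALE/MsPacman-v5")
obs, info = env.reset(seed=0)

BUDGET = 100_000                        # the Atari 100k interaction budget
for step in range(BUDGET):
    action = env.action_space.sample()  # stand-in for the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```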
Performance Comparison:
Methods: TEMPLE + PPO, Dynamic HRL (dHRL), MaxQ-Q
- Montezuma's Revenge: Go-Explore (not HRL, but a closely related method) does very well while flat RL struggles
- Ms. Pac-Man: Hybrid Reward Architecture (HRA) does well
Papers:
- TEMPLE: Temporal-adaptive Hierarchical Reinforcement Learning [arXiv]
- dHRL: Hierarchical Reinforcement Learning for Playing a Dynamic Dungeon Crawler Game [ai.rug.nl - IEEE]
- MaxQ-Q: Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition [arXiv]
- Go-Explore: a New Approach for Hard-Exploration Problems [arXiv], published in Nature as "First return, then explore" [nature.com]
- HRA: Hybrid Reward Architecture for Reinforcement Learning [arXiv - Microsoft][blog][YT]
- AXIOM: Learning to Play Games in Minutes with Expanding Object-Centric Models [arXiv]
1.4. MiniGrid
Challenges: Large state space, Sparse rewards, Long-term reasoning & exploration
Environments:
- MiniGrid-Empty
- Four Rooms
- Nine Rooms
- DoorKey
Performance Comparison: Methods: Decoupled HRL (DcHRL-SA), HplanPPO, HRM
HRL methods do significantly better than PPO: DcHRL-SA in DoorKey, and HplanPPO in Four Rooms and Nine Rooms with Locked Doors.
HRL's ability to decompose a task into subgoals, each with a denser or intrinsic reward, provides more learning signal. This reward shaping, combined with exploration at a higher level of abstraction, lets HRL navigate vast state spaces more efficiently (a minimal sketch follows).
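A minimal sketch of that denser, subgoal-reaching reward on MiniGrid, assuming the `minigrid` package's registered environment IDs; the subgoal representation (a target grid cell) and the shaping constants are illustrative assumptions, not any specific paper's method.

```python
import gymnasium as gym
import numpy as np
import minigrid  # noqa: F401 -- registers the MiniGrid-* envs (pip install minigrid)

def intrinsic_reward(agent_pos, subgoal_pos):
    """Dense low-level reward: get closer to the cell the high level proposed."""
    dist = np.abs(np.array(agent_pos) - np.array(subgoal_pos)).sum()  # Manhattan distance
    return 1.0 if dist == 0 else -0.01 * dist

env = gym.make("MiniGrid-DoorKey-8x8-v0")
obs, info = env.reset(seed=0)
subgoal = (3, 3)                          # hypothetical subgoal cell, e.g. near the key
for _ in range(50):
    action = env.action_space.sample()    # stand-in for the low-level policy
    obs, ext_reward, terminated, truncated, info = env.step(action)
    r = intrinsic_reward(env.unwrapped.agent_pos, subgoal)  # replaces the sparse reward
    if terminated or truncated:
        break
env.close()
```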
Papers: