Hierarchical Reinforcement Learning
Benefits of HRL:
- Sample Efficiency
- Scalability (Long horizon tasks)
- Sparse Rewards
- Generalization and Transfer Learning
- Structured Exploration
- Interpretability
Challenges:
- Performance Variability
- Training Stability: for the higher levels, the still-learning lower levels act like a non-stationary environment (see the sketch after this list)
- Option (Subtask) Discovery Problem
- Benchmarking Challenge: Lack of recognized standardized tools/benchmarks to efficiently measure progress in HRL
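Below is a minimal sketch of why this non-stationarity arises, assuming a generic two-level setup with a goal-conditioned low-level policy; the function and the environment interface are hypothetical illustrations, not any specific paper's API.

```python
# Hypothetical two-level rollout: the high level picks a subgoal every k steps,
# and the goal-conditioned low level tries to reach it. Because pi_lo keeps
# changing during training, the outcome the high level sees for the same
# (state, subgoal) pair drifts over time -- this is the non-stationarity problem.
def hierarchical_rollout(env, pi_hi, pi_lo, k=10, max_steps=1000):
    state = env.reset()                    # assumed simple env interface
    total_reward, t = 0.0, 0
    while t < max_steps:
        subgoal = pi_hi(state)             # high-level "action" is a subgoal
        for _ in range(k):                 # low level acts for k steps
            action = pi_lo(state, subgoal)
            state, reward, done = env.step(action)
            total_reward += reward
            t += 1
            if done or t >= max_steps:
                return total_reward
        # The transition (state, subgoal) -> next state is what the high level
        # learns from; it changes every time pi_lo is updated.
    return total_reward
```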
Papers:
- [pdf] HRL with Timed Subgoals
- [pdf] Hierarchical Actor Critic
See also:
- Horizon Reduction is needed to scale RL
- Action Chunking
- Recurrent World Models
- Agents are World Model (aka General agents need world model)
- Actions are infinite at the level of planning.
1. Environments
1.1. Environments in HRL with Timed Subgoals
Three new environments are introduced in the paper HRL with Timed Subgoals (detailed descriptions of the problems are on page 7); a sketch of the timed-subgoal idea follows the environment lists below:
- Platforms: agent has to trigger the movement of the platform at correct time
- Drawbridge: boat has to unfurl its sail at the correct time
- Tennis2D: robot arm has to return the ball to varying goal region
Four old environments:
- UR5Reacher
- Ant Four Rooms
- Pendulum
- Ball in cup
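A minimal sketch of the timed-subgoal idea that motivates these timing-critical environments, as I read the paper: the high-level action pairs a subgoal with the time by which it should be reached. The field names and the sparse reward below are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TimedSubgoal:
    goal: np.ndarray   # desired (partial) state, e.g. platform or sail configuration
    delta_t: int       # time budget in environment steps (assumed representation)

def low_level_reward(achieved_goal, ts: TimedSubgoal, steps_elapsed, tol=0.05):
    """Sparse reward: success only if the goal is hit at (roughly) the specified time."""
    on_time = steps_elapsed == ts.delta_t
    reached = np.linalg.norm(achieved_goal - ts.goal) < tol
    return 0.0 if (on_time and reached) else -1.0
```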
1.2. MuJoCo Suite
Challenges: Partial Observability, Sparse Reward, Continuous Control
Environments:
- Ant Four Rooms
- Ant Maze
- Key-Lock
Performance Comparison:
- Quadruped (Ant): HRL learns about twice as fast as flat RL; LIDOSS (2 levels), HAC (2 levels), and AdInfoHRL showed superior performance.
- Bipedal (HalfCheetah, Hopper): comparable performance; PPO outperformed the other methods, and AdInfoHRL was comparable to TD3.
So the benefit of HRL is greater where the morphology is more complex.
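A quick way to see the morphology gap, assuming the gymnasium MuJoCo suite with v4 environment IDs: Ant has 8 actuated joints versus HalfCheetah's 6 and Hopper's 3, plus a larger observation vector.

```python
import gymnasium as gym

# Print observation/action dimensionality for the environments compared above.
# Requires `pip install "gymnasium[mujoco]"`; the v4 IDs are an assumption
# about the installed version.
for env_id in ["Ant-v4", "HalfCheetah-v4", "Hopper-v4"]:
    env = gym.make(env_id)
    print(env_id,
          "obs dim:", env.observation_space.shape[0],
          "action dim:", env.action_space.shape[0])
    env.close()
```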
Papers:
- LIDOSS: End-to-End Hierarchical Reinforcement Learning With Integrated Subgoal Discovery [researchgate.net - IEEE]
- AdInfoHRL: Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization [arXiv - ICLR]
1.3. Atari
Challenges: Sparse reward, long-horizon planning
Environments:
- Montezuma's Revenge
- Ms. Pac-Man
- Space Invaders
Variations:
- Atari 100k [See Leaderboard on paperswithcode.com] (a step-budget sketch follows this list)
- Atari 10k
- The AXIOM paper claims to learn Atari games within 10k interactions and just 2 h of training
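A minimal sketch of what the 100k budget means in practice, assuming the ALE environments bundled via ale-py/gymnasium: the limit counts agent steps, and with the default frame-skip of 4 in the v5 environments that corresponds to 400k raw frames, roughly two hours of game time.

```python
import gymnasium as gym

# Requires `pip install "gymnasium[atari]"`; the ALE/...-v5 naming and its
# default frame-skip of 4 are assumptions about the installed version.
env = gym.make("ALE/MsPacman-v5")
obs, info = env.reset(seed=0)

BUDGET = 100_000                        # the Atari 100k interaction budget
for step in range(BUDGET):
    action = env.action_space.sample()  # stand-in for the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```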
Performance Comparison:
Methods: TEMPLE + PPO, Dynamic HRL (dHRL), MaxQ-Q
- Montezuma's Revenge: Go-Explore (not HRL, but a closely related method) does very well while flat RL struggles
- Ms. Pac-Man: Hybrid Reward Architecture (HRA) does well
Papers:
- TEMPLE: Temporal-adaptive Hierarchical Reinforcement Learning [arXiv]
- dHRL: Hierarchical Reinforcement Learning for Playing a Dynamic Dungeon Crawler Game [ai.rug.nl - IEEE]
- MaxQ-Q: Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition [arXiv]
- Go-Explore: a New Approach for Hard-Exploration Problems [arXiv], published in Nature as "First return, then explore" [nature.com]
- HRA: Hybrid Reward Architecture for Reinforcement Learning [arXiv - Microsoft][blog][YT]
- AXIOM: Learning to Play Games in Minutes with Expanding Object-Centric Models [arXiv]
1.4. MiniGrid
Challenges: Large state space, Sparse rewards, Long-term reasoning & exploration
Environments:
- MiniGrid-Empty
- Four Rooms
- Nine Rooms
- DoorKey
Performance Comparison: Methods: Decoupled HRL (DcHRL-SA), HplanPPO, HRM
HRL methods do significantly better than PPO: DcHRL-SA in DoorKey, and HplanPPO in Four Rooms and Nine Rooms with Locked Doors.
HRL's ability to decompose a task into subgoals, each with a denser or intrinsic reward, provides more learning signal. This reward shaping, combined with exploration at a higher level of abstraction, lets HRL navigate vast state spaces more efficiently (a minimal sketch follows).
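A minimal sketch of that denser, subgoal-reaching reward on MiniGrid, assuming the `minigrid` package's registered environment IDs; the subgoal representation (a target grid cell) and the shaping constants are illustrative assumptions, not any specific paper's method.

```python
import gymnasium as gym
import numpy as np
import minigrid  # noqa: F401 -- registers the MiniGrid-* envs (pip install minigrid)

def intrinsic_reward(agent_pos, subgoal_pos):
    """Dense low-level reward: get closer to the cell the high level proposed."""
    dist = np.abs(np.array(agent_pos) - np.array(subgoal_pos)).sum()  # Manhattan distance
    return 1.0 if dist == 0 else -0.01 * dist

env = gym.make("MiniGrid-DoorKey-8x8-v0")
obs, info = env.reset(seed=0)
subgoal = (3, 3)                          # hypothetical subgoal cell, e.g. near the key
for _ in range(50):
    action = env.action_space.sample()    # stand-in for the low-level policy
    obs, ext_reward, terminated, truncated, info = env.step(action)
    r = intrinsic_reward(env.unwrapped.agent_pos, subgoal)  # replaces the sparse reward
    if terminated or truncated:
        break
env.close()
```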
Papers: