2025-07-12

Hierarchical Reinforcement Learning

Table of Contents

Benefits of HRL:

Challenges:

Papers:

See also:

1. Environments

1.1. Environments in HRL with Timed Subgoals

Three new environments are introduced in the paper HRL with Timed Subgoals (HiTS); a detailed description of the problems is on page 7 of the paper. All three are timing-critical (see the sketch after the environment lists below):

  • Platforms: the agent has to trigger the movement of a platform at the correct time
  • Drawbridge: a boat has to unfurl its sail at the correct time
  • Tennis2D: a robot arm has to return the ball to a varying goal region

Four old environments:

  • UR5Reacher
  • Ant Four Rooms
  • Pendulum
  • Ball in cup
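The distinguishing idea in HiTS is that the higher level proposes not just a desired state but also the point in time at which it should be reached, which is what makes timing-critical tasks like Platforms or Drawbridge tractable. A minimal sketch of such a timed subgoal (the names and the achievement test are my own illustration, not the paper's code):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class TimedSubgoal:
        """A subgoal in the spirit of HiTS: a desired (partial) state plus
        the remaining time budget in which it should be reached."""
        desired_state: np.ndarray   # e.g. a target platform or ball position
        time_to_achieve: int        # environment steps left until the deadline

    def subgoal_achieved(obs: np.ndarray, sg: TimedSubgoal, tol: float = 0.05) -> bool:
        # The low level is judged at the deadline: it succeeds only if the
        # observation matches the desired state when the time budget runs out.
        return sg.time_to_achieve <= 0 and np.linalg.norm(obs - sg.desired_state) < tol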

1.2. MuJoCo Suite

Challenges: Partial Observability, Sparse Reward, Continuous Control

Environments:

  • Ant Four Rooms
  • Ant Maze
  • Key-Lock

Performance Comparison:

  • Quadruped - Ant: HRL learns twice as fast as flat RL.

    2-level LIDOSS, 2-level HAC, and AdInfoHRL showed superior performance.

  • Bipedal - HalfCheetah, Hopper: comparable performance.

    PPO outperformed the other methods; AdInfoHRL was comparable to TD3.

So the benefits of HRL are greater where the morphology is complex.
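For context, the 2-level methods above share roughly the same control flow: a high-level policy proposes a subgoal, and a goal-conditioned low-level policy gets a fixed number of steps to reach it. A rough sketch of that loop under a Gymnasium-style interface (the policies, horizon k, and tolerance are placeholders, not any specific paper's implementation):

    import numpy as np

    def run_two_level_episode(env, high_policy, low_policy, k=10, tol=0.5):
        """One episode of a generic 2-level goal-conditioned hierarchy.

        high_policy(obs)         -> subgoal (a desired state)
        low_policy(obs, subgoal) -> primitive action
        Training (e.g. off-policy with hindsight, as in HAC) is omitted;
        only the interaction loop is sketched.
        """
        obs, _ = env.reset()
        done = False
        while not done:
            subgoal = high_policy(obs)              # high level picks a subgoal
            for _ in range(k):                      # low level gets k steps to reach it
                action = low_policy(obs, subgoal)
                obs, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                if done or np.linalg.norm(obs - subgoal) < tol:
                    break                           # subgoal reached or episode over
        return obs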

Papers:

  • LIDOSS: End-to-End Hierarchical Reinforcement Learning With Integrated Subgoal Discovery [researchgate.net - IEEE]
  • AdInfoHRL: Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization [arXiv - ICLR]

1.3. Atari

Challenges: Sparse reward, long-horizon planning

Environments:

  • Montezuma's Revenge
  • Ms. Pac-Man
  • Space Invaders

Variations:

Performance Comparison:

Methods: TEMPLE + PPO, Dynamic HRL (dHRL), MaxQ-Q

  • Montezuma's Revenge: Go-Explore (not an HRL method, but closely related) does very well while flat RL struggles (sketched below)
  • Ms. Pac-Man: Hybrid Reward Architecture (HRA) does well (also sketched below)
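Go-Explore's "first return, then explore" idea: keep an archive of visited "cells", repeatedly restore the emulator to a promising cell, then explore from there with random actions. A toy sketch of the exploration phase, assuming a deterministic env that exposes get_state()/set_state() snapshots (those names, the cell selection, and the step budgets are illustrative only):

    import random

    def go_explore(env, downscale, n_iterations=10_000):
        """Toy Go-Explore exploration phase.

        downscale(obs) -> hashable cell id (e.g. a downsampled frame).
        """
        obs, _ = env.reset()
        archive = {downscale(obs): (env.get_state(), 0.0)}   # cell -> (snapshot, score)
        for _ in range(n_iterations):
            # 1. Select a cell (uniformly here; the paper favours rarely visited cells).
            snapshot, score = archive[random.choice(list(archive))]
            # 2. First return: restore the emulator to that cell.
            env.set_state(snapshot)
            # 3. Then explore from it, adding newly discovered cells to the archive.
            for _ in range(100):
                obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
                score += reward
                cell = downscale(obs)
                if cell not in archive or score > archive[cell][1]:
                    archive[cell] = (env.get_state(), score)
                if terminated or truncated:
                    break
        return archive

HRA's trick in Ms. Pac-Man is different: the reward is split into many simple components (e.g. one per pellet, ghost, fruit), each with its own value head, and the agent acts on their sum. A one-function sketch of that aggregation (the head shapes and greedy rule are my assumptions):

    import numpy as np

    def hra_action(q_heads, obs):
        """Each head estimates Q-values for one reward component; act greedily
        on the sum of the per-component estimates."""
        q_total = sum(head(obs) for head in q_heads)   # shape: (n_actions,)
        return int(np.argmax(q_total))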

Papers:

  • TEMPLE: Temporal-adaptive Hierarchical Reinforcement Learning [arXiv]
  • dHRL: Hierarchical Reinforcement Learning for Playing a Dynamic Dungeon Crawler Game [ai.rug.nl - IEEE]
  • MaxQ-Q: Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition [arXiv]
  • Go-Explore: a New Approach for Hard-Exploration Problems [arXiv]

    Published in Nature as "First return, then explore" [nature.com]

  • HRA: Hybrid Reward Architecture for Reinforcement Learning [arXiv - Microsoft][blog][YT]
  • AXIOM: Learning to Play Games in Minutes with Expanding Object-Centric Models [arXiv]

1.4. MiniGrid

Challenges: Large state space, Sparse rewards, Long-term reasoning & exploration

Environments:

  • MiniGrid-Empty
  • Four Rooms
  • Nine Rooms
  • DoorKey

Performance Comparison:

Methods: Decoupled HRL (DcHRL-SA), HplanPPO, HRM

HRL methods do significantly better than PPO: DcHRL-SA in DoorKey, and HplanPPO in Four Rooms and Nine Rooms with locked doors.

HRL's ability to decompose a task into subgoals, each with denser or intrinsic rewards, provides more learning signal. This reward shaping, combined with exploration at a higher level of abstraction, lets HRL navigate vast state spaces more efficiently.
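To make the "denser or intrinsic rewards" point concrete: in a DoorKey-style task the hierarchy can treat pick-up-key, open-door and reach-goal as ordered subgoals and pay the agent a bonus for completing each one, instead of a single sparse reward at the very end. A hedged sketch (the observation keys and reward values are invented for illustration):

    # Ordered subgoals for a DoorKey-style task; each predicate reports
    # whether the current observation satisfies that subgoal.
    SUBGOALS = [
        ("pick_up_key", lambda obs: obs["carrying_key"]),
        ("open_door",   lambda obs: obs["door_open"]),
        ("reach_goal",  lambda obs: obs["at_goal"]),
    ]

    def intrinsic_reward(obs, progress):
        """Return (bonus, new_progress): +1 each time the next subgoal in the
        sequence is completed, so the agent sees three learning signals
        instead of one sparse terminal reward."""
        if progress < len(SUBGOALS) and SUBGOALS[progress][1](obs):
            return 1.0, progress + 1
        return 0.0, progress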

Papers:

  • DcHRL-SA: Decoupled Hierarchical Reinforcement Learning with State Abstraction for Discrete Grids [arXiv]
  • HplanPPO: Hierarchical Reinforcement Learning with AI Planning Models [arXiv]
  • HRM: Hierarchies of Reward Machines [arXiv]
