2025-07-16

HRL with Timed Subgoals

[pdf][environments][arXiv]

Concurrent learning problem:

Both levels are trained at the same time, so the lower-level policy the higher level acts through keeps changing; the same subgoal can lead to different outcomes over the course of training, i.e. the higher level faces non-stationary transition dynamics.

Solution (when subtask is to reach a goal state):

Hindsight action relabeling (as in HAC): store the higher-level transition as if the emitted subgoal had been the state the lower level actually reached, so the replay data is consistent with an already-converged lower level.

Demerit:

The relabeled action records only which state was reached, not when; in a dynamic environment the uncontrolled part of the state keeps evolving on its own, so the outcome of reaching the same goal state depends on timing and the relabeled transitions can still be inconsistent.

So, the higher-level policy is conditioned on the goal state as well as on the dynamic component of the environment state. That dynamic state is not under the policy's control. But when the agent's actions don't alter the environment, or alter it only on rare occasions, we can represent that dynamic state by the time elapsed. [pg. 5]

So a better solution is to let the higher level choose which goal state to reach as well as when to reach it, i.e. timed subgoals.
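A minimal sketch of the resulting interface, assuming a goal-conditioned lower-level policy and the classic Gym-style step API; every name here is illustrative, not from the paper's code:

```python
# Illustrative sketch of a timed subgoal (hypothetical names, not the
# authors' implementation). The higher level emits a pair (goal, delta_t);
# the lower level is conditioned on the goal and on the time remaining
# until the deadline, which stands in for the uncontrolled dynamic state.

from dataclasses import dataclass
import numpy as np

@dataclass
class TimedSubgoal:
    goal: np.ndarray  # desired goal state for the lower level
    delta_t: int      # env steps until the goal should be reached

def run_lower_level(env, obs, lower_policy, subgoal):
    """Roll out the lower level until the subgoal's deadline expires."""
    cumulative_reward = 0.0
    done = False
    for t_remaining in range(subgoal.delta_t, 0, -1):
        # Lower-level input: observation, goal, and time left to deadline.
        action = lower_policy(obs, subgoal.goal, t_remaining)
        obs, reward, done, info = env.step(action)
        cumulative_reward += reward  # later passed to the higher level
        if done:
            break
    return obs, cumulative_reward, done
```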

1. Implementation Details

  • The higher-level policy receives the cumulative environment reward collected while the subgoal was active, plus a penalty for emitting a subgoal (3.1.1).
  • Hindsight action relabeling is not always done (see the sketch after this list):
    1. When the desired subgoal is reached, i.e. \(d(\hat{g}, g) = 0\), relabeling would be a no-op.
    2. Testing transitions: for some fixed percentage of subgoals, the lower level is made to act at its best (mean action instead of sampling from the action distribution), and if it fails to reach the goal, the higher level is penalized (3.1.3). This lets the higher level track the current capabilities of the lower level and assign feasible subgoals.
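A sketch of how these rules might combine when building a higher-level replay transition; the penalty, tolerance, and testing-fraction constants are placeholder assumptions, as are all function names:

```python
# Sketch of higher-level transition construction under the rules above.
# All names and constants are placeholders, not the paper's code.
import random
import numpy as np

SUBGOAL_PENALTY = -1.0    # assumed penalty for emitting a subgoal
TESTING_FRACTION = 0.1    # assumed fraction of subgoals run as tests
GOAL_TOLERANCE = 1e-3     # assumed tolerance for d(ĝ, g) = 0

def build_higher_level_transition(state, emitted_goal, achieved_goal,
                                  cumulative_env_reward, next_state,
                                  is_testing):
    """Build one replay transition for the higher level."""
    # Reward: cumulative environment reward plus the emission penalty.
    reward = cumulative_env_reward + SUBGOAL_PENALTY

    if np.linalg.norm(emitted_goal - achieved_goal) <= GOAL_TOLERANCE:
        # Case 1: subgoal reached, d(ĝ, g) = 0 -- relabeling would be a
        # no-op, so keep the emitted action.
        action = emitted_goal
    elif is_testing:
        # Case 2: testing transition -- the lower level acted with its mean
        # action and still missed, so keep the action and add a penalty
        # for assigning an infeasible subgoal.
        action = emitted_goal
        reward += SUBGOAL_PENALTY  # placeholder value for the test penalty
    else:
        # Default: hindsight action relabeling -- store the achieved state
        # as if the higher level had asked for it.
        action = achieved_goal

    return {"state": state, "action": action,
            "reward": reward, "next_state": next_state}

def sample_is_testing():
    # A fixed percentage of subgoals are designated testing transitions.
    return random.random() < TESTING_FRACTION
```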

2. Test Environments

Three new environments (detailed descriptions of the tasks are on page 7):

  • Platforms: the agent has to trigger the movement of a platform at the correct time
  • Drawbridge: a boat has to unfurl its sail at the correct time
  • Tennis2D: a robot arm has to return the ball to a varying goal region

Four old environments:

  • UR5Reacher
  • AntFourRooms
  • Pendulum
  • Ball in cup
