HRL with Timed Subgoals
[pdf][environments][arXiv]
Concurrent learning problem:
- It is necessary to learn concurrently on all levels of the hierarchy.
- But the lower level's behaviour changes as it trains, which is difficult for the higher level: it is equivalent to learning in a non-stationary environment.
- So the higher level won't learn reliably until the lower level has converged.
Solution (when the subtask is to reach a goal state):
- In hindsight, replace the subgoal chosen by the higher level with the state the lower level actually achieved (hindsight action relabeling).
- But this makes the higher level's environment a semi-Markov decision process, because the time the lower level takes to reach a subgoal decreases as training progresses.
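A minimal sketch of hindsight action relabeling, assuming higher-level transitions are stored as plain dicts (the function and field names are hypothetical, not the authors' code):

```python
import numpy as np

def relabel_transition(transition, achieved_state):
    """Return a copy of a higher-level transition whose action (the
    commanded subgoal) is replaced by the state actually achieved."""
    relabeled = dict(transition)
    # In hindsight, pretend the achieved state was the commanded subgoal,
    # so the stored transition stays consistent as the lower level changes.
    relabeled["action"] = np.asarray(achieved_state)
    return relabeled
```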
Demerit:
- An example: say the agent has to play tennis, where timing is important. As the lower-level policy learns to reach goal states faster, the higher-level policy has to specify longer sequences of subgoals just to maintain the correct timing.
- This happens because the environment has dynamic elements beyond the agent's control (unlike, say, the ant maze problem).
So the higher-level policy must be conditioned on the goal state as well as on the dynamic component of the environment state, which is not under the policy's control. But when the agent's actions do not alter the environment, or alter it only on rare occasions, that dynamic component can be represented by the time elapsed. [pg. 5]
So a better solution is to let the higher level choose what goal state to reach as well as when to reach it, i.e. timed subgoals.
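A minimal sketch of how a timed subgoal might look in code (the names and the gym-style env interface are assumptions, not the paper's implementation): the higher-level action bundles a goal state with a time budget, and the lower level is rolled out for exactly that many steps, so the duration of a higher-level action is set by the higher level instead of drifting as the lower level improves.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TimedSubgoal:
    goal: np.ndarray  # what: desired goal state for the lower level
    delta_t: int      # when: number of env steps allotted to reach it

def run_lower_level(env, lower_policy, obs, subgoal):
    """Roll out the lower level for exactly subgoal.delta_t steps
    (gym-style env assumed). The subgoal's duration is fixed by the
    higher level, not by how fast the lower level happens to be."""
    total_reward, done = 0.0, False
    for t_remaining in range(subgoal.delta_t, 0, -1):
        # The lower level is conditioned on the goal and the time left.
        action = lower_policy(obs, subgoal.goal, t_remaining)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done
```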
1. Implementation Details
- The higher-level policy receives the cumulative environment reward over the subgoal's duration plus a penalty for emitting a subgoal (3.1.1); see the sketch after this list.
- Hindsight action relabeling is not always done:
  - It is skipped when the desired subgoal is actually reached, i.e. \(d(\hat{g}, g)=0\).
- Testing transitions: for some fixed percentage of subgoals, the lower level is made to act at its best (mean action instead of sampling from the action distribution), and if it then fails to reach the subgoal, the higher level is penalized (3.1.3). This lets the higher level learn the current capabilities of the lower level and assign feasible subgoals.
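A sketch tying these two mechanisms together; the constants (penalty values and testing fraction) are assumptions for illustration, not the paper's numbers:

```python
import random

SUBGOAL_PENALTY = 0.1  # hypothetical per-subgoal emission penalty (3.1.1)
TEST_FRACTION = 0.3    # hypothetical share of testing transitions (3.1.3)
MISS_PENALTY = 1.0     # hypothetical penalty for an infeasible subgoal

def is_testing_transition():
    """On a fixed fraction of subgoals the lower level acts with its
    mean (deterministic) action instead of sampling."""
    return random.random() < TEST_FRACTION

def higher_level_reward(cumulative_env_reward, tested, subgoal_reached):
    """Cumulative env reward minus the emission penalty, plus an extra
    penalty when a tested subgoal turns out to be infeasible."""
    reward = cumulative_env_reward - SUBGOAL_PENALTY
    if tested and not subgoal_reached:
        reward -= MISS_PENALTY  # lower level at its best still missed
    return reward
```

Presumably the miss penalty is only informative if the failed testing transition keeps its original subgoal (i.e. is not relabeled), since relabeling would turn every transition into a success.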
2. Test Environments
Three new environments (detailed descriptions of the tasks are on page 7):
- Platforms: the agent has to trigger the movement of a platform at the correct time
- Drawbridge: a boat has to unfurl its sail at the correct time
- Tennis2D: a robot arm has to return the ball to a varying goal region
Four old environments:
- UR5Reacher
- AntFourRooms
- Pendulum
- Ball in cup