HRL with Timed Subgoals
[pdf][environments][arXiv]
Concurrent learning problem:
- It is necessary to learn concurrently on all levels of the hierarchy.
- But the lower level's behaviour changes as it trains, which is difficult for the higher level: it is equivalent to learning in a non-stationary environment.
- So the higher level won't learn reliably until the lower level has converged.
Solution (when the subtask is to reach a goal state):
- In hindsight, replace the subgoal chosen by the higher level with the state the lower level actually achieved (hindsight action relabeling).
- But this makes the higher level's environment a semi-Markov decision process, because the time the lower level takes to reach a subgoal decreases as training progresses.
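A minimal sketch of hindsight action relabeling, assuming higher-level transitions are stored as plain dicts (the function and field names are hypothetical, not the authors' code):

```python
import numpy as np

def relabel_transition(transition, achieved_state):
    """Return a copy of a higher-level transition whose action (the
    commanded subgoal) is replaced by the state actually achieved."""
    relabeled = dict(transition)
    # In hindsight, pretend the achieved state was the commanded subgoal,
    # so the stored transition stays consistent as the lower level changes.
    relabeled["action"] = np.asarray(achieved_state)
    return relabeled
```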
Demerit:
- An example: say the agent has to play tennis, where timing is important. As the lower-level policy learns to reach goal states faster, the higher-level policy has to specify longer sequences of subgoals just to maintain the correct timing.
- This happens because the environment has dynamic elements beyond the agent's control (unlike, say, the ant maze problem).
So the higher-level policy must be conditioned on the goal state as well as on the dynamic component of the environment state, which is not under the policy's control. But when the agent's actions do not alter the environment, or alter it only on rare occasions, that dynamic component can be represented by the time elapsed. [pg. 5]
So a better solution is to let the higher level choose what goal state to reach as well as when to reach it, i.e. timed subgoals.
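A minimal sketch of how a timed subgoal might look in code (the names and the gym-style env interface are assumptions, not the paper's implementation): the higher-level action bundles a goal state with a time budget, and the lower level is rolled out for exactly that many steps, so the duration of a higher-level action is set by the higher level instead of drifting as the lower level improves.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TimedSubgoal:
    goal: np.ndarray  # what: desired goal state for the lower level
    delta_t: int      # when: number of env steps allotted to reach it

def run_lower_level(env, lower_policy, obs, subgoal):
    """Roll out the lower level for exactly subgoal.delta_t steps
    (gym-style env assumed). The subgoal's duration is fixed by the
    higher level, not by how fast the lower level happens to be."""
    total_reward, done = 0.0, False
    for t_remaining in range(subgoal.delta_t, 0, -1):
        # The lower level is conditioned on the goal and the time left.
        action = lower_policy(obs, subgoal.goal, t_remaining)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done
```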
1. Implementation Details
- The higher-level policy receives the cumulative environment reward over the subgoal's duration plus a penalty for emitting a subgoal (3.1.1); see the sketch after this list.
- Hindsight action relabeling is not always done:
  - It is skipped when the desired subgoal is actually reached, i.e. \(d(\hat{g}, g)=0\).
- Testing transitions: for some fixed percentage of subgoals, the lower level is made to act at its best (mean action instead of sampling from the action distribution), and if it then fails to reach the subgoal, the higher level is penalized (3.1.3). This lets the higher level learn the current capabilities of the lower level and assign feasible subgoals.
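A sketch tying these two mechanisms together; the constants (penalty values and testing fraction) are assumptions for illustration, not the paper's numbers:

```python
import random

SUBGOAL_PENALTY = 0.1  # hypothetical per-subgoal emission penalty (3.1.1)
TEST_FRACTION = 0.3    # hypothetical share of testing transitions (3.1.3)
MISS_PENALTY = 1.0     # hypothetical penalty for an infeasible subgoal

def is_testing_transition():
    """On a fixed fraction of subgoals the lower level acts with its
    mean (deterministic) action instead of sampling."""
    return random.random() < TEST_FRACTION

def higher_level_reward(cumulative_env_reward, tested, subgoal_reached):
    """Cumulative env reward minus the emission penalty, plus an extra
    penalty when a tested subgoal turns out to be infeasible."""
    reward = cumulative_env_reward - SUBGOAL_PENALTY
    if tested and not subgoal_reached:
        reward -= MISS_PENALTY  # lower level at its best still missed
    return reward
```

Presumably the miss penalty is only informative if the failed testing transition keeps its original subgoal (i.e. is not relabeled), since relabeling would turn every transition into a success.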
2. Test Environments
Three new environments (detailed descriptions of the tasks are on page 7):
- Platforms: the agent has to trigger the movement of a platform at the correct time
- Drawbridge: a boat has to unfurl its sail at the correct time
- Tennis2D: a robot arm has to return the ball to a varying goal region
Four old environments:
- UR5Reacher
- AntFourRooms
- Pendulum
- Ball in cup