2026-03-05

SO(3) Action Representation for Deep RL

[pdf][arXiv]

SO(3) is the manifold of 3D rotations. The challenge is that the underlying manifold is curved, and there exists no minimal parameterization mapping \(\mathbb R^3\) to SO(3) that is globally smooth, bijective and non-singular.

1. Representations

There are different representations for actions in SO(3):

  1. Euler angles

    Three angles that represent a rotation. The problem is that the mapping is not smooth (due to angle wrapping) and has singularities (e.g. gimbal lock).

  2. Quaternions

    This uses 4 numbers with the constraint \(||q||_2 = 1\). The mapping is smooth and has no singularities, but it is not unique because \(q\) and \(-q\) represent the same rotation (the double cover).

  3. Rotation matrices

    These are 3x3 orthonormal matrices and thus heavily overparameterized (9 numbers for 3 degrees of freedom). The upside is that they are smooth and unique.

  4. Tangent space

    Tangent vectors (axis-angle) are locally smooth but have singularities at large angles. They are also not unique, because the exponential map wraps around infinitely many times along each axis.

In any case, any minimal (i.e. \(\mathbb R^3\)) chart must have singularities, and global parameterizations that avoid singularities are redundant and constrained.
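The four representations can be compared directly. This is a minimal sketch using `scipy.spatial.transform.Rotation` (the example rotation values are arbitrary); it also demonstrates the two non-uniqueness issues mentioned above:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# One rotation, four representations (example values are arbitrary).
r = R.from_euler("xyz", [0.3, -0.5, 1.2])

euler = r.as_euler("xyz")   # 3 numbers, has singularities (gimbal lock)
quat = r.as_quat()          # 4 numbers, unit norm, double cover
matrix = r.as_matrix()      # 9 numbers, orthonormal, smooth and unique
rotvec = r.as_rotvec()      # 3 numbers, axis * angle (tangent space)

# Double cover: q and -q encode the same rotation.
same = R.from_quat(-quat)
assert np.allclose(matrix, same.as_matrix())

# Exponential-map wrapping: angle theta and theta + 2*pi about the
# same axis give the same rotation.
angle = np.linalg.norm(rotvec)
axis = rotvec / angle
wrapped = R.from_rotvec(axis * (angle + 2 * np.pi))
assert np.allclose(matrix, wrapped.as_matrix())
```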

Actions can also be represented as delta actions relative to the current pose/rotation, rather than as absolute targets. So we have 4x2 = 8 possible action representations.

In the experiments the effect of the actions is restricted to a maximum magnitude, because physical systems such as robot arms have bounds on their angular rate of movement.
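A sketch of how a rate-limited delta action might be applied, assuming scipy and an assumed per-step limit `MAX_ANGLE`; the distinction between local and global frames is just the side of the composition:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

MAX_ANGLE = 0.1  # assumed per-step rotation limit (radians)

def clip_rotvec(delta_rotvec: np.ndarray) -> np.ndarray:
    """Clip a tangent-space delta so its rotation angle stays within MAX_ANGLE."""
    angle = np.linalg.norm(delta_rotvec)
    if angle > MAX_ANGLE:
        delta_rotvec = delta_rotvec * (MAX_ANGLE / angle)
    return delta_rotvec

def apply_delta_local(current: R, delta_rotvec: np.ndarray) -> R:
    """Delta in the body/local frame: compose on the right."""
    return current * R.from_rotvec(clip_rotvec(delta_rotvec))

def apply_delta_global(current: R, delta_rotvec: np.ndarray) -> R:
    """Same delta interpreted in the world/global frame: compose on the left."""
    return R.from_rotvec(clip_rotvec(delta_rotvec)) * current

pose = R.from_euler("xyz", [0.0, 0.5, 0.0])
step = apply_delta_local(pose, np.array([0.0, 0.0, 0.5]))  # clipped to 0.1 rad
```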

2. Findings

The researchers study the effect of action representation on different aspects of deep RL. They try three algorithms (PPO, SAC, and TD3) in a toy environment and draw some conclusions:

  1. Uniqueness and smoothness help. But when the angular change is restricted, what matters is that discontinuities are out of reach of the action space.
  2. Representation affects exploration. For example, Gaussian random Euler angles are concentrated around certain rotations. Euler angles and quaternions are most affected, matrices only to some degree, and the local tangent space the least. Designing noise distributions that account for the projection of each action representation is difficult. See Figure 7.
  3. Entropy regularization increases action magnitude but fails to increase diversity, again because the projections warp random distributions. With dense rewards this seemed less of a problem.
  4. Zero-centered delta actions result in a large spread of rotations for quaternions and rotation matrices. Centering delta actions at the unit quaternion and identity rotation improved performance for PPO; for SAC and TD3 there was little change.
  5. Scaling actions to the range of permissible rotations improves performance and stability, because the policy doesn't need to learn that rotations with the same direction but magnitudes above some threshold all lead to the same effect.
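Finding 5 can be sketched as a small action-mapping function. This is an illustrative assumption, not the paper's code: the raw policy output lives in \([-1, 1]^3\) and is mapped so the resulting rotation angle never exceeds an assumed limit `MAX_ANGLE`:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

MAX_ANGLE = 0.1  # assumed per-step rotation limit (radians)

def action_to_rotation(action: np.ndarray) -> R:
    """Map a raw policy output in [-1, 1]^3 to a rotation whose angle
    never exceeds MAX_ANGLE, so the policy never needs to learn that
    larger-magnitude actions saturate to the same effect."""
    rotvec = np.clip(action, -1.0, 1.0) * MAX_ANGLE
    angle = np.linalg.norm(rotvec)
    if angle > MAX_ANGLE:
        rotvec *= MAX_ANGLE / angle
    return R.from_rotvec(rotvec)
```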

3. Practical Suggestions

The authors themselves give a good TLDR (https://amacati.github.io/so3_primer/):

Use tangent vectors (i.e. axis-angle) in the local frame

  • Default choice: Delta tangent vectors in the local frame. Scale outputs to the range of permissible rotations.
  • Dense rewards help: Continuous feedback can mask representation issues. Sparse rewards amplify differences.
  • For unstable systems (e.g. drones): Tangent vectors remain the best choice. If using matrices/quaternions, use delta actions and unit-centering. For limited operation ranges, Euler angles can be viable.
  • Fixed target poses: If your task involves reaching fixed target poses (not relative positioning), matrices or quaternions in the global frame may match or beat deltas.
  • Avoid Euler angles for general tasks: Delta Euler angles work for small rotations but degrade as coverage of SO(3) increases.
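The "unit-centering" suggestion for quaternion deltas can be sketched as follows (a hypothetical helper, assuming scipy's scalar-last quaternion convention): the 4D policy output is interpreted as an offset from the identity quaternion, so a zero output maps exactly to "no rotation":

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def quat_delta_unit_centered(raw: np.ndarray) -> R:
    """Interpret a 4D policy output as an offset from the identity
    quaternion [0, 0, 0, 1] (scalar-last), then project back onto the
    unit sphere. Zero-centering instead would spread outputs over
    essentially random rotations."""
    q = np.array([0.0, 0.0, 0.0, 1.0]) + raw
    return R.from_quat(q / np.linalg.norm(q))

identity = quat_delta_unit_centered(np.zeros(4))  # zero output -> no rotation
```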

4. Discussions

Conflicting policy gradients:

When a uni-modal policy parameterization is used, the double cover in quaternions and the partial overlap in tangent space create conflicting gradients. E.g. for quaternions both \(q\) and \(-q\) can be the optimal action, producing gradients in opposite directions and harming the learning process.
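One common mitigation (an assumption here, not necessarily what the paper does) is to canonicalize quaternion targets onto a single hemisphere before computing a loss, so \(q\) and \(-q\) map to the same representative:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def canonicalize(q: np.ndarray) -> np.ndarray:
    """Resolve the double cover by flipping to the hemisphere with a
    non-negative scalar part (scalar-last convention), so q and -q
    yield one canonical target for the loss."""
    return -q if q[3] < 0.0 else q

q = R.from_euler("z", 3.0).as_quat()
# Both signs of the quaternion map to the same canonical form.
assert np.allclose(canonicalize(q), canonicalize(-q))
```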

