2024-09-26

Optimization on Manifolds

MIT EECS 6838

1. Optimization on Manifolds

A common optimization problem is:

\begin{equation*} \min_{x \in \mathcal{M}} f(x) \end{equation*}

where the manifold \(\mathcal{M}\) can be

  • Euclidean Space \(\mathbb{R}^n\)
  • Unit Sphere \(S^{n-1}\)
  • Rotation Group \(SO(n)\)

    e.g. imaging a protein frozen in solution gives samples of protein images taken from different rotations, and an optimization problem over the rotation group can be used to reconstruct the full 3D structure of the protein. (00:04:08 Cryo-EM)

2. Embedding makes optimization difficult

A typical approach is to embed the manifold in a larger space and write the problem as a constrained optimization problem.

E.g.

\begin{equation*} \min_{R \in SO(3)} f(R) \to \min_{R^TR = I_3; \det(R) = 1} f(R) \end{equation*}

But embedding into \(\mathbb{R}^{3 \times 3}\) increases the difficulty of the problem: the constraints cut out a nonconvex, curved set in the embedding space.

using_constraints_makes_optimization_difficult-20240926113048.png

Figure 1: Using constraints makes optimization difficult

Also, optimizing this way is not straightforward: the objective wants to move in one direction, while the constraint wants to keep the variables in another.

We could take a step and then project, but projecting a point onto a complicated constraint set is not always straightforward.

3. Intrinsic Approach

Instead, optimize the problem in the intrinsic geometry of the manifold itself.

3.1. Gradient Descent

\begin{equation*} x_{k + 1} = x_k - \alpha_k \nabla f(x_k) \end{equation*}

There are two main aspects to Gradient Descent:

  1. Gradient direction \(\nabla f(x_k)\)
  2. Walking along the domain (i.e. taking the step \(x_k - \alpha_k \nabla f(x_k)\), which relies on vector-space operations)

On a sphere, we can take the derivative intrinsically (i.e. find a tangent vector); in Section 5.1 this is done by projecting the ambient gradient onto the tangent plane.

On a sphere, the exponential map moves a point along the sphere: it takes a point on the manifold and a tangent vector at that point, and moves the point along a geodesic.

This allows us to use a gradient-descent-like algorithm in an intrinsic way:

\begin{equation*} x_{k+1} = \exp_{x_k}(-\alpha_k \nabla f(x_k)) \end{equation*}

first_order_manifold_optimization-20240926114548.png

Figure 2: First-Order Manifold Optimization
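
To make the update rule concrete, here is a minimal sketch of the generic loop (my own illustration, not lecture code); riemannian_grad and retract are placeholder callables that each specific manifold must supply, as worked out for the sphere and the Stiefel manifold below.

  # Generic first-order manifold optimization (a sketch): the caller supplies
  # the manifold-specific pieces -- a Riemannian gradient and a retraction
  # (or exponential map).
  def manifold_gradient_descent(x0, riemannian_grad, retract, step=0.1, iters=100):
      """Iterate x_{k+1} = R_{x_k}(-alpha * grad f(x_k)) with a fixed step size."""
      x = x0
      for _ in range(iters):
          g = riemannian_grad(x)        # tangent vector at x
          x = retract(x, -step * g)     # move along the manifold
      return x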

The benefits of this approach are:

  • Better algorithms:
    • Automatic constraint satisfaction
    • Specialized to the space
  • Theoretical perspective:
    • Generalizes convexity
    • Gradient Descent and other methods have analogues

Books and Resources:

Gradient descent is a first-order algorithm, so instead of the often difficult to compute Exponential Map, we can use a Retraction Map, which agrees with the Exponential Map up to first order.

4. Relevant Mathematics

4.1. Differential

The analogue of the derivative:

\begin{equation*} df_{x_0} (v) = \lim_{h \to 0} \frac {f(x_0 + h v) - f(x_0)} h \end{equation*} \begin{equation*} df_{x_0}(v) = Df(x_0) \cdot v \end{equation*}
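
A quick numerical check of these two formulas (my own illustration with a made-up \(f\)): the finite-difference quotient should match the Jacobian-vector product.

  import numpy as np

  # Check df_{x0}(v) = lim_{h->0} (f(x0 + h v) - f(x0)) / h  against  Df(x0) v.
  def f(x):
      return np.array([x[0]**2 + x[1], np.sin(x[0])])

  def Df(x):  # Jacobian of f, written by hand
      return np.array([[2 * x[0], 1.0],
                       [np.cos(x[0]), 0.0]])

  x0 = np.array([0.3, -1.2])
  v = np.array([0.5, 2.0])
  h = 1e-6
  fd = (f(x0 + h * v) - f(x0)) / h   # finite-difference approximation of df_{x0}(v)
  jv = Df(x0) @ v                    # Jacobian-vector product Df(x0) v
  print(np.allclose(fd, jv, atol=1e-4))   # True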

4.2. Tangent Space

The set of tangent vectors at a point.

4.3. Exponential Map

\begin{equation*} \exp_p(v) = \gamma_v(1) \end{equation*}

where \(\gamma_v\) is the (unique) geodesic starting at \(p\) with initial velocity \(v\).

4.4. Retraction

In general, geodesic curves are difficult to compute, but we don't always need a perfect geodesic.

We can replace the Exponential Map with a Retraction, which is a weaker notion of the exponential map; the two agree up to first order.

retraction-20240926191302.png

Figure 3: Retraction

4.5. Riemannian Manifold

A pair \((M, g)\) of a differentiable manifold \(M\) and a positive definite inner product at each point:

\(g_p(\cdot, \cdot): T_p M \times T_p M \to \mathbb{R}\)

\(g_p\) is a symmetric, bilinear, positive definite form.

4.6. Riemannian Gradient

  • Metric tensor \(g \in \mathbb{R}^{n \times n}\)
  • Gradient in coordinates \(\nabla f \in \mathbb{R}^n\)

    \begin{equation*} \nabla_g f = g^{-1} \nabla f \end{equation*}

Here \(\phi\) maps a parametrization domain in \(\mathbb{R}^n\) to \(M\); a small numerical example follows Figure 4.

gradient_in_riemannian_manifold-20240929110914.png

Figure 4: Gradient in Riemannian Manifold
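
A tiny sketch of the formula \(\nabla_g f = g^{-1} \nabla f\) in coordinates, using an assumed toy metric (my own example, not one from the lecture):

  import numpy as np

  # Riemannian gradient in coordinates: grad_g f = g^{-1} grad f.
  # Toy example: a polar-style diagonal metric g = diag(1, r^2) at a point with r = 2.
  grad_f = np.array([3.0, 4.0])            # gradient in coordinates
  g = np.diag([1.0, 2.0 ** 2])             # metric tensor at the point (assumed)
  grad_g_f = np.linalg.solve(g, grad_f)    # g^{-1} grad f
  print(grad_g_f)                          # [3.  1.]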

5. Optimization on Unit Sphere

As an example of the intrinsic approach to optimization, let's see how we can optimize on the unit sphere.

\begin{equation*} S^{n-1} = \{ x \in \mathbb{R}^n : ||x||_2 = 1 \} \end{equation*}

The tangent space of the sphere at a point \(p\) is the space of vectors orthogonal to \(p\):

\begin{equation*} T_p S^{n-1} = \{ v \in \mathbb{R}^n : v \cdot p = 0 \} \end{equation*}

Our optimization problem is:

\begin{equation*} \min_{x \in S^{n-1}} x^T A x \end{equation*}

To do optimization by gradient descent/ascent, we need two things:

  • A way to compute the gradient, i.e. a direction
  • A way to take a step in that direction

5.1. Computing gradient

We can view the objective function on the sphere as the restriction of an objective function on all of \(\mathbb{R}^n\).

In this case (the sphere), if we want the intrinsic gradient of the function \(f\), we can compute the extrinsic gradient in \(\mathbb{R}^n\) and project it onto the tangent space:

\begin{equation*} \nabla_{S^{n-1}} f (p) = (I_{n \times n} - pp^T) \nabla_{\mathbb{R}^n} f(p) \end{equation*}
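
A minimal numpy sketch of this projection (my own illustration), using \(f(x) = \frac 1 2 x^T A x\) so that the ambient gradient is \(Ax\) for symmetric \(A\):

  import numpy as np

  def sphere_grad(p, ambient_grad):
      """Project the ambient gradient onto the tangent space of S^{n-1} at p."""
      return ambient_grad - p * (p @ ambient_grad)   # (I - p p^T) grad

  rng = np.random.default_rng(0)
  n = 4
  A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric A
  p = rng.standard_normal(n); p /= np.linalg.norm(p)   # point on the sphere
  g = sphere_grad(p, A @ p)                            # Riemannian gradient at p
  print(abs(p @ g) < 1e-12)   # True: the result lies in the tangent space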

5.2. Computing Retraction

For gradient descent, we need a gradient in the intrinsic space and also a retraction. In this case there are two typical options for the retraction: the Exponential Map and Projection.

5.2.1. Exponential Map

\begin{equation*} \exp_p(v) = p \cos||v||_2 + \frac {v \sin ||v||_2} {||v||_2} \end{equation*}

This function maps a tangent vector \(v\) to the point on the sphere at distance \(||v||_2\) from \(p\) along the geodesic in that direction.
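
A direct transcription of this formula (my own sketch, with a guard for \(v = 0\)):

  import numpy as np

  def sphere_exp(p, v):
      """Exponential map on S^{n-1}: follow the geodesic from p with velocity v."""
      nv = np.linalg.norm(v)
      if nv < 1e-12:
          return p                         # exp_p(0) = p
      return p * np.cos(nv) + (v / nv) * np.sin(nv)

  p = np.array([1.0, 0.0, 0.0])
  v = np.array([0.0, 0.5, 0.2])            # tangent vector at p (orthogonal to p)
  print(np.isclose(np.linalg.norm(sphere_exp(p, v)), 1.0))   # True: stays on the sphere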

5.2.2. Projection

\begin{equation*} R_p(v) = \frac {p + v} {||p+v||_2} \end{equation*}

Take a gradient step off of the sphere and project it back orthogonally.

\(R_p(v)\) and \(\exp_p(v)\) don't match when \(v\) is large, but they agree locally (to first order).
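
A sketch comparing the two numerically (my own illustration): the gap between \(R_p(v)\) and \(\exp_p(v)\) shrinks as the step gets smaller.

  import numpy as np

  def sphere_retract(p, v):
      """Projection retraction: step off the sphere, then renormalize."""
      return (p + v) / np.linalg.norm(p + v)

  def sphere_exp(p, v):
      nv = np.linalg.norm(v)
      return p if nv < 1e-12 else p * np.cos(nv) + (v / nv) * np.sin(nv)

  p = np.array([1.0, 0.0, 0.0])
  t = np.array([0.0, 1.0, 0.0])            # unit tangent vector at p
  for scale in (1.0, 1e-1, 1e-3):
      gap = np.linalg.norm(sphere_retract(p, scale * t) - sphere_exp(p, scale * t))
      print(scale, gap)                    # the gap shrinks with the step size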

6. Optimization on Stiefel Manifold

\begin{equation*} V_k(\mathbb{R}^n) = \{ X \in \mathbb{R}^{n \times k}: X^T X = I_{k \times k} \} \end{equation*}

The manifold of \(n \times k\) matrices with orthonormal columns.

If \(k = 1\), this is the same as \(S^{n-1}\).

For manifold optimization we need to define (a sketch of typical choices follows Figure 5):

  1. Tangent Space
  2. Projection on the tangent space
  3. Retraction

tangent_space_and_retraction_for_stiefel_manifold-20240929135414.png

Figure 5: Tangent Space and Retraction for Stiefel Manifold
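
One standard set of choices (a sketch under common conventions, not necessarily the exact formulas used in lecture) is the tangent-space projection \(P_X(Z) = Z - X \, \mathrm{sym}(X^T Z)\) and a QR-based retraction:

  import numpy as np

  def stiefel_project(X, Z):
      """Project an ambient n-by-k matrix Z onto the tangent space at X."""
      sym = (X.T @ Z + Z.T @ X) / 2
      return Z - X @ sym

  def stiefel_retract(X, V):
      """QR-based retraction: orthonormalize the columns of X + V."""
      Q, R = np.linalg.qr(X + V)
      return Q * np.sign(np.diag(R))       # sign fix keeps the map continuous

  rng = np.random.default_rng(0)
  X, _ = np.linalg.qr(rng.standard_normal((6, 3)))      # a point on V_3(R^6)
  V = stiefel_project(X, rng.standard_normal((6, 3)))   # a tangent vector at X
  Y = stiefel_retract(X, 0.1 * V)
  print(np.allclose(Y.T @ Y, np.eye(3)))   # True: Y has orthonormal columns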

7. Rayleigh Quotient Minimization

\begin{equation*} \min_{x \in S^{n-1}} \frac 1 2 x^T A x \end{equation*}

For symmetric \(A\), this is equivalent to computing an eigenvector of the minimum eigenvalue of \(A\), as the Lagrangian stationarity condition below indicates:

\begin{equation*} \Lambda(x; \lambda) = \frac 1 2 x^T A x + \lambda (\frac 1 2 - \frac 1 2 ||x||^2_2) \end{equation*} \begin{equation*} 0 = \nabla_x \Lambda = Ax - \lambda x \implies Ax = \lambda x \end{equation*}

We then define the tangent space and retraction map, and iterate to find the solution.
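
Putting the pieces together, here is a minimal sketch of that iteration (step size and iteration count are my own arbitrary choices), using the projected gradient from Section 5.1 and the projection retraction from Section 5.2.2:

  import numpy as np

  # Minimize f(x) = 1/2 x^T A x over the unit sphere by Riemannian gradient descent.
  rng = np.random.default_rng(1)
  n = 5
  A = rng.standard_normal((n, n)); A = (A + A.T) / 2    # symmetric A

  x = rng.standard_normal(n); x /= np.linalg.norm(x)    # random starting point
  step = 0.1
  for _ in range(2000):
      g = A @ x                      # Euclidean gradient of 1/2 x^T A x
      g = g - x * (x @ g)            # project onto the tangent space at x
      x = x - step * g               # step off the sphere...
      x /= np.linalg.norm(x)         # ...and retract back (projection retraction)

  print(x @ A @ x, np.linalg.eigvalsh(A).min())   # the two values should (nearly) agree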

Also, to find the eigenvector of any other eigenvalue, we can formulate another problem:

\begin{equation*} \widetilde{f}(x) = ||\nabla_{S^{n-1}} (\frac 1 2 x^T A x)||_2^2 \end{equation*}

This works because the Riemannian gradient is zero at every eigenvector, and the squared \(\ell_2\) norm makes those zeros the minima of \(\widetilde{f}\).
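
A quick numerical check of this claim (my own sketch): \(\widetilde{f}\) vanishes at every eigenvector of \(A\), not just at the extremal ones.

  import numpy as np

  rng = np.random.default_rng(2)
  A = rng.standard_normal((4, 4)); A = (A + A.T) / 2

  def f_tilde(x):
      g = A @ x                       # Euclidean gradient of 1/2 x^T A x
      g = g - x * (x @ g)             # Riemannian gradient on the sphere
      return g @ g                    # squared norm

  w, V = np.linalg.eigh(A)
  print([float(f_tilde(V[:, i])) for i in range(4)])   # all values are (numerically) zero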

8. Different Variations of PCA as objective functions

Optimizing on the Stiefel manifold corresponds to computing PCA.

table_of_pca_variations-20240929145820.png

Figure 6: Table of PCA Variations

For each of the other variants of PCA, we could design a custom optimizer. Or we could simply run the generic gradient descent algorithm on the same manifold, with the objective function regularized in various ways.
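
As a closing sketch (my own illustration, reusing the Stiefel projection and retraction from Section 6), vanilla PCA can be written as minimizing \(-\mathrm{tr}(X^T C X)\) over the Stiefel manifold; the regularized variants in the table only change this objective, not the algorithm.

  import numpy as np

  # PCA as manifold optimization: minimize -tr(X^T C X) over V_k(R^n).
  rng = np.random.default_rng(3)
  data = rng.standard_normal((200, 6))
  C = np.cov(data, rowvar=False)                        # sample covariance

  def stiefel_project(X, Z):
      return Z - X @ ((X.T @ Z + Z.T @ X) / 2)

  def stiefel_retract(X, V):
      Q, R = np.linalg.qr(X + V)
      return Q * np.sign(np.diag(R))

  k = 2
  X, _ = np.linalg.qr(rng.standard_normal((6, k)))      # random starting point
  for _ in range(2000):
      G = -2 * C @ X                                    # Euclidean gradient of -tr(X^T C X)
      X = stiefel_retract(X, -0.1 * stiefel_project(X, G))

  top = np.linalg.eigh(C)[1][:, -k:]                    # top-k eigenvectors of C
  print(np.linalg.svd(top.T @ X, compute_uv=False))     # singular values approach 1 at convergence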

