2026-01-12

Optimization and Learning via Stochastic Gradient Search

[pdf][press.princeton.edu]

These are my notes for the book "Optimization and Learning via Stochastic Gradient Search" by Bernd Heidergott and Felisa Vázquez-Abad.

I am still going through the book.

1. Gradient Based Optimization

We only consider the following:

  • continuous optimization
  • deterministic methods

1.1. Unconstrained Optimization

Problem:

The objective is to minimize the cost function \(J(\theta): \mathbb R^d \to \mathbb R\), i.e. find a point \(\theta\) that minimizes the cost function.

If \(J(\theta)\) is linear (affine) it is called a linear problem, else it is called a Non-Linear Problem (NLP).

Learning is a special kind of optimization problem, where we have an input data vector \(x\) and the cost function is expressible as:

\begin{align*} J(\theta, x) = ||f(\theta, x) - h(x)||^2 \end{align*}

Basic terminology:

  1. Level sets: Sets of points \(\theta\) at which the cost takes the same value: \(\mathcal L_{\alpha}(J) = \{\theta: J(\theta) = \alpha \}\)
  2. Convex function: A function is convex if, for any two points A, B on its graph, the line segment AB lies on or above the graph (equivalently, the region above the graph is a convex set).
  3. Stationary point: A point that satisfies \(\nabla J(\theta) = 0\)
  4. First order optimality condition: A local minimum/maximum must be a stationary point
  5. Second order optimality condition: The Hessian at a stationary point must be positive semidefinite for it to be a local minimum (positive definite is sufficient for a strict local minimum)

Solution algorithms:

If we can solve \(\nabla J(\theta) = 0\) analytically, then we can evaluate \(J\) at the stationary points and pick the minimum.

This is rarely feasible analytically for problems of our interest. Thus we use an iterative algorithm to approximate a root, using a recursion of the form:

\begin{align*} \theta_{n+1} = \theta_n + \epsilon_n d(\theta_n) \end{align*}

Different algorithms differ in their choice of

  • step size (aka gain size, learning rate) \(\epsilon_n\)
  • the direction \(d(\theta_n)\)
  • starting point \(\theta_0\) and stopping rule
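As a concrete illustration, here is a minimal sketch (in Python, on a made-up quadratic cost, not an example from the book) of this recursion with \(d(\theta_n) = -\nabla J(\theta_n)\) and a constant step size:

import numpy as np

# Hypothetical quadratic cost J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_J(theta):
    return A @ theta - b

theta = np.zeros(2)          # starting point theta_0
eps = 0.1                    # constant step size (learning rate)
for n in range(200):
    d = -grad_J(theta)       # descent direction d(theta_n)
    theta = theta + eps * d  # theta_{n+1} = theta_n + eps_n d(theta_n)

print(theta)                 # should approach the solution of A theta = b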

1.1.1. Newton-Raphson Method

\begin{align*} \theta_{n+1} = \theta_n - [\nabla^2 J(\theta_n) ]^{-1} \nabla J(\theta_n) \end{align*}
  • Step size is adaptive: \(\epsilon_n = [\nabla^2 J(\theta_n) ]^{-1}\) (a matrix rather than a scalar)
  • In a neighbourhood of a local minimum, the Hessian \(\nabla^2 J(\theta)\) is positive definite and thus:

    1. The Hessian is invertible
    2. Multiplying the negative gradient direction by the inverse Hessian still gives a descent direction [Lemma 1.1]
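A minimal sketch of the Newton-Raphson update on an illustrative smooth cost (the function and its derivatives below are made up, not from the book); the Hessian system is solved rather than explicitly inverted:

import numpy as np

def grad_J(theta):       # gradient of the example cost J(t1, t2) = t1^4/4 + t2^2 - t1*t2
    t1, t2 = theta
    return np.array([t1**3 - t2, 2 * t2 - t1])

def hess_J(theta):       # corresponding Hessian
    t1, _ = theta
    return np.array([[3 * t1**2, -1.0], [-1.0, 2.0]])

theta = np.array([1.5, 0.5])
for n in range(20):
    g = grad_J(theta)
    H = hess_J(theta)
    step = np.linalg.solve(H, g)   # [Hessian]^{-1} grad, without forming the inverse
    theta = theta - step           # Newton-Raphson update
print(theta, grad_J(theta))        # gradient should be (numerically) zero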

Benefits:

  • Quadratic convergence

Problems:

  • The Hessian may not be invertible at every point
  • A zero of the gradient can be a saddle or inflection point rather than a minimum
  • Computation at each step is expensive (compute gradient, Hessian, inversion of the Hessian)

Quasi-Newton methods attempt to solve those problems, e.g. the Barzilai-Borwein method used in the Gradient Projection for Sparse Reconstruction paper.

1.1.2. Cauchy's Method - Steepest Descent

At each step, move along the negative gradient direction until you reach the minimum on that line.

\begin{align*} \epsilon_n = \arg \min_{\epsilon > 0} J(\theta_n - \epsilon \nabla J(\theta_n)) \end{align*}

Figure 1: Cauchy's Steepest Descent method used on the Himmelblau function. Source: indrag49.github.io/Numerical-Optimization/
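A minimal sketch of the exact line search step, using scipy's scalar minimizer on the illustrative quadratic from above (not the Himmelblau function in the figure):

import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
J = lambda th: 0.5 * th @ A @ th - b @ th
grad_J = lambda th: A @ th - b

theta = np.zeros(2)
for n in range(50):
    g = grad_J(theta)
    # epsilon_n = argmin_{eps > 0} J(theta_n - eps * grad J(theta_n))
    res = minimize_scalar(lambda eps: J(theta - eps * g), bounds=(0.0, 10.0), method="bounded")
    theta = theta - res.x * g
print(theta, np.linalg.solve(A, b))   # the two should agree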

1.1.3. Non-adaptive step size

Non-adaptive step size means step size is independent of the current parameter \(\theta\). The step size is either constant or decreasing.

There are theorems for convergence of the gradient descent algorithm. In the proofs of these theorems (Theorem 1.3, Theorem 1.4) one crucial relation (let's call it the Descent Lemma) is used:

\begin{align*} J(\theta_{i+1}) < J(\theta_i) - \epsilon_i \left( 1 - \frac 1 2 L \epsilon_i \right) ||\nabla J (\theta_i)||^2 \end{align*}

So, when the step size is small enough, \(\epsilon_i < 2 / L\), the cost function decreases with each step.

  • For a decreasing step size, after sufficiently large \(n\) the step size will satisfy this relation and the algorithm converges.
  • For a fixed step size, we explicitly require this relation to be satisfied.
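A small numeric check of this condition (an illustration, not from the book): for \(J(\theta) = \frac L 2 \theta^2\) the gradient is \(L\)-Lipschitz, so steps below \(2/L\) decrease the cost while larger steps diverge.

L = 4.0                                   # Lipschitz constant of grad J for J = (L/2) theta^2
grad_J = lambda th: L * th

def run(eps, steps=50, theta=1.0):
    for _ in range(steps):
        theta = theta - eps * grad_J(theta)
    return theta

print(run(eps=0.4))   # 0.4 < 2/L = 0.5  -> converges towards 0
print(run(eps=0.6))   # 0.6 > 2/L = 0.5  -> |theta| blows up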

1.1.4. Decreasing step size

If the descent direction \(d(\theta)\) is bounded, then we need the following for our algorithm to be able to explore all of the parameter space:

\begin{align*} \sum_{n=1}^\infty \epsilon_n = \infty \end{align*}
1.1.4.1. Bounded gradient

Theorem 1.3 [# 26] gives sufficient conditions under which the sequence obtained via a gradient descent algorithm finds a stationary point. The conditions are:

  • Update is gradient descent: \(\theta_{n+1} = \theta_n - \epsilon_n \nabla J(\theta_n)\)
  • The gradient \(\nabla J\) is Lipschitz continuous
  • Gradient is bounded at evaluation points. i.e. \(\{|| \nabla J(\theta_n) ||: n \geq 0\}\) is bounded
  • Gain sequences satisfy:
    1. \(\epsilon_n > 0\)
    2. \(\sum \epsilon_n = + \infty\)
    3. \(\sum \epsilon^2_n < \infty\)

Then the algorithm either

  • converges to a stationary point of \(J(\theta)\)
  • or at least gives a strictly monotone decreasing sequence of \(J(\theta_n)\) after sufficiently large \(n\).

    Think of it as the stationary point being at infinity. See Figure 1.6: the vanishing gradient on the right side leads to infinity, while if we start at \(\theta < 0\) we converge to the minimum.

The problem with this theorem is that the boundedness of the gradient along the iterates is not straightforward to check.

1.1.4.2. Biased gradient estimate

Lemma 1.3 [# 29] is an extension of Theorem 1.3 to the case where we don't have exact gradients but a biased estimate of the gradient, \(\nabla J(\theta_n) + \beta_n(\theta_n)\). If the bias term goes to zero, \(\beta_n(\theta_n) \to 0\), then we recover the result of Theorem 1.3.

If the bias doesn't go to zero but to a fixed constant \(\beta\) then the algorithm converges to the solution of an adjusted objective \(J(\theta) + \beta \theta\).

1.1.4.3. Finite difference estimate

Example 1.6 [# 29] If we can't compute the gradient but can evaluate the cost function at any point, then we can use a finite difference approximation of the gradient. A central difference estimate would be:

\begin{align*} \frac {J(\theta_n + c_n) - J(\theta_n - c_n)} {2 c_n} = \nabla J(\theta_n) + \beta_n(\theta_n, c_n) \end{align*}

This has a bias term that depends on \(\theta\) and the radius \(c_n\).

The algorithm converges if:

\begin{align*} \sum \epsilon_n \beta_n < \infty \end{align*}

This also requires that the bias term goes to 0, i.e. \(\beta_n(\theta_n, c_n) \to 0\). And the bias term goes to zero if

  • \(J'''(\theta_n)\) is uniformly bounded within the points \(\theta_n\) we visit.
  • and \(c_n \to 0\)
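A minimal sketch of a central-difference gradient estimate in the scalar case, with a decreasing radius \(c_n\) and Robbins-Monro type step sizes (the cost function here is illustrative, not from the book):

import numpy as np

J = lambda th: np.cos(th) + 0.1 * th**2     # example cost; true gradient is -sin(th) + 0.2*th

theta = 2.0
for n in range(1, 500):
    eps_n = 1.0 / n                          # step sizes: sum = infinity, sum of squares finite
    c_n = 1.0 / n**0.25                      # radius c_n -> 0
    g_hat = (J(theta + c_n) - J(theta - c_n)) / (2 * c_n)   # central difference estimate
    theta = theta - eps_n * g_hat
print(theta, -np.sin(theta) + 0.2 * theta)   # the gradient at the final point should be near zero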

1.1.5. Fixed step size

Theorem 1.4 [# 30] If the step size is fixed, then just like Theorem 1.3, gradient descent converges to a stationary point or leads to a monotone decreasing sequence, when the following conditions are met:

  1. The gradient is Lipschitz continuous with Lipschitz constant \(L\)
  2. The step size satisfies \(0 < \epsilon < 2/L\)

Theorem 1.4 has extensions similar to Lemma 1.3. When the gradient is biased, the algorithm converges to the solution of \(\nabla J(\theta) + \beta = 0\), i.e. the solution of the adjusted objective \(J(\theta) + \beta \theta\)

(where \(\beta\) is the limit of the bias, \(\beta_n \to \beta\)).

1.2. Constrained Optimization

Problem:

The objective is to optimize the cost function \(J(\theta)\) subject to:

  1. inequality constraints \(g_i(\theta) \leq 0\)
  2. equality constraints \(h_i(\theta) = 0\)

Basic Terminology:

  • Convex problem: If \(J(\theta)\) and \(g_i(\theta)\) are convex and \(h_i(\theta)\) is affine, then the NLP is a convex problem.
  • Active inequality constraint: If \(g_i(\theta) = 0\) then the constraint \(g_i\) is said to be active, otherwise inactive.

Solution algorithms:

The main idea of solution methods for constrained optimization problems is to reformulate the problem in terms of unconstrained optimization.

1.2.1. Lagrangian formulation

The Lagrangian associated to this NLP is

\begin{align*} \mathcal L(\theta, \lambda, \eta) = J(\theta) + \lambda g(\theta) + \eta h(\theta) \end{align*}

Interpretation of Lagrange multipliers:

The Lagrange multipliers can be viewed as the rate of change of the optimal cost as the level of a constraint changes. See Theorem 1.7. Practically, in economics they are called shadow prices; in physics they can represent concrete physical quantities.

Constraint qualification conditions:

For the Lagrangian formulation to be useful with gradient based methods, some "constraint qualification" conditions must be satisfied. These conditions ensure that Lagrange multipliers satisfying the KKT conditions exist at the stationary point. In other words, without constraint qualification conditions there can be stationary points (and local minima) that don't satisfy the KKT conditions. The constraint qualification conditions are:

  1. The gradients of the constraints \(\{ \nabla h_i (\theta); \nabla g_i(\theta)\}\) must be linearly independent,
  2. And there must exist a direction along which we can move without violating the constraints, i.e. a \(v\) exists such that
    • \(\nabla h_i(\theta) v = 0\)
    • \(\nabla g_i(\theta) v < 0\) for all \(g_i\) that are active.

Karush Kuhn Tucker (KKT) condition:

If the constraint qualification conditions hold at a local minimum, then the KKT conditions are satisfied there. Those stationary points which also satisfy the KKT conditions are called KKT points.

  1. The first order optimality condition for the original NLP:

    \begin{align*} \nabla_\theta \mathcal L(\theta, \lambda, \eta) = 0 \end{align*}
  2. The inequality constraint:

    \begin{align*} \nabla_\lambda \mathcal L(\theta, \lambda, \eta) = g(\theta) \leq 0 \end{align*}

    And the complementary slackness condition for the inequality constraints:

    \begin{align*} \lambda_i g_i(\theta) = 0 \end{align*}

    Which means either the constraint is inactive, \(g_i(\theta) < 0\) (requiring \(\lambda_i = 0\)), or the constraint is active, \(g_i(\theta) = 0\). (The multipliers also satisfy \(\lambda_i \geq 0\).)

  3. The equality constraint:

    \begin{align*} \nabla_\eta \mathcal L(\theta, \lambda, \eta) = h(\theta) = 0 \end{align*}

Second order condition:

The KKT conditions provide the first order condition. For a KKT point to be a local minimum there is a second order condition that must be satisfied. It states that the Hessian of \(\mathcal L\) (with respect to \(\theta\)) must be positive definite along those directions that don't violate the constraints, i.e. for directions \(v \in C(\theta^*, \lambda^*)\) we need

\begin{align*} v^T \nabla^2_\theta \mathcal L(\theta, \lambda, \eta) v > 0 \end{align*}

where the set of directions \(C\) is defined as the set of directions \(v\) that satisfy:

  1. \(\nabla h_i v = 0\)
  2. \(\nabla g_i v = 0\) if \(g_i\) is active with \(\lambda_i > 0\), and \(\nabla g_i v \leq 0\) if \(g_i\) is active with \(\lambda_i = 0\) (inactive constraints impose no restriction).

This is given in Theorem 1.6 [# 34]
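As a tiny worked example (mine, not from the book), take \(J(\theta) = \theta_1^2 + \theta_2^2\) with the single equality constraint \(h(\theta) = \theta_1 + \theta_2 - 1 = 0\). The KKT point is \(\theta^* = (1/2, 1/2)\) with \(\eta^* = -1\), and the conditions can be checked numerically:

import numpy as np

J      = lambda th: th[0]**2 + th[1]**2
grad_J = lambda th: 2 * th
h      = lambda th: th[0] + th[1] - 1.0
grad_h = lambda th: np.array([1.0, 1.0])

theta_star, eta_star = np.array([0.5, 0.5]), -1.0

# Stationarity of the Lagrangian: grad J + eta * grad h = 0
print(grad_J(theta_star) + eta_star * grad_h(theta_star))   # -> [0. 0.]
# Feasibility of the equality constraint
print(h(theta_star))                                        # -> 0.0
# Second order: the Hessian of L w.r.t. theta is 2*I, positive definite in all directions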

1.2.2. Penalty Method

We solve the unconstrained problem:

\begin{align*} J_\alpha (\theta) = J(\theta) + \frac {\alpha} 2 \left( || g(\theta)_+ ||^2 + ||h(\theta)||^2 \right) \end{align*}

where \(g_i(\theta)_+ = \max(0, g_i(\theta))\)

Increase \(\alpha\) to get better solutions:

Theorem 1.8 [# 36] If we increase \(\alpha\) such that \(\alpha_n \to \infty\), then the limit of the optimal points \(\theta_n(\alpha_n)\) (if it exists) is a stationary point of the original constrained optimization problem.

And the Lagrange multipliers are:

\begin{align*} \lambda = \lim_{n \to \infty} \alpha_n (g(\theta_n))_+, \eta = \lim_{n \to \infty} \alpha_n h(\theta_n) \end{align*}

Use gradient based method:

Theorem 1.8 assumes we solve each subproblem exactly, but practically we use a gradient based method to solve each subproblem. For each subproblem we do \(T_n\) iterations (or run until a stopping criterion), and initialize the next subproblem with the solution of the previous one, \(\theta^{n+1}_0 = \theta^n_{T_n}\).

Usually \(T_n\) is small at the beginning and is increased later on.

Alternatively, we can use \(T_n = 1\) and increase \(\alpha_n\) very slowly. This is a two-timescale variation of the penalty method. [# 38]

There is no guarantee of convergence for inexact minimization (i.e. solving the subproblems with an iterative algorithm), for both the penalty and multiplier methods.
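A minimal sketch of the penalty method on the toy problem from above (minimize \(\theta_1^2 + \theta_2^2\) subject to \(\theta_1 + \theta_2 = 1\)), with increasing \(\alpha_n\), inexact inner gradient descent, and warm starts; all constants here are illustrative:

import numpy as np

grad_J = lambda th: 2 * th
h      = lambda th: th[0] + th[1] - 1.0
grad_h = lambda th: np.array([1.0, 1.0])

def grad_J_alpha(th, alpha):
    # gradient of J_alpha = J + (alpha/2) * h^2  (no inequality constraints in this example)
    return grad_J(th) + alpha * h(th) * grad_h(th)

theta = np.zeros(2)
alpha = 1.0
for outer in range(30):
    eps = 0.5 / (2.0 + alpha)             # step size kept small enough for the stiff penalty term
    for inner in range(100):              # T_n inner iterations, warm-started from the previous theta
        theta = theta - eps * grad_J_alpha(theta, alpha)
    alpha *= 2.0                          # alpha_n -> infinity
print(theta)                              # approaches the constrained optimum (0.5, 0.5)
eta_hat = (alpha / 2.0) * h(theta)        # alpha_n * h(theta_n), with the alpha used in the last solve
print(eta_hat)                            # approximates the Lagrange multiplier eta = -1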

1.2.3. Projection Method

At each step of gradient descent, project the tentative solution \(\tilde \theta_{n+1}\) onto the set where the constraints are satisfied.

Theorem 1.9 [# 41] gives the same guarantees as Theorems 1.3 and 1.4 give for unconstrained optimization, using decreasing and fixed step sizes respectively.

Useful for theoretical analysis:

The exact computation of the projection is often the most computationally expensive part. And although not feasible for most problems, the projection method is very useful for the theoretical analysis of algorithms. If the projected version of an algorithm converges, then the unprojected version also converges [# 42 Remark 1.3]. And projecting onto a large enough bounded set relaxes some of the conditions needed for the theoretical results about the unprojected version, e.g. the requirement that the gradient be bounded disappears if we project onto some bounded set.

For an example of the projection method see the Gradient Projection for Sparse Reconstruction paper. There they convert an unconstrained problem (with an \(\ell_1\) penalty on \(x\)) into a constrained problem (writing \(x = u - v\) with \(u, v \geq 0\)) and use projection onto the nonnegative orthant to enforce the constraint.
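A minimal sketch of a projected gradient step for box constraints \(l \leq \theta \leq u\), where the projection is just a componentwise clip (an illustration in the same spirit; the cost is the quadratic from earlier):

import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_J = lambda th: A @ th - b

lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # feasible box
project = lambda th: np.clip(th, lo, hi)              # Euclidean projection onto the box

theta = project(np.array([0.8, 0.8]))
eps = 0.1
for n in range(200):
    theta_tilde = theta - eps * grad_J(theta)   # plain gradient step
    theta = project(theta_tilde)                # project back onto the feasible set
print(theta)                                    # constrained optimum on the box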

1.2.4. Multiplier Methods

Solve the problem for a value of the multiplier \(\eta_n\), then keep increasing \(\eta\). The algorithm is:

\begin{align*} \theta_n = \arg \min_\theta \mathcal L (\theta, \eta_n) \\ \eta_{n+1} = \eta_n + \rho_n h(\theta_n) \end{align*}

where \(\rho_n \to \infty\). Then the limit of the iterates \((\theta_n, \eta_n) \to (\theta^*, \eta^*)\) gives a local minimum of the original problem.

Instead of solving for \(\theta_n\) exactly, we can use a gradient based method for an inexact solution up to \(T_n\) iterations, as in the penalty method.

For both the penalty and multiplier methods there is no guarantee of convergence under inexact minimization.

1.2.5. Lagrange Duality Method

For each Lagrangian problem there is a dual problem, expressed as a min-max problem.

In the Uzawa algorithm [# 45], we alternate between

  1. updating \(\theta\) that minimizes the lagrangian
  2. updating \(\lambda, \eta\) to increase the lagrangian

Instead of exact minimization with respect to \(\theta\), the Arrow-Hurwicz iterative algorithm alternates between:

  1. updating \(\theta\) to decrease the lagrangian (a gradient descent step)
  2. updating \(\lambda, \eta\) to increase the lagrangian (a gradient ascent step)
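A minimal sketch of Arrow-Hurwicz style primal-dual updates on the toy equality-constrained problem from above (step size and iteration count are illustrative):

import numpy as np

grad_J = lambda th: 2 * th                      # J = theta_1^2 + theta_2^2
h      = lambda th: th[0] + th[1] - 1.0         # equality constraint h(theta) = 0
grad_h = lambda th: np.array([1.0, 1.0])

theta, eta = np.zeros(2), 0.0
eps = 0.05
for n in range(2000):
    # descent step on the Lagrangian in theta
    theta = theta - eps * (grad_J(theta) + eta * grad_h(theta))
    # ascent step on the Lagrangian in eta
    eta = eta + eps * h(theta)
print(theta, eta)    # approaches (0.5, 0.5) and eta = -1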

1.3. Practical Considerations

  1. Experiment to check if algorithm is good for the problem:

    When using an algorithm for a problem, get a feel for the algorithm. Run it from multiple starting points and, depending on the results, get an idea of whether the problem is well posed for the algorithm. Don't get lost in the theoretical conditions, but when something goes wrong, an explanation is often found by looking at those conditions you couldn't verify. [# 47]

  2. Stepsize

    Can't be too large, can't be too small. Falls under hyperparameter optimization.

  3. Boundedness
  4. Soft vs Hard constraints

    Some constraints can be hard: e.g. with \(J(\theta) = 1 / \sqrt{\theta + 2}\), \(\theta\) can't go below \(-2\). In such cases the projection method works.

  5. Fewer constraints are better

    Sometimes we can see that some constraints are redundant. Sometimes we can argue that some of the parameter components need to be at the edge of their domain. In those cases, remove constraints and simplify the problem.

    Sometimes you can reformulate (e.g. \(\max_{\theta > 0} J(\theta) = \max_{\vartheta} J(\vartheta^2)\)), but beware that projecting back can lead to problems. Or the reformulation might make the problem ill posed (e.g. using a sigmoid transform can cause vanishing gradients and prevent the parameter from taking extreme values).

  6. Convexity

    Can't always be easily checked. Sometimes randomized checks can be done to get a feel. Otherwise don't stress about it; take multiple random initial points and look at the solutions.

2. Iterative Methods as ODE

Many iterative algorithms follow a general form that looks like Euler steps (or some other discretization) of an ODE. Under some conditions on the vector field of the ODE, the limiting behaviour of the algorithm matches that of the ODE's solution. Thus the difficult analysis of discrete iterative algorithms is converted to the simpler analysis of continuous differential equations.

  • Interpolation process: \(\vartheta_\epsilon(t)\) is a continuous function interpolating the discrete solution \(\{\theta_i\}\) obtained with step size \(\epsilon\)

Stability of ODE:

  • Stable: If for any \(\epsilon\) I choose, I can find a smaller \(\delta\) ball around \(\bar x\) where I can start the ODE and be guaranteed to remain inside the bigger \(\epsilon\) ball around \(\bar x\), then the ODE is stable in that region.

    i.e. there exists \(\delta > 0\) such that if \(||x(0) - \bar x|| < \delta\) (i.e. we start within \(\delta\) distance of \(\bar x\)) then we remain forever within \(\epsilon\) distance of \(\bar x\):

\begin{align*} ||x(0) - \bar x|| < \delta \implies ||x(t) - \bar x|| < \epsilon \ \forall t \end{align*}

This is called Lyapunov Stability.

  • Asymptotically Stable: If a stable ODE converges to the fixed point as \(t \to \infty\), i.e. if we start in some region around the equilibrium, the solution eventually reaches it.
  • If the Jacobian of the vector field is Hurwitz at an equilibrium point, then that point is asymptotically stable.

    Being Hurwitz means that all eigenvalues have negative real part. For the gradient-flow field \(-\nabla J\), this corresponds to the Hessian of the cost function being positive definite.

Convergence:

  • Example 2.8: An ODE driven by a descent direction \(G\) traces the stationary points of \(J(\theta)\), and the optimal point is the globally asymptotically stable point of \(G\). This assumes a single stationary point of \(J(\theta)\).
  • Theorems establish that the interpolated difference equation converges in the sup-norm to the solution of an ODE, regardless of whether the projection, penalty or multiplier method is used (see the small numeric illustration below).

    Thus using ODEs to study iterative algorithms is justified.
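A small numeric illustration of this correspondence (mine, not from the book): for \(J(\theta) = \frac \lambda 2 \theta^2\), the gradient-flow ODE \(\dot \theta = -\lambda \theta\) has the explicit solution \(\theta(t) = \theta_0 e^{-\lambda t}\), and the gradient-descent iterates with step \(\epsilon\) track it at the interpolation times \(t = n\epsilon\):

import numpy as np

lam, theta0, eps = 1.0, 1.0, 0.01

# Gradient descent iterates: Euler discretization of dtheta/dt = -lam * theta
theta, iterates = theta0, []
for n in range(301):
    iterates.append(theta)
    theta = theta - eps * lam * theta

# Exact ODE solution at the interpolation times t = n * eps
times = eps * np.arange(301)
ode = theta0 * np.exp(-lam * times)

print(np.max(np.abs(np.array(iterates) - ode)))   # small, and shrinks further as eps -> 0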

Properties of ODE & Iterative Algorithms:

  • A well-posed NLP is one where the set of solutions:
    • is non empty
    • contains only KKT points
    • is confined to a bounded set

After defining a well-posed NLP, the next step is to define a function \(G(\theta)\) so that the asymptotically stable points of the ODE with vector field \(G(\theta)\) are the KKT points of the optimization problem, i.e. the vector field needs to be coercive with respect to our NLP.

Coercivity:

A vector field \(G\) is coercive if the stable points of its ODE correspond to the KKT points of the optimization problem.

Theorem 2.8 If a vector field is coercive, then a deterministic algorithm driven by that field will approximate the solution.

3. Stochastic Approximation

Here we have similar iterative algorithms, but instead of the vector field \(G(\theta_n)\) we have a random variable \(Y_n\) that approximates it.

\begin{align*} \theta_{n+1} = \theta_n + \epsilon_n Y_n \end{align*}

The above is stochastic approximation (SA) when \(\{Y_n = \phi(\theta_n, \xi_n)\}\) is a stochastic process. \(\{\xi_n\}\) is called the underlying process; it encodes the randomness in the feedback signal \(\phi\).

We could directly use \(\phi(\theta_n, \xi_n)\) at each iteration, or we might take many sample observations and use them to construct \(Y_n\). That way we get a lower variance estimate, but we trade off the number of samples per iteration.

3.1. Applications

Let's see some applications.

3.1.1. Root finding

Finding a parameter \(\theta\) so that a system output reaches a specified value \(\alpha\). It is also called the target tracking problem in the control literature. This problem is an example of unsupervised learning.

E.g. Thermostat: The temperature readings \(L(\theta)\) are stochastic and we want it to match a desired temperature \(\alpha\).

Robbins-Monro Algorithm:

It proves that SA converges to the root of a function if the step sizes satisfy what are called the Robbins-Monro conditions: \(\sum \epsilon_n = \infty\) and \(\sum \epsilon^2_n < \infty\)

The proof assumes that \(L(\theta)\) is nondecreasing.
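A minimal sketch of the Robbins-Monro recursion for the thermostat-style problem: find \(\theta\) with \(L(\theta) = \alpha\) when we only see noisy readings (the model \(L(\theta) = 2\theta + 1\) and the noise level are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
alpha = 5.0                                                   # desired target value
noisy_L = lambda th: 2.0 * th + 1.0 + rng.normal(scale=0.5)   # unknown nondecreasing L, observed with noise

theta = 0.0
for n in range(1, 5001):
    eps_n = 1.0 / n                                    # Robbins-Monro step sizes
    theta = theta + eps_n * (alpha - noisy_L(theta))   # move up if the reading is below target, down if above
print(theta)    # approaches the root of L(theta) = alpha, i.e. theta = 2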

3.1.2. Statistical Fitting (Regression)

Take as an example:

\begin{align*} J(\theta) = \frac 1 2 \mathbb E [\left( Z(X) - (\theta_1 + \theta_2 X) \right)^2] \end{align*}

We can derive an expression for the gradient and use an iterative algorithm. But we don't know the (random) function \(Z\); we just have observations of \(Z(x)\) at random samples of \(x\).

Assuming that the samples \(\{x_n\}\) are independent and identically distributed, we can use an SA algorithm called least mean squares (LMS), or the recursive mean square algorithm. [# 95]
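A minimal LMS-style sketch for this regression (the data-generating model is made up): at each step we see one sample \((x_n, z_n)\) and take a stochastic gradient step on the squared error.

import numpy as np

rng = np.random.default_rng(1)
true_theta = np.array([1.0, 3.0])                  # intercept and slope of the unknown model

theta = np.zeros(2)
for n in range(1, 20001):
    x = rng.uniform(-1.0, 1.0)
    z = true_theta[0] + true_theta[1] * x + rng.normal(scale=0.1)   # noisy observation Z(x)
    eps_n = 1.0 / n**0.7                           # decreasing, Robbins-Monro type step size
    residual = z - (theta[0] + theta[1] * x)
    theta = theta + eps_n * residual * np.array([1.0, x])   # stochastic gradient of the squared error
print(theta)    # approaches (1.0, 3.0)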

3.2. Problem Types

3.2.1. Static Problem

  • \(G(\theta)\) is a vector field whose zeros we want to find.
  • We can't observe \(G(\theta)\)
  • It is governed by an underlying process \(\xi(\theta)\) as \(G(\theta) = \mathbb E [ g(\xi(\theta), \theta)]\)
  • But we don't know \(g\) either.
  • Instead we just have a feedback function \(\phi(\xi(\theta), \theta)\) as a proxy for \(g\)
  • So, take

    \begin{align*} G(\theta) = \mathbb E [\phi(\xi(\theta), \theta)] + \beta(\theta) \end{align*}
  • Take an initial value \(\theta_0\) and a step size sequence satisfying the Robbins-Monro conditions, and do the \(\theta\) update

    At each update step, we have \(\theta_n\); we sample \(\xi_n(\theta_n)\), evaluate the feedback function \(\phi\), and perform the update with step size \(\epsilon_n\).

3.2.2. Markovian Problem

For each fixed value of control parameter \(\theta\), the underlying process \(\{\xi_n(\theta)\}\) is a Markov chain.

And the vector field \(G\) is meant to capture the long-term behaviour of the model,

\begin{align*} G(\theta) = \lim_{N \to \infty} \frac 1 N \mathbb E \left[\sum_{n=1}^N g(\xi_n(\theta), \theta) \right] = \int g(\xi, \theta) \pi_\theta(d\xi) \end{align*}

3.3. Sample Average Approach (SAA)

aka Stochastic counterpart, or Sample path optimization

is a technique where we convert the stochastic problem to a deterministic one by replacing \(J\) with \(\hat J\), the average over \(n\) i.i.d. samples.

Then we take the derivative of \(\hat J\) and proceed with a deterministic iterative algorithm.

This only works if the feedback function \(L(\theta, \xi_i)\) has an analytical form, and when we have access to all the data at once, i.e. it is not streaming.
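A minimal SAA sketch (the loss and data are illustrative): draw all the samples up front, form the deterministic average \(\hat J\), and hand it (and its gradient) to a deterministic optimizer.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
xi = rng.normal(loc=3.0, scale=1.0, size=1000)       # all samples available at once (not streaming)

# L(theta, xi_i) = (theta - xi_i)^2 has an analytical form, so hat J and its gradient do too
hat_J      = lambda th: np.mean((th - xi) ** 2)
grad_hat_J = lambda th: np.array([2.0 * np.mean(th - xi)])

res = minimize(hat_J, x0=np.array([0.0]), jac=grad_hat_J)
print(res.x)     # minimizer of hat J: the sample mean of xi, approximately 3.0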

4. Static Model

Feedback = Actual gradient + Martingale noise + Bias

\(Y_n = G(\theta_n) + \delta M_n + \beta_n\)

  • Noise is Martingale noise

    This is a broader and more useful condition than requiring noise to be independent and identically distributed (i.i.d.)

  • Bias tends to zero.

Convergence results:

  • Under the Robbins-Monro conditions, SA converges to the stable point of the ODE \(dx/dt = G(x)\)
  • TODO: Theorem 4.2
  • Under a constant step size, SA hovers near the true optimum

    As \(\epsilon \to 0\) the trajectory of SA converges in distribution to the deterministic trajectory of the ODE.

Application:

  • Kiefer-Wolfowitz (finite differences): To estimate the gradient

    This has a bias, so the difference radius \(c_n\) must go to zero as \(n \to \infty\) to ensure that the bias goes to zero.

  • Clipping and Truncation: Clip the gradient to bound the variance. Truncate \(\theta\) to a bounding box to prevent it from wandering off.

    When the variance of the gradient noise is too large, SA can become unstable or wander off.

    Theorem 4.1 requires that the variance of the feedback term, \(V_n\), is bounded such that \(\sum \epsilon^2_n V_n < \infty\),

    but this can't always be guaranteed. E.g. in Example 3.4 (Linear Regression) the feedback depends on the current iterate \(\theta_n\), and if \(\theta\) wanders off, the variance can't be controlled. Clipping and truncation enforce these bounds.

    All this works if the field \(G\) is coercive.

    Theorem 4.2 formalizes this idea.
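A minimal sketch of both safeguards on a noisy SA update (the mean field, the thresholds, and the heavy-tailed noise model are illustrative): clip the feedback to bound its magnitude, then truncate \(\theta\) back into a bounding box.

import numpy as np

rng = np.random.default_rng(3)
G = lambda th: -2.0 * (th - 1.0)                 # true mean field, stable point at theta = 1

box_lo, box_hi, clip_level = -10.0, 10.0, 5.0
theta = 8.0
for n in range(1, 2001):
    eps_n = 1.0 / n
    Y = G(theta) + rng.standard_t(df=3)          # heavy-tailed noise: the raw feedback can be large
    Y = np.clip(Y, -clip_level, clip_level)      # clipping: bound the feedback term
    theta = np.clip(theta + eps_n * Y, box_lo, box_hi)   # truncation: keep theta in a bounding box
print(theta)    # approaches the stable point theta = 1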

5. Definitions

  • Descent direction: Any vector \(d(\theta)\) such that \(\nabla J(\theta) d(\theta) < 0\) [# 23]

6. Theorems

Lemma 1.1 Multiplying negative gradient direction by any positive definite matrix still gives a descent direction. [# 23]

Theorem 1.7 [# 34] The Lagrange multiplier can be viewed as the rate of change of the optimal cost as the level of the constraint changes.

If we take an optimization problem with just the equality constraint \(h(\theta) = u\), and let the Lagrangian be:

\begin{align*} \mathcal L(\theta, \eta) = J(\theta) + \eta h(\theta) \end{align*}

Then the rate of change of the optimal cost when the equality constraint level changes is given by \(\eta\): for constraint value \(u\), if \(\theta^*(u)\) is the optimal point and \(\eta(u)\) is the value of the Lagrange multiplier, then

\begin{align*} \nabla_u J(\theta^*(u)) = -\eta(u) \end{align*}
