Gradient Estimation on Quantum Hardware
Rough notes from paper "Estimating the gradient and higher-order derivatives on quantum hardware" [pdf][arXiv]
This paper does the following:
- Formulas to estimate derivatives using parameter shifts
- The MSE error of those derivative estimators
- Using those estimators to implement quantum optimizers
- E.g. Hessian estimator and Newton optimizer
Flow:
- Quantum computer measures some expectation values
  - These are combined classically to get a cost function
  - E.g. the cost could be the expectation value of the measured observable
- Optimizers require derivatives of the expectation values wrt the variational parameters
- The analytic gradient of a quantum circuit can be experimentally evaluated using the parameter-shift rule.
Classical parameters \(\theta = (\theta_1, ..., \theta_m)\) produce a family of operators, i.e. for a particular value of the parameters we have a quantum circuit that represents a unitary operation \(U(\theta)\). We apply the circuit to the state \(|0 \rangle\) and measure the observable \(M\), giving:
\begin{align*} f(\theta) = \langle 0 | U(\theta)^\dagger M U(\theta) | 0 \rangle \end{align*}
As a use case, the observable \(M\) could be the Hamiltonian of a system, and \(f(\theta)\) would be the energy, which we minimize to find the ground state of the system.
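As a concrete toy instance (my own, not from the paper): a single-qubit circuit \(U(\theta) = RY(\theta)\) applied to \(|0\rangle\) with \(M = Z\), for which \(f(\theta) = \cos\theta\):

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)  # observable M = Pauli-Z

def ry(theta):
    """Rotation about Y: RY(theta) = exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def f(theta, M=Z):
    """f(theta) = <0| U(theta)^dag M U(theta) |0> with U = RY."""
    psi = ry(theta) @ np.array([1, 0], dtype=complex)  # U(theta)|0>
    return np.real(psi.conj() @ M @ psi)               # = cos(theta)
```

This toy `f` has exactly the sinusoidal parameter dependence the next section relies on.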
1. Gradient
An arbitrary operator can be represented as
\begin{align*} \hat K(\theta_j) = \hat A + \hat B \cos(\theta_j) + \hat C \sin(\theta_j) \end{align*}
Thus, its derivative can be expressed as:
\begin{align*} \frac {\partial} {\partial \theta_j} \hat K (\theta_j) = \frac {\hat K(\theta_j + s) - \hat K(\theta_j - s)} {2 \sin (s)} \end{align*}
This is not a finite difference; it is an exact relation.
Using a value such as \(s = \pi / 2\) gives a specific parameter-shift rule for the derivative.
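A minimal sketch of the rule applied to a toy cost with the required sinusoidal structure (the cost function and names are my own):

```python
import numpy as np

def param_shift(f, theta, j, s=np.pi / 2):
    """Exact derivative df/dtheta_j via the parameter-shift rule."""
    e = np.zeros_like(theta)
    e[j] = s
    return (f(theta + e) - f(theta - e)) / (2 * np.sin(s))

# Toy cost where each parameter enters sinusoidally (assumption of the rule):
def cost(theta):
    return np.cos(theta[0]) * np.cos(theta[1])

theta = np.array([0.3, 1.1])
g0 = param_shift(cost, theta, j=0)  # analytic value: -sin(0.3) * cos(1.1)
```

Because the relation is exact, any shift \(s\) with \(\sin(s) \neq 0\) gives the same answer (up to floating point), not just \(s = \pi/2\).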
2. Hessian and higher order derivatives
The Hessian and other higher-order derivatives follow similarly by applying the same method repeatedly. Specifically, for a \(d\)-th order derivative we evaluate \(f\) at \(2^d\) shifted points:
\begin{align*} g_{j_1, ..., j_d}(\theta) = \frac 1 {2^d} \sum_k P(k) f(\theta + k) \end{align*}
where the sum is over all possible choices of signs in \(k = \frac \pi 2 ( \pm e_{j_1} \pm ... \pm e_{j_d})\) and \(P(k)\) gives the sign parity of \(k\).
See the Hessian formula on pg. 3 for an example.
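A sketch of the \(2^d\)-point formula for mixed derivatives (my implementation; for a general shift \(s\) the prefactor becomes \(1/(2\sin s)^d\), which reduces to \(1/2^d\) at \(s = \pi/2\)):

```python
import numpy as np
from itertools import product

def deriv_shift(f, theta, js, s=np.pi / 2):
    """d-th order mixed derivative of f wrt the indices in js,
    using 2^d parameter-shifted evaluations of f."""
    d = len(js)
    total = 0.0
    for signs in product([+1, -1], repeat=d):
        k = np.zeros_like(theta, dtype=float)
        for j, sg in zip(js, signs):
            k[j] += sg * s
        total += np.prod(signs) * f(theta + k)  # np.prod(signs) = parity P(k)
    return total / (2 * np.sin(s)) ** d

def cost(theta):
    return np.cos(theta[0]) * np.cos(theta[1])

theta = np.array([0.4, 0.9])
h01 = deriv_shift(cost, theta, js=(0, 1))  # Hessian element d^2f/dt0 dt1
```

Repeating an index (e.g. `js=(0, 0)`) gives the corresponding diagonal second derivative with the same four evaluations.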
3. Stochastic estimate of Gradient
We can't measure \(f(\theta)\) exactly; instead we have a noisy estimate, with mean-zero noise of variance \(\sigma^2 = \frac {\sigma_0^2} N\) depending on the number of shots \(N\):
\begin{align*} \hat f (\theta) = f(\theta) + \hat \epsilon \end{align*}
If we use this noisy \(\hat f(\theta)\) instead of \(f(\theta)\) in our gradient formula, we get an analytic estimator for the derivative. [pg 5]
Error in the estimate is caused by two terms:
Bias
\begin{align*} Bias(\hat g) = E(\hat g) - g \end{align*}
Variance
\begin{align*} Var(\hat g) = E(\hat g^2) - E(\hat g)^2 \end{align*}
These two lead to an increase in the MSE of the estimator:
\begin{align*} \Delta (\hat g) = E[(\hat g - g)^2] = Var(\hat g) + Bias(\hat g)^2 \end{align*}
3.1. MSE of Parameter-shift gradient estimator
The bias for this estimator is zero, and the variance is \(Var(\hat g_j^{(s)}) \approx \frac {\sigma_0^2} {2 N \sin(s)^2}\)
So the MSE of estimator is only due to statistical noise, and is equal to the variance.
For a maximum shift of \(s = \pi / 2\) the MSE is \(\Delta (\hat g_j) \approx \frac { \sigma_0^2 } {2 N}\)
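A Monte Carlo sanity check of this claim, modeling shot noise as Gaussian with variance \(\sigma_0^2/N\) (a toy model of my own, not a hardware simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma0, N, s = 1.0, 100, np.pi / 2
theta, true_grad = 0.3, -np.sin(0.3)  # toy cost f(theta) = cos(theta)

def noisy_f(t):
    """Noisy expectation value: shot noise modeled as Gaussian."""
    return np.cos(t) + rng.normal(0.0, sigma0 / np.sqrt(N))

trials = 200_000
est = np.array([(noisy_f(theta + s) - noisy_f(theta - s)) / (2 * np.sin(s))
                for _ in range(trials)])
mse = np.mean((est - true_grad) ** 2)
predicted = sigma0 ** 2 / (2 * N * np.sin(s) ** 2)  # = sigma0^2 / (2N) at s = pi/2
```

The empirical `mse` matches `predicted` to within Monte Carlo error, and the sample mean of `est` matches `true_grad`, illustrating that the estimator is unbiased.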
One variation on this is the scaled parameter-shift estimator, \(\hat g_j^{(\lambda, s)} = \lambda \hat g_j^{(s)}\).
This estimator is biased and has \(\lambda^2\) times the variance. So for \(\lambda < 1\) we get less variance, but the estimator is biased. However, with the optimal choice \(\lambda = \lambda^\star\), the MSE becomes
\begin{align*} \Delta (\hat g_j^{(\lambda^\star, s)}) = \lambda^\star Var(\hat g_j^{(s)}) \end{align*}
Since \(\lambda^\star \le 1\), this is always at least as accurate as the simple parameter shift. The optimal \(\lambda^\star\) is a constant that can be absorbed into the learning rate and tuned as a hyperparameter in an ML algorithm.
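The optimal scaling follows from minimizing \(MSE(\lambda) = \lambda^2 Var + (\lambda - 1)^2 g^2\). The sketch below (my own derivation, with hypothetical numbers for \(g\) and \(Var\)) checks numerically that the minimum sits at \(\lambda^\star = g^2/(g^2 + Var)\), where the MSE equals \(\lambda^\star Var\):

```python
import numpy as np

g, var = 0.8, 0.3  # hypothetical true gradient and estimator variance

def mse(lam):
    """MSE of the scaled estimator: variance term + squared bias term."""
    return lam ** 2 * var + (lam - 1) ** 2 * g ** 2

lam_star = g ** 2 / (g ** 2 + var)     # claimed optimum (my derivation)
lams = np.linspace(0, 2, 10001)
best = lams[np.argmin(mse(lams))]      # numeric minimizer on a grid
```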
3.2. MSE of Finite difference
Finite-difference estimators give a biased estimate. However, the parameter-shift gradient estimator is only valid for circuits built from unitary operators of the form above, while finite-difference estimators are valid for more general circuits.
For central difference formula, the MSE of the estimator is proportional to \(N^{-2/3}\) for optimal choice of step size \(h\).
For forward difference formula, the MSE of the estimator is proportional to \(N^{-1/2}\) for optimal choice of step size \(h\).
For small step size \(h \to 0\) the finite difference estimator is approximately equal to parameter shift estimator with \(s = h\).
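A quick numeric illustration of this last point, using \(f = \cos\) as a stand-in cost (my choice, not from the paper): the only difference between the two estimators is \(2h\) versus \(2\sin(h)\) in the denominator, which vanishes as \(h \to 0\).

```python
import numpy as np

f = np.cos
t, h = 0.3, 1e-3
central = (f(t + h) - f(t - h)) / (2 * h)          # central finite difference
shift = (f(t + h) - f(t - h)) / (2 * np.sin(h))    # parameter shift with s = h
# Both approximate f'(t) = -sin(t); the shift rule is exact for sinusoidal f.
```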
4. Optimizers
With the gradient estimates we can implement various kinds of optimizers:
Gradient descent optimizer
Requires the gradient
Newton optimizer
Requires the inverse of hessian
Diagonal Newton optimizer
Replaces the Hessian with just its diagonal. Conveniently, the same parameter shifts used to compute the gradient also give the diagonal of the Hessian.
Quantum natural gradient optimizer
It uses the Fubini-Study metric tensor in place of the Hessian.
Stabilizing second order methods:
Second-order methods look like
\begin{align*} \theta^{t} = \theta^{t-1} - \eta A^{-1} \nabla f(\theta^{t-1}) \end{align*}
If \(A\) is not positive definite, the optimizer can converge to a maximum instead of a minimum. And if an eigenvalue of \(A\) is near zero, the optimizer gets unstable.
To stabilize them, regularization techniques can be used:
- replace \(A\) with \(A' = A + \epsilon \mathbb{1}\), which shifts all eigenvalues by \(\epsilon\)
- or clip each eigenvalue to \(\lambda_k' = \max (\lambda_k, \epsilon)\)
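Both regularization options can be sketched in a few lines of numpy (function names and parameter values are mine):

```python
import numpy as np

def shift_reg(A, eps):
    """Shift all eigenvalues by eps: A' = A + eps * I."""
    return A + eps * np.eye(A.shape[0])

def floor_reg(A, eps):
    """Clip each eigenvalue at eps: lambda_k' = max(lambda_k, eps).
    Assumes A is symmetric, as a Hessian is."""
    lam, V = np.linalg.eigh(A)
    return (V * np.maximum(lam, eps)) @ V.T

def newton_step(theta, grad, A, eta=0.1, eps=1e-2):
    """One stabilized second-order update: theta - eta * A'^{-1} grad."""
    return theta - eta * np.linalg.solve(floor_reg(A, eps), grad)
```

With the eigenvalue floor, a direction with negative or near-zero curvature gets a bounded, descent-directed step instead of blowing up or heading toward a maximum.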