Gradients Explode - ResNet Explained
Paper: Gradients Explode - Deep Networks Are Shallow - ResNet Explained [pdf][ICLR 2018]
Introduces the gradient scale coefficient (GSC) to measure gradient explosion.
Instead of looking at just the raw norm of the gradient, the GSC is the ratio of the gradient norm to the forward activation norm.
This removes confounders like network scaling and layer width.
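A minimal numpy sketch (my own toy construction, using a proxy rather than the paper's exact GSC definition) of tracking a GSC-style quantity, the gradient norm scaled by the forward activation norm, through a plain ReLU MLP with manual backprop:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 64

# plain ReLU MLP with He-style init; manual forward pass storing activations
Ws = [rng.normal(0, np.sqrt(2.0 / width), (width, width)) for _ in range(depth)]
h = rng.normal(size=width)
acts = [h]
for W in Ws:
    h = np.maximum(W @ h, 0.0)
    acts.append(h)

# backprop a unit cost gradient; at each layer record the gradient norm
# scaled by that layer's forward activation norm (a GSC-style quantity)
g = np.ones(width)
scaled = []
for i in range(depth, 0, -1):
    scaled.append(np.linalg.norm(g) * np.linalg.norm(acts[i]))
    g = Ws[i - 1].T @ (g * (acts[i] > 0))   # ReLU + linear backward step
scaled.append(np.linalg.norm(g) * np.linalg.norm(acts[0]))

# normalize by the output layer's value so overall rescaling cancels
gsc = [s / scaled[0] for s in scaled]
```

Under this proxy, gradient explosion would show up as growth of `gsc[l]` as `l` moves from the output toward the input, independent of any overall rescaling of the network.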
Found that exploding gradients are present in many MLP architectures even when they use normalization or skip connections.
Sometimes normalization may even worsen the exploding gradient problem.
- Defines the collapsing domain problem: activations of different datapoints become so similar that the non-linearity can be approximated by a linear layer, so deep networks effectively become shallow ones. They call this pseudo-linearity, which drastically destroys the network's ability to learn complex representations.
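A toy numpy sketch (my own construction, not the paper's setup) that operationalizes pseudo-linearity: when a dominant bias pushes all pre-activations to one side of the ReLU kink, every unit has the same sign pattern across the whole batch, so each layer acts affinely on that batch:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width, batch = 10, 64, 256

# small weights plus a constant positive bias, so the bias
# dominates every pre-activation (a deliberately collapsed domain)
H = rng.normal(size=(batch, width))
frac_linear = []
for _ in range(depth):
    W = rng.normal(0, 0.1 / np.sqrt(width), (width, width))
    pre = H @ W.T + 1.0          # bias-dominated pre-activations
    H = np.maximum(pre, 0.0)
    # a unit is pseudo-linear on this batch if its pre-activation sign
    # is identical for every datapoint: the ReLU then acts affinely on it
    same_sign = np.all(pre > 0, axis=0) | np.all(pre <= 0, axis=0)
    frac_linear.append(float(same_sign.mean()))
```

With the bias dominating, `frac_linear` sits near 1.0 at every layer: the non-linearity is effectively never exercised on this batch.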
They define what it means to dilute a non-linearity, and show that diluting a non-linearity by \(k\) reduces the gradient by a factor of \(k^2 + 1\).
k-diluted: if we view a function as composed of a linear component and a non-linear component, then the function is k-diluted if the linear component is \(k\) times larger than the non-linear component in quadratic expectation. This effectively means that for large \(k\), the function is well approximated by a linear function.
Formally [#], a function \(f\) is k-diluted with respect to a random vector \(v\) if there exist a matrix \(S\) and a function \(\rho\) such that
\begin{align*} f(v) = Sv + \rho(v) &&\textrm{and}&& \frac {\mathbb Q_v ||Sv||_2} {\mathbb Q_v ||\rho(v)||_2} = k \end{align*}where \(\mathbb Q\) is the quadratic expectation, \(\mathbb Q[X] = \mathbb E [X^2]^{1/2}\).
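A hedged numpy sketch (my own toy, not the paper's estimator) of measuring the dilution level \(k\) of a residual block \(f(v) = v + \rho(v)\), i.e. \(S = I\), by Monte Carlo estimation of the two quadratic expectations:

```python
import numpy as np

rng = np.random.default_rng(0)
width, n = 64, 10000

def qnorm(x):
    # quadratic expectation of ||x||_2 over the sample axis:
    # Q||x|| = sqrt(E[||x||^2])
    return np.sqrt(np.mean(np.linalg.norm(x, axis=1) ** 2))

# residual block f(v) = v + rho(v) with a deliberately small
# non-linear branch (small weights => heavily diluted)
W = rng.normal(0, 0.1 / np.sqrt(width), (width, width))
V = rng.normal(size=(n, width))
rho = np.maximum(V @ W.T, 0.0)

k = qnorm(V) / qnorm(rho)   # estimated dilution level of this block
```

Here the identity skip path carries far more signal energy than the ReLU branch, so the estimated `k` comes out well above 1.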
- Skip connections dilute the non-linearity and thus reduce the exploding gradient problem: since diluting a non-linearity by \(k\) reduces the gradient by a factor of \(k^2 + 1\), a small reduction in representational capacity achieves a surprisingly large gradient reduction and thus allows better training.
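A back-of-envelope heuristic (my own, not the paper's proof) for where the \(k^2 + 1\) factor comes from: if the linear and non-linear components are uncorrelated under \(v\), their squared quadratic expectations add, and by the definition of k-dilution \(\mathbb Q_v ||Sv||_2^2 = k^2 \, \mathbb Q_v ||\rho(v)||_2^2\), so
\begin{align*} \mathbb Q_v ||f(v)||_2^2 \approx \mathbb Q_v ||Sv||_2^2 + \mathbb Q_v ||\rho(v)||_2^2 = (k^2 + 1)\, \mathbb Q_v ||\rho(v)||_2^2 \end{align*}i.e. the non-linear branch carries only a \(1/(k^2+1)\) fraction of the signal energy, which matches the factor by which its gradient contribution is damped.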