2026-03-18

Residual Connection

Table of Contents

1. How do residual connections help in learning?

References:

  1. Residual Networks Behave Like Ensembles of Relatively Shallow Networks [pdf][arXiv]
  2. Visualizing the Loss Landscape of Neural Nets [pdf][arXiv]
  3. Gradients Explode - Deep Networks are Shallow - Residual explained [pdf][ICLR 2018]

    residuals_machanisms.png

    Figure 1: Mechanisms that make residual connections effective for learning in deep neural networks. The mathematical formulation of diluting non-linearity is not shown here. Generated using NotebookLM.

1.1. Identity Baseline

Instead of having to learn the full mapping \(y(x)\) directly, the residual branch only needs to learn the correction \(F(x) = y(x) - x\); the block then outputs \(F(x) + x\).

At initialization, when the residual weights are small, each block is close to the identity function, providing a stable starting point for SGD.

Optimization Target: Instead of discovering a complex function from scratch, the optimizer only needs to learn the correction (residual) to the current state.
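The identity baseline can be sketched in a few lines of numpy (an illustrative toy, not from the referenced papers): with near-zero residual weights, the block output starts out approximately equal to \(x\), and training only has to move the correction term \(F(x)\).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Residual branch F: a tiny two-layer ReLU MLP with near-zero initial weights
# (the 0.01 scale is an arbitrary choice for illustration).
W1 = 0.01 * rng.standard_normal((d, d))
W2 = 0.01 * rng.standard_normal((d, d))

def F(x):
    return np.maximum(0.0, x @ W1) @ W2

def residual_block(x):
    return F(x) + x  # output = learned correction + identity

x = rng.standard_normal(d)
y = residual_block(x)
print(np.linalg.norm(y - x))  # small: the block starts near the identity
```

Because the output differs from \(x\) only by \(F(x)\), the optimizer's job reduces to shaping this small correction rather than rediscovering the whole mapping.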

1.2. Gradient Dynamics and Conditioning

In plain networks, backpropagation involves repeated Jacobian multiplications, leading to vanishing or exploding gradients.

In \(y = F(x) + x\), the gradient \(\frac {\partial y}{\partial x} = \frac {\partial F}{\partial x} + I\) ensures a constant signal passes through, regardless of the complexity of \(F\). This means the layer Jacobian stays close to identity when the residual weights are small.

With the layer Jacobians close to the identity, the network is better conditioned, which allows larger learning rates and more stable updates without the risk of divergence.
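A minimal numpy sketch of this effect (depth, width, and weight scale are arbitrary choices of mine): multiplying fifty small random Jacobians drives the plain-network gradient toward zero, while the residual Jacobians \(W + I\) keep its norm of order one.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth, scale = 16, 50, 0.02  # small residual-branch weights (my choice)

# Per-layer Jacobians: a plain layer contributes W, a residual layer W + I.
Ws = [scale * rng.standard_normal((d, d)) for _ in range(depth)]

J_plain, J_res = np.eye(d), np.eye(d)
for W in Ws:
    J_plain = W @ J_plain            # product of small Jacobians: vanishes
    J_res = (W + np.eye(d)) @ J_res  # identity term keeps the signal alive

print(np.linalg.norm(J_plain))  # vanishingly small
print(np.linalg.norm(J_res))    # order one
```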

1.3. Ensemble Effect

A network with \(n\) blocks contains \(2^n\) possible paths from input to output. Most gradient signal during training is carried by shorter paths. Deep paths contribute less to the total gradient, effectively making the model an ensemble of relatively shallow networks.

Residual blocks are weakly dependent, functioning more like additive refinements than sequential transformations. Removing a single block during inference causes minimal performance degradation (unlike plain networks where every layer is a critical dependency). This phenomenon is described in the paper "Residual Networks Behave Like Ensembles of Relatively Shallow Networks".
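For residual branches that are simply linear maps \(F_i(x) = W_i x\) (a simplifying assumption of mine, not the paper's setup), the path expansion can be checked numerically: the product \(\prod_i (I + W_i)\) equals the sum over all \(2^n\) subsets of branches.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, n = 3, 4
Ws = [0.3 * rng.standard_normal((d, d)) for _ in range(n)]

# Full network: y = (I + W_n) ... (I + W_1) x
full = np.eye(d)
for W in Ws:
    full = (np.eye(d) + W) @ full

# Sum over all 2^n paths: each path takes a subset of residual branches
# and passes through the identity everywhere else.
paths = np.zeros((d, d))
for r in range(n + 1):
    for subset in combinations(range(n), r):
        term = np.eye(d)
        for i in subset:  # chosen branches applied in layer order
            term = Ws[i] @ term
        paths += term

print(np.allclose(full, paths))  # the product expands into 2^n path terms
```

When the \(W_i\) are small, terms with more factors are smaller, which matches the observation that shallow paths carry most of the signal.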

1.4. Loss Landscape Geometry

The structural changes translate into a more "trainable" mathematical surface.

  • Flatness: Skip connections smooth the loss landscape, reducing the prevalence of sharp minima and chaotic curvatures.
  • Connectivity: Low-loss regions are better connected, helping minibatch SGD avoid getting trapped in poor local optima.
  • Convex-like Behavior: By reducing the effective nonlinearity per layer, the local curvature becomes less extreme, making first-order optimizers (SGD/Adam) behave more like second-order methods.

    skip-connection.png

    Figure 2: Smooth loss landscape when using skip connections. From the paper Visualizing the Loss Landscape of Neural Nets.

The training dynamics are also better for networks with skip connections. The non-chaotic landscapes are dominated by wide, nearly convex minimizers. If we visualize the training dynamics using PCA of the weight changes, then 40%-90% of the descent path lies along the first two principal components, i.e. the gradient descent path is very low dimensional. [Section 7 of the paper Visualizing the Loss Landscape of Neural Nets]
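As a toy illustration of this low-dimensionality (plain gradient descent on a quadratic bowl, not a neural network, and my own choice of curvatures), one can record the weight trajectory and measure how much variance its first two principal components capture:

```python
import numpy as np

rng = np.random.default_rng(3)
d, steps = 10, 200
eigs = np.logspace(0, 2, d)  # curvatures from 1 to 100 (arbitrary choice)
w = rng.standard_normal(d)
lr = 0.9 / eigs.max()

# Gradient descent on f(w) = 0.5 * sum_i eigs[i] * w[i]^2, recording weights.
traj = [w.copy()]
for _ in range(steps):
    w = w - lr * eigs * w
    traj.append(w.copy())
traj = np.array(traj)

# PCA of the weight trajectory via SVD of the centered path.
X = traj - traj.mean(axis=0)
s = np.linalg.svd(X, compute_uv=False)
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print(f"variance explained by first two PCs: {explained:.1%}")
```

The descent curve hugs the slowly-decaying directions, so a large fraction of its variance is captured by just two components.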

1.5. Diluting Non-linearity

Skip connections dilute the non-linearity and reduce the exploding gradient problem, since diluting a nonlinearity by \(k\) reduces the gradient by a factor of \(k^2 + 1\). A small reduction in representational capacity thus buys a large reduction in gradient magnitude and allows better training.

This concept is defined in the Gradients Explode paper.

k-diluted: If we view a function as composed of a linear component and a non-linear component, then the function is k-diluted if the linear component is \(k\) times larger than the non-linear component in quadratic expectation. For large \(k\), this effectively means the function is well approximated by a linear function.

Formally, a function \(f\) is k-diluted with respect to a random vector \(v\) if there exists a matrix \(S\) and a function \(\rho\) such that

\begin{align*} f(v) = Sv + \rho(v) &&\textrm{and}&& \frac {\mathbb Q_v ||Sv||_2} {\mathbb Q_v ||\rho(v)||_2} = k \end{align*}

where \(\mathbb Q\) is the quadratic expectation \(\mathbb Q[X] = \mathbb E[X^2]^{1/2}\).
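The definition can be probed with a Monte Carlo estimate of the quadratic expectations. The sketch below picks \(S = 5I\) and \(\rho = \tanh\) purely for illustration (neither comes from the paper); note the estimated \(k\) comes out larger than 5 because \(\tanh\) shrinks its input in quadratic norm.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 8, 100_000
v = rng.standard_normal((n, d))  # samples of the random vector v

def Q(x):
    """Quadratic expectation of ||x||_2: Q = E[||x||^2]^(1/2), sample estimate."""
    return np.sqrt(np.mean(np.sum(x ** 2, axis=1)))

S = 5.0 * np.eye(d)  # linear component (illustrative choice)
rho = np.tanh        # bounded non-linear component (illustrative choice)

# f(v) = S v + rho(v); its dilution is k = Q||Sv|| / Q||rho(v)||.
k_est = Q(v @ S.T) / Q(rho(v))
print(k_est)
```

For this \(f\), the linear part dominates by a factor of roughly eight in quadratic expectation, so \(f\) is already well approximated by \(v \mapsto Sv\).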


You can send your feedback and queries here