Representer Theorem
Say we have a dataset of input-output pairs \(\{(x_i, y_i)\}\) and want to find, from among a space of functions, a function that maps inputs to outputs while minimizing an error function.
If the search space of functions is infinite, finding an optimal function is a daunting task. The representer theorem gives us a representation of the optimal function, i.e. it tells us what the optimal function looks like, and in doing so makes the job of finding it easier.
This theorem applies when we work with a special kind of function space: the space of functions generated by a kernel \(K(x_1, x_2)\). Such a space is called a Reproducing Kernel Hilbert Space (RKHS), and the optimal function in the RKHS is given as a linear combination of the kernel function centered at the points of our dataset:
\begin{align*} f^{\star}(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) \end{align*}To understand what this means, let's go through the details one by one.
1. Positive-Definite Kernel
Kernels in kernel methods (e.g. SVM, kernel regression) are positive-definite kernels. They are defined as follows:
Let \(X\) be a non-empty set, referred to as the index set. Then a function \(K: X \times X \to \mathbb R\) is a kernel if it
- is symmetric: \(K(x, y) = K(y, x)\)
- is positive definite:
For any finite subset \(\{x_1, x_2, ..., x_n\} \subset X\) and any coefficients \(c_1, ..., c_n \in \mathbb R\), \(\sum_{i,j=1}^{n} c_i c_j K(x_i, x_j) \geq 0\), i.e. the matrix \(\mathbf K_{ij} = K(x_i, x_j)\) is positive semi-definite.
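As a quick numerical sanity check, here is a minimal NumPy sketch (the kernel choice, the point set, and the tolerance are my own, purely illustrative) that builds a Gram matrix for a small set of points and verifies symmetry and non-negative eigenvalues:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel, a standard positive-definite kernel.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# A small finite subset {x_1, ..., x_n} of the index set X = R^2.
rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))

# Gram matrix K_ij = K(x_i, x_j).
K = np.array([[rbf_kernel(xi, xj) for xj in pts] for xi in pts])

# Symmetry and positive semi-definiteness (eigenvalues >= 0 up to
# floating-point tolerance).
print(np.allclose(K, K.T))                       # True
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # True
```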
2. Examples of Positive Definite Kernels
Some kernels defined on Euclidean Space \(\mathbb R^d\) are:
Linear Kernels
\(K(x, y) = x^T y\)
Polynomial Kernel
\(K(x,y) = (x^T y + c)^d\) for \(c \geq 0, d \geq 1\)
Gaussian (RBF) Kernel
\(K(x,y) = \exp(-\frac {||x - y||^2} {2 \sigma^2})\) for \(\sigma > 0\)
Sigmoid Kernel
\(K(x,y) = \tanh(\alpha x^T y + c)\) (strictly speaking, this kernel is not positive definite for all choices of \(\alpha\) and \(c\), though it is widely used in practice)
Laplacian Kernel
\(K(x, y) = e^{-\alpha ||x - y||}\) for \(\alpha > 0\)
Abel Kernel
\(K(x, y) = e^{-\alpha |x - y|}\) for \(\alpha > 0\)
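For concreteness, here is a short NumPy sketch of a few of these kernels (the function names and default parameter values are my own choices, not any standard API):

```python
import numpy as np

def linear(x, y):
    return x @ y

def polynomial(x, y, c=1.0, d=3):
    return (x @ y + c) ** d

def gaussian_rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplacian(x, y, alpha=1.0):
    return np.exp(-alpha * np.linalg.norm(x - y))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear, polynomial, gaussian_rbf, laplacian):
    print(k.__name__, k(x, y))
```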
3. Relation with RKHS
For each kernel function there exists a Hilbert space, i.e. a space of functions, in which \(K\) acts as the reproducing kernel: \(f(x) = \langle f, K(\cdot, x) \rangle\) for every \(f\) in the space (See Reproducing Kernel Hilbert Space). Each function in that space can be represented as a (possibly infinite) linear combination of kernel functions centered at points in the domain:
\begin{align*} f(\cdot) = \sum_{j=1}^N c_j K(\cdot, x_j) \end{align*}This property is guaranteed by the Moore-Aronszajn theorem, which says:
Suppose \(K\) is a symmetric, positive definite kernel on a set \(X\). Then there is a unique Hilbert space \(H_K\) of functions on \(X\) for which \(K\) is a reproducing kernel; such a Hilbert space is called a Reproducing Kernel Hilbert Space (RKHS).
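To make the RKHS picture concrete, the sketch below (my own illustration, not a library API) builds a function \(f = \sum_j c_j K(\cdot, x_j)\) from a finite set of centers and checks the reproducing property \(\langle f, K(\cdot, x_k) \rangle = f(x_k)\), using the fact that \(\langle K(\cdot, x_i), K(\cdot, x_j) \rangle = K(x_i, x_j)\):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Centers x_j and coefficients c_j defining f = sum_j c_j K(., x_j) in H_K.
centers = np.array([[0.0], [1.0], [2.5]])
c = np.array([0.7, -0.2, 1.1])

def f(x):
    # Pointwise evaluation: f(x) = sum_j c_j K(x, x_j).
    return sum(cj * rbf(x, xj) for cj, xj in zip(c, centers))

# Gram matrix of the centers; the RKHS inner product of K(., x_i) and
# K(., x_j) is K(x_i, x_j), so <f, K(., x_k)> = sum_j c_j K(x_j, x_k).
K = np.array([[rbf(xi, xj) for xj in centers] for xi in centers])
k = 1
inner = c @ K[:, k]                       # <f, K(., x_k)> in H_K
print(np.isclose(inner, f(centers[k])))   # True: the reproducing property
print(c @ K @ c)                          # squared RKHS norm ||f||^2
```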
4. Representer Theorem
It essentially says that for a given training dataset and kernel function \(K\), the optimal function (the one that minimizes an error over that dataset) among functions in \(H_K\) (the RKHS of \(K\)) can be represented as just a linear combination of the kernel \(K\) centered at the input points, i.e.
\begin{align*} f^{\star}(\cdot) = \sum_{i=1}^{n} \alpha_i K(\cdot, x_i) \end{align*}Thus the search for the optimal function over the infinite-dimensional space \(H_K\) is replaced by an optimization over a finite number of coefficients in that linear combination (denoted \(\alpha_i\)). This is the power of the representer theorem.
More formally, when we have
- a positive definite real valued kernel \(K: X \times X \to \mathbb R\) and the corresponding Reproducing Kernel Hilbert Space \(H_K\),
- some training samples \((x_1, y_1), ..., (x_n, y_n) \in X \times \mathbb R\)
- an arbitrary error function \(E(x_1, y_1, f(x_1), ..., x_n, y_n, f(x_n))\)
Then the minimizer of the regularized empirical risk:
\begin{align*} f^{\star} = \underset{f\in H_K}{\arg \min} E(x_1, y_1, f(x_1), ...) + g(||f||) \end{align*}(where \(g: [0, \infty) \to \mathbb R\) is a strictly increasing real-valued function)
can be represented in the following form:
\begin{align*} f^{\star}(\cdot) = \sum_{i=1}^{n} \alpha_i K(\cdot, x_i) \end{align*}The optimal values of \(\alpha_i \in \mathbb R\) can be found using optimization techniques appropriate to the form of the error function.
(For a proof see this.)
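To illustrate how the theorem turns the search over \(H_K\) into a finite-dimensional optimization, here is a sketch (my own toy example with a smooth log-cosh loss; the dataset and parameters are arbitrary) that optimizes directly over the coefficients \(\alpha\), using \(f(x_i) = (\mathbf K \alpha)_i\) and \(||f||_{H_K}^2 = \alpha^T \mathbf K \alpha\):

```python
import numpy as np
from scipy.optimize import minimize

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Toy dataset (x_i, y_i): noisy samples of sin(x).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
lam = 0.1

def objective(alpha):
    # E(...) + g(||f||): a smooth (log-cosh) loss plus lam * ||f||_H^2,
    # with f(x_i) = (K @ alpha)_i and ||f||_H^2 = alpha @ K @ alpha.
    residuals = y - K @ alpha
    return np.sum(np.log(np.cosh(residuals))) + lam * alpha @ K @ alpha

alpha_star = minimize(objective, x0=np.zeros(len(X))).x

def f_star(x):
    # f*(x) = sum_i alpha*_i K(x, x_i)
    return sum(a * rbf(x, xi) for a, xi in zip(alpha_star, X))

print(f_star(np.array([0.5])), np.sin(0.5))  # fitted vs. true value
```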
4.1. An Example with Least squares error function
The representer theorem is useful because it gives us a representation of the optimal function, which can then be manipulated analytically or numerically to actually obtain the optimal function.
If we take the error function as:
\begin{align*} E = \sum_{i=1}^n (y_i - f(x_i))^2 \end{align*}and the regularization function as \(g(x) = \lambda x^2\), we get the minimization problem:
\begin{align*} f^{\star} = \underset{f \in H} {\arg\min} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda ||f||_H^2 \end{align*}Then \(f^{\star}\) can be represented as
\begin{align*} f^{\star}(\cdot) = \sum_{i=1}^{n} \alpha^{\star}_i K(\cdot, x_i) \end{align*}and the optimal \(\alpha^{\star}\) is given by the following expression, which can be obtained numerically by a linear solve:
\begin{align*} \alpha^{\star} = (A^T A + \lambda A)^{-1} A^T y \end{align*}where \(A_{ij} = K(x_i, x_j)\). Since \(A\) is symmetric, this simplifies to \(\alpha^{\star} = (A + \lambda I)^{-1} y\) whenever \(A\) is invertible. (See this for derivation)
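A minimal NumPy sketch of this least-squares case (using the simplified form \(\alpha^{\star} = (A + \lambda I)^{-1} y\); the toy data and parameter values are my own):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Toy dataset (x_i, y_i): noisy samples of sin(x).
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

# Gram matrix A_ij = K(x_i, x_j) and regularization strength lambda.
A = np.array([[rbf(xi, xj) for xj in X] for xi in X])
lam = 0.1

# Linear solve for alpha*; since A is symmetric, (A^T A + lam A)^{-1} A^T y
# reduces to (A + lam I)^{-1} y when A is invertible.
alpha_star = np.linalg.solve(A + lam * np.eye(len(X)), y)

def f_star(x):
    # f*(x) = sum_i alpha*_i K(x, x_i)
    return sum(a * rbf(x, xi) for a, xi in zip(alpha_star, X))

print(f_star(np.array([1.0])), np.sin(1.0))  # prediction vs. ground truth
```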