This is the second post in my series on kernels. In the first part I introduced the functional analysis background for kernel theory. I highly recommend you read it before continuing, since I will frequently refer to it and use the same notation. In this blog post I aim to introduce fundamental results such as Mercer's theorem and the Representer theorem.
We already defined the notion of reproducing kernels for an RKHS (Definition 20). We now turn our attention to obtaining necessary and sufficient conditions for a function \(K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) to be the reproducing kernel for some RKHS. But first we must define kernel functions.
Note that if we define a vector \(x \in \mathbb{C}^n\) whose \(i^{\text{th}}\) component is \(\alpha_i\), then the condition above can be rewritten as
Also note that if \(A \geq 0\) then \(A = A^*\), where \(A^* = \overline{A^\top}\) denotes the conjugate transpose (Hermitian adjoint) of \(A.\) Therefore, positivity gives the self-adjoint property for free when we are dealing with complex matrices. Things aren't so elegant for real matrices: in the real case we need to explicitly require that \(A\) is symmetric in addition to what's stated in the definition above. Therefore, we often use the following as the definition of positive matrices:
Positive matrices are also called positive semidefinite or nonnegative matrices.
The following lemma connects the positivity of a matrix to its eigenvalues.
First suppose \(A \geq 0\); then \(A = A^*\) by definition. Now if \(\lambda\) is an eigenvalue of \(A\) and \(v\) is an eigenvector of \(A\) corresponding to the eigenvalue \(\lambda\), we have
For the other side, find an orthonormal basis \(v_1, \ldots, v_n\) consisting of eigenvectors of \(A\) (it exists by the spectral theorem). Let \(v_i\) be the eigenvector corresponding to the eigenvalue \(\lambda_i.\) Then for any \(x = \sum_{i=1}^n \alpha_i v_i\) we have \(\left\langle Ax,x \right\rangle = \sum_{i=1}^n \lambda_i \lvert \alpha_i\rvert^2.\) Since \(\lambda_i \geq 0\), \(\left\langle Ax,x \right\rangle \geq 0\) and \(A\) must be positive.
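The lemma is easy to probe numerically. Below is a small NumPy sketch (an illustration of mine, not part of the proof) that builds a positive matrix as \(B^*B\) and checks that its eigenvalues are nonnegative and that \(\left\langle Ax,x \right\rangle \geq 0\) for a random vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Any matrix of the form B*B is positive: it is self-adjoint and
# <Ax, x> = <Bx, Bx> = ||Bx||^2 >= 0 for every x.
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B.conj().T @ B

# Its eigenvalues are real and (numerically) nonnegative.
print(np.all(np.linalg.eigvalsh(A) >= -1e-10))   # True

# <Ax, x> is real and nonnegative for an arbitrary vector x.
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
print(np.vdot(x, A @ x).real >= -1e-10)          # True
```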
We are now ready to define a kernel function.
Kernel functions are alternatively called positive definite functions or positive semidefinite functions.
Some examples follow:
Linear kernels: When \(\mathcal{X} = \mathbb R^d\), we can define the linear kernel function as
Polynomial kernels: A natural generalization of the linear kernel on \(\mathbb R^d\) is the homogeneous polynomial kernel
of degree \(m \geq 2\), also defined on \(\mathbb R^d.\) It is clearly a symmetric function. To prove positivity, note that
This will have \(D = \binom{m+d-1}{m}\) monomials, so to simplify the analysis let's take \(m=2.\) Then
In this case \(D = \binom{d+1}{2} = d + \binom{d}{2}.\) Define a mapping \(\Phi\colon \mathbb R^d \to \mathbb R^D\) such that
Then
Following the same argument as the first example, we can verify that the gram matrix thus formed is indeed positive.
The mapping \(x \mapsto \Phi(x)\) is often referred to as a feature map. We see that dealing with elements in the feature space, i.e., the codomain of \(\Phi\), is computationally expensive. The relation \(K(x,y) = \left\langle \Phi(x),\Phi(y)\right\rangle\) allows us to compute the inner products using the kernel function instead of actually taking the inner product in a very high-dimensional space. We will see that this "kernel trick" holds for many kernel functions when we discuss Mercer's theorem.
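To make the computational point concrete, here is a small NumPy sketch for the degree-2 homogeneous polynomial kernel on \(\mathbb R^d.\) For simplicity the feature map below uses all \(d^2\) ordered products \(x_i x_j\) rather than the \(D\)-dimensional map above (the two give the same inner products); it is only an illustration of the kernel trick.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x, y = rng.standard_normal(d), rng.standard_normal(d)

def phi(x):
    # Degree-2 feature map: all ordered products x_i * x_j (d**2 entries).
    # The D = d + C(d, 2)-dimensional map in the text merges duplicate cross
    # terms with a factor of sqrt(2); both yield the same inner products.
    return np.outer(x, x).ravel()

explicit = phi(x) @ phi(y)     # inner product in the feature space, O(d**2)
kernel = (x @ y) ** 2          # kernel evaluation, O(d)
print(np.isclose(explicit, kernel))  # True
```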
Gaussian kernels: Given some compact subset \(\mathcal{X} \subseteq \mathbb R^d\) and \(\sigma > 0\), the Gaussian kernel is
It is not obvious why this is a kernel function.
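While I defer the proof, the claim can at least be checked numerically: the sketch below builds the Gaussian Gram matrix for a handful of random points (using the common normalization \(\exp(-\lVert x - y\rVert^2 / 2\sigma^2)\)) and verifies that its eigenvalues are nonnegative. This is evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 20, 3, 1.0
X = rng.standard_normal((n, d))

# Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 * sigma**2)).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# All eigenvalues are (numerically) nonnegative, so the Gram matrix is positive.
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # True
```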
Let us return to the characterization of reproducing kernels. We will now prove that a function is a kernel function if and only if there is an RKHS for which it is the reproducing kernel. At this point recall Theorem-3 which states that an RKHS admits a unique reproducing kernel.
What does it mean in the above proof if we have an equality? That is, if \(\left\langle \mathbf{K} \alpha,\alpha \right\rangle = 0\)? This happens if and only if \(\left\lVert \sum_{i=1}^n \alpha_i k_{x_i} \right\rVert = 0.\) But this means that for every \(f \in \mathcal{H}\) we have \(\sum_{i=1}^n \overline{\alpha_i} f(x_i) = \left\langle f,\sum_i \alpha_i k_{x_i} \right\rangle = 0.\) Thus, in this case there is an equation of linear dependence between the values of every function in \(\mathcal{H}\) at this finite set of points.
Now let us state the converse of Theorem-4. It is a deep result in RKHS theory known as the Moore–Aronszajn theorem.
For a proof, see the Wikipedia article on the Moore–Aronszajn theorem.
In light of these two theorems we have the following notation.
It is not an easy problem to start with a kernel function \(K\) on some set \(\mathcal{X}\) and give a concrete description of \(\mathcal{H}_K.\) I will not be discussing this here, but there are plenty of resources available where you can see this problem getting discussed. I recommend Paulsen and Raghupathi's An Introduction to the Theory of Reproducing Kernel Hilbert Spaces.
Recall the spectral theorem for finite-dimensional vector spaces: a linear operator \(T\colon V \to V\) on a finite-dimensional complex vector space \(V\) is normal, i.e., \(T T^* = T^* T\), if and only if \(V\) has an orthonormal basis consisting of eigenvectors of \(T.\) This implies that if \(\mathbf{K}\) is a normal matrix, \(\mathbf{U} = [v_1, \ldots, v_n]\) is the unitary matrix whose \(i^{\text{th}}\) column is the eigenvector \(v_i\), and \(\mathbf{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)\) is the diagonal matrix of corresponding eigenvalues, then \(\mathbf{K}\) can be written as
Mercer's theorem generalizes this decomposition to kernel functions. Let us start by defining a special type of kernel function.
Recall the space \(L^2(\mu)\) from Definition-8, where we now take \(\mathcal{X}\) (written as \(X\) there) to be a compact metric space.
We assume that the Mercer kernel satisfies the Hilbert-Schmidt condition, stated as
Indeed, we have
Operators of this type are known as Hilbert-Schmidt operators.
We are now ready to state Mercer's theorem.
I'll skip the proof.
Among other things, Mercer's theorem provides a framework for embedding an element of \(\mathcal{X}\) into an element of \(\ell^2(\mathbb N)\) (for its definition, see Example-4 in the section on Hilbert spaces in the previous part). More concretely, given the eigenfunctions and eigenvalues guaranteed by Mercer's theorem, we may define a mapping \(\Phi : \mathcal{X} \to \ell^2(\mathbb N)\) as follows
This is the well-known "kernel trick". Let us connect Mercer's theorem to the spectral theorem.
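As a finite-dimensional preview of the example below, the following NumPy sketch eigendecomposes a Gram matrix \(\mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^*\) and checks that \(\mathbf{K}_{ij} = \sum_k \lambda_k u_{ik} \overline{u_{jk}}\), which is exactly the shape of Mercer's expansion; the linear kernel and the data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))

# Gram matrix of the linear kernel K(x, y) = <x, y> (real symmetric here).
K = X @ X.T

# Spectral decomposition K = U diag(lam) U^T.
lam, U = np.linalg.eigh(K)

# Mercer-style expansion: K_ij = sum_k lam_k * U_ik * U_jk.
K_rebuilt = sum(l * np.outer(u, u) for l, u in zip(lam, U.T))
print(np.allclose(K, K_rebuilt))    # True

# The embedding Phi(i) = (sqrt(lam_k) * U_ik)_k reproduces K via inner products.
Phi = U * np.sqrt(np.clip(lam, 0.0, None))
print(np.allclose(Phi @ Phi.T, K))  # True
```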
Let \(\mathcal{X} = [d] := \{1, 2, \ldots, d\}\) along with the Hamming metric be our compact metric space. Let \(\mu(\{i\}) = 1\) for all \(i \in [d]\) be the counting measure on \(\mathcal{X}.\) Any function \(f : \mathcal{X} \to \mathbb{C}\) is equivalent to the \(d\)-dimensional vector \([f(1), \ldots, f(d)]\), and any kernel function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) is continuous, satisfies the Hilbert-Schmidt condition and is equivalent to a \(d \times d\) normal matrix \(\mathbf{K}\) where \(\mathbf{K}_{i,j} = K(i, j).\) The Hilbert-Schmidt operator reduces to
Let us now consider how various algebraic operations on kernels affect the corresponding Hilbert spaces. All this (and a lot more) can be found in the seminal paper by Aronszajn "Theory of Reproducing Kernels".
I state the following theorems without proof to illustrate how operations on kernels are done.
Let us first define the notion of tensor product of two (separable) Hilbert spaces \(\mathcal{H}_1\) and \(\mathcal{H}_2\) of functions, say with domains \(\mathcal{X}_1\) and \(\mathcal{X}_2.\)
We can now state the theorem for product of kernels.
We can similarly define more operations on kernels:
If \(K\) is a valid kernel and \(\alpha \geq 0\), then \(\alpha K\) is a valid kernel.
If \(K\) is a valid kernel and \(\alpha \geq 0\), then \(K + \alpha\) is a valid kernel.
From these results we can easily see that any linear combination of valid kernels with nonnegative coefficients is again a valid kernel, and more generally that for any polynomial \(P\) with nonnegative coefficients the composition \(P \circ K\) is a valid kernel whenever \(K\) is (see the numerical sanity check after this list).
If \(K\) is a valid kernel, then \(\exp(K)\) is a valid kernel.
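These closure properties are easy to check numerically on Gram matrices. The sketch below is my own sanity check, not a proof: the product of kernels corresponds to the entrywise (Schur) product of Gram matrices, the constant shift corresponds to adding \(\alpha\) to every entry, and \(\exp(K)\) is applied entrywise.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_positive(M, tol=1e-9):
    # A symmetric matrix is positive iff all its eigenvalues are >= 0
    # (up to a tolerance scaled by the size of the entries).
    return np.all(np.linalg.eigvalsh(M) >= -tol * max(1.0, np.abs(M).max()))

X = rng.standard_normal((10, 3))
K1 = X @ X.T                                              # linear kernel
K2 = np.exp(-np.sum((X[:, None] - X[None]) ** 2, -1))     # Gaussian kernel

alpha = 2.5
print(is_positive(K1 + K2))         # sum of kernels
print(is_positive(alpha * K1))      # nonnegative scaling
print(is_positive(K1 + alpha))      # constant shift (entrywise)
print(is_positive(K1 * K2))         # product of kernels (Schur product)
print(is_positive(np.exp(K1)))      # exponential of a kernel (entrywise)
```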
We are now at a stage where we can put all this theory to use in machine learning. Specifically, we will develop the Representer theorem, which allows many optimization problems over the RKHS to be reduced to relatively simple calculations involving the gram matrix.
Let us start with a functional analytic viewpoint of supervised learning. Suppose we are given empirical data
We need to find some function \(f^*\) which "best" approximates \(g.\) A natural way to formalize the notion of "best" is to restrict ourselves to an RKHS \(\mathcal{H}\) consisting of functions \(f : \mathcal{X} \to \mathbb R\) and choose
This optimization problem is feasible whenever there exists at least one function \(f \in \mathcal{H}\) that fits the data exactly. Denote by \(y\) the vector \([y_1, \ldots, y_n]^\top.\) It can be shown that if \(\mathbf{K}\) is the gram matrix of the kernel \(K\) with respect to \(x_1, \ldots, x_n\), then feasibility is equivalent to \(y \in \text{range}(\mathbf{K}).\) As we will see, this is a special case of the representer theorem.
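Here is a hedged sketch of this feasibility check with a made-up Gaussian kernel and random data: we test \(y \in \text{range}(\mathbf{K})\) via a least-squares solve and, when it holds, read off an interpolant of the form \(f = \sum_i \alpha_i k_{x_i}.\)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))
y = rng.standard_normal(8)

def gaussian_kernel(A, B):
    return np.exp(-np.sum((A[:, None] - B[None]) ** 2, -1))

# y is in range(K) iff the least-squares solution of K alpha = y has zero residual.
K = gaussian_kernel(X, X)
alpha, *_ = np.linalg.lstsq(K, y, rcond=None)
print(np.allclose(K @ alpha, y))   # feasible: an exact fit exists

# The exact fit has the form f = sum_i alpha_i k_{x_i}; evaluating it at a new
# point only requires kernel evaluations against the sample points.
x_new = rng.standard_normal(2)
print(gaussian_kernel(x_new[None], X) @ alpha)
```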
In a realistic setting we assume that we have noisy observations, i.e.,
I'll skip the proof which can be found in Paulsen and Raghupathi's book An Introduction to the Theory of Reproducing Kernel Hilbert Spaces or in Schölkopf, Herbrich and Smola's paper A Generalized Representer Theorem.
As an example \(L_y\) could be the squared loss \(L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^n (y_i - f(x_i))^2.\) If we assume \(L_y\) to be convex then the solution to (1) exists and is unique. Recall that \(E \subseteq \mathbb R^k\) is a convex set if \(\lambda x + (1-\lambda) y \in E\) whenever \(x,y \in E\) and \(0 \le \lambda \le 1.\) A real-valued function \(L\) defined on a convex set \(E\) is a convex function if \(L(\lambda x + (1-\lambda) y) \leq \lambda L(x) + (1-\lambda) L(y)\) whenever \(x, y \in E\) and \(0 \le \lambda \le 1.\) Note that a convex function is continuous on the interior of its domain. Also check out Alexandrov theorem and Rademacher’s theorem.
The Representer theorem allows us to reduce the infinite-dimensional problem of optimizing over an RKHS to an \(n\)-dimensional problem.
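For a concrete instance, take the squared loss with a ridge penalty \(\lambda \lVert f\rVert_{\mathcal{H}}^2\): the Representer theorem says the minimizer has the form \(f^* = \sum_{i=1}^n \alpha_i k_{x_i}\), and in this particular formulation the coefficients solve the linear system \((\mathbf{K} + \lambda I)\alpha = y.\) A minimal sketch, assuming a Gaussian kernel and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum((A[:, None] - B[None]) ** 2, -1)
    return np.exp(-sq / (2 * sigma ** 2))

# Noisy one-dimensional training data (illustrative only).
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

# Minimize sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2 over the RKHS.
# By the Representer theorem f* = sum_i alpha_i K(x_i, .), and the
# coefficients solve (K + lam * I) alpha = y.
lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Evaluating f* at new points needs only kernel evaluations against the data.
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(gaussian_kernel(X_test, X) @ alpha)
```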
I barely scratched the surface of reproducing kernel Hilbert space theory. Some resources that do a much better job than I do of explaining this theory, and which I used as references, are:
Paulsen V and Raghupathi M: An Introduction to the Theory of Reproducing Kernel Hilbert Spaces
Wainwright M: High-Dimensional Statistics
Schölkopf B, Herbrich R and Smola A.J: A Generalized Representer Theorem (COLT 2001)
Aronszajn N: Theory of Reproducing Kernels (1950)
In the next article I will discuss kernel approximation methods. In particular I will focus on random Fourier features, developed by Rahimi and Recht, which I am using in my work.