Aditya Makkar
Kernels - Part 1
  1. Introduction
  2. Characterization of Reproducing Kernels
    1. Examples
    2. Equivalence between kernel function and reproducing kernel
  3. Mercer's Theorem
  4. Operations on Kernels
    1. Sums of kernels
    2. Products of kernels
    3. Other operations
  5. Representer theorem
  6. Epilogue

Introduction

This is the second post in my series on kernels. In the first part I introduced the functional analysis background for kernel theory; I highly recommend reading it before continuing, since I will frequently refer to it and use the same notation. In this post I aim to introduce fundamental results such as Mercer's theorem and the Representer theorem.

Characterization of Reproducing Kernels

We already defined the notion of reproducing kernels for an RKHS (Definition 20). We now turn our attention to obtaining necessary and sufficient conditions for a function \(K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) to be the reproducing kernel for some RKHS. But first we must define kernel functions.

Definition 21: Let \(A = (a_{i,j})\) be an \(n \times n\) complex matrix. Then \(A\) is called positive if for every \(\alpha_1, \ldots, \alpha_n \in \mathbb{C}\) we have
\[\begin{aligned} \sum_{i,j = 1}^{n} \overline{\alpha_i}\alpha_j a_{i,j} \geq 0.\end{aligned}\]
We denote this by \(A \geq 0.\)

Note that if we define a vector \(x \in \mathbb{C}^n\) such that its \(i^{\text{th}}\) component is \(\alpha_i\), then the condition above can be rewritten as

\[\begin{aligned} \left\langle Ax,x \right\rangle \ge 0.\end{aligned}\]

Also note that if \(A \geq 0\) then \(A = A^*\), where \(A^* = \overline{A^\top}\) denotes the conjugate transpose (Hermitian adjoint) of \(A\); a matrix satisfying \(A = A^*\) is called Hermitian or self-adjoint. Therefore, positivity gives self-adjointness for free when we are dealing with complex matrices. Things aren't so elegant for real matrices: in the real case we need to require explicitly that \(A\) is symmetric, in addition to the condition stated in the definition above. Therefore, we often use the following as the definition of positive matrices:

Definition 21': An \(n \times n\) matrix \(A\) is positive, in symbols \(A \geq 0\), if it is self-adjoint and if \(\left\langle Ax,x \right\rangle \ge 0\) for all \(x \in \mathbb{C}^n.\)

Positive matrices are also called positive semidefinite or nonnegative matrices.

The following lemma connects positivity of a matrix to its eigenvalues.

Lemma 2: A matrix \(A \geq 0\) if and only if \(A = A^*\) and every eigenvalue of \(A\) is nonnegative.

Let us first suppose \(A \geq 0\), then \(A = A^*\) by definition. Now if \(\lambda\) is an eigenvalue of \(A\) and \(v\) is an eigenvector of \(A\) corresponding to the eigenvalue \(\lambda\), we have

\[\begin{aligned} 0 \leq \left\langle Av,v \right\rangle = \left\langle \lambda v,v \right\rangle = \lambda \left\langle v,v \right\rangle.\end{aligned}\]
Since \(\left\langle v,v \right\rangle > 0\) (an eigenvector is nonzero by definition), \(\lambda\) is a nonnegative number.

For the other side, find an orthonormal basis \(v_1, \ldots, v_n\) consisting of eigenvectors of \(A\) (it exists by the spectral theorem). Let \(v_i\) be the eigenvector corresponding to the eigenvalue \(\lambda_i.\) Then for any \(x = \sum_{i=1}^n \alpha_i v_i\) we have \(\left\langle Ax,x \right\rangle = \sum_{i=1}^n \lambda_i \lvert \alpha_i\rvert^2.\) Since \(\lambda_i \geq 0\), \(\left\langle Ax,x \right\rangle \geq 0\) and \(A\) must be positive.
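
To make Lemma 2 concrete, here is a small numerical sketch (Python with NumPy, my own illustration, not part of the original text) of checking positivity by testing self-adjointness and the sign of the eigenvalues; the tolerance absorbs floating-point error.

```python
import numpy as np

def is_positive(A, tol=1e-10):
    """Check A >= 0: self-adjoint and all eigenvalues nonnegative (Lemma 2)."""
    A = np.asarray(A, dtype=complex)
    if not np.allclose(A, A.conj().T, atol=tol):   # A must equal A*
        return False
    # eigvalsh assumes a Hermitian matrix and returns real eigenvalues
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

# A Hermitian matrix with nonnegative eigenvalues ...
A = np.array([[2.0, 1.0], [1.0, 2.0]])
# ... and a self-adjoint matrix with a negative eigenvalue.
B = np.array([[1.0, 2.0], [2.0, 1.0]])
print(is_positive(A), is_positive(B))  # True False
```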

We are now ready to define a kernel function.

Definition 22: Let \(\mathcal{X}\) be a set, then \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) is called a kernel function if for every \(n \in \mathbb N\) and for every choice of \(\{x_1, \ldots, x_n\} \subseteq \mathcal{X}\), the matrix \((K(x_i, x_j)) \geq 0.\) We will use the notation \(K \geq 0\) to denote that the function \(K\) is a kernel function.

Kernel functions are alternatively called positive definite functions or positive semidefinite functions.

Definition 23: Given a kernel function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb C\) and points \(x_1, \ldots, x_n \in \mathcal{X}\), the \(n \times n\) matrix \((K(x_i, x_j))\), denoted \(\mathbf{K}\), is called the Gram matrix of \(K\) with respect to \(x_1, \ldots, x_n.\)

Some examples follow:

Examples

  1. Linear kernels: When \(\mathcal{X} = \mathbb R^d\), we can define the linear kernel function as

    \[\begin{aligned} K(x,y) = \left\langle x,y\right\rangle.\end{aligned}\]
    It is clearly a symmetric function of its arguments, and hence self-adjoint. To prove positivity, let \(x_1, \ldots, x_n \in \mathbb R^d\) be an arbitrary collection of points, and consider its Gram matrix \(\mathbf{K}\), i.e., \(\mathbf{K}_{i,j} = K(x_i, x_j) = \left\langle x_i,x_j\right\rangle.\) Then for any \(\alpha \in \mathbb R^n\), we have
    \[\begin{aligned} \left\langle \mathbf{K} \alpha, \alpha \right\rangle = \alpha^\top \mathbf{K}^\top \alpha = \alpha^\top \mathbf{K} \alpha = \sum_{i,j = 1}^n \alpha_i \alpha_j \left\langle x_i, x_j \right\rangle = \left\lVert \sum_{i=1}^n \alpha_i x_i \right\rVert^2 \geq 0.\end{aligned}\]

  2. Polynomial kernels: A natural generalization of the linear kernel on \(\mathbb R^d\) is the homogeneous polynomial kernel

    \[\begin{aligned} K(x, y) = (\left\langle x,y\right\rangle)^m\end{aligned}\]

    of degree \(m \geq 2\), also defined on \(\mathbb R^d.\) It is clearly a symmetric function. To prove positivity, note that

    \[\begin{aligned} K(x,y) = \left( \sum_{i=1}^d x_i y_i \right)^m.\end{aligned}\]

    This will have \(D = \binom{m+d-1}{m}\) monomials, so to simplify the analysis let's take \(m=2.\) Then

    \[\begin{aligned} K(x,y) = \sum_{i=1}^d x_i^2 y_i^2 + 2 \sum_{i < j} x_i x_j y_i y_j.\end{aligned}\]

    In this case \(D = \binom{d+1}{2} = d + \binom{d}{2}.\) Define a mapping \(\Phi\colon \mathbb R^d \to \mathbb R^D\) such that

    \[\begin{aligned} \Phi(x) = [x_1^2, \ldots, x_d^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2} x_{d-1} x_d ]^\top.\end{aligned}\]

    Then

    \[\begin{aligned} K(x,y) = \left\langle \Phi(x),\Phi(y)\right\rangle.\end{aligned}\]

    Following the same argument as in the first example, we can verify that the Gram matrix thus formed is indeed positive.

    The mapping \(x \mapsto \Phi(x)\) is often referred to as a feature map. We see that dealing with elements in the feature space, i.e., the codomain of \(\Phi\), is computationally expensive. The relation \(K(x,y) = \left\langle \Phi(x),\Phi(y)\right\rangle\) allows us to compute inner products using the kernel function instead of actually taking the inner product in a very high dimensional space. We will see that this "kernel trick" holds for many kernel functions when we discuss Mercer's theorem.

  3. Gaussian kernels: Given some compact subset \(\mathcal{X} \subseteq \mathbb R^d\) and \(\sigma > 0\), the Gaussian kernel is

    \[\begin{aligned} K(x,y) = \exp{\left( -\frac{1}{2 \sigma^2} \left\lVert x-y\right\rVert^2 \right)}.\end{aligned}\]

    It is not obvious why this is a kernel function. One way to see it is to write \(K(x,y) = e^{-\lVert x \rVert^2 / 2\sigma^2} \, e^{\left\langle x,y\right\rangle / \sigma^2} \, e^{-\lVert y \rVert^2 / 2\sigma^2}\) and appeal to the closure properties discussed in the section on operations on kernels below; a quick numerical sanity check for all three examples follows.
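
To complement these three examples, the following sketch (Python/NumPy; an empirical sanity check on random points of my own choosing, not a proof) builds the Gram matrices of the linear, degree-2 polynomial, and Gaussian kernels and inspects their smallest eigenvalues, and also verifies the feature-map identity \(K(x,y) = \left\langle \Phi(x),\Phi(y)\right\rangle\) for the polynomial kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))  # n = 8 points in R^d with d = 3

def linear_kernel(x, y):
    return x @ y

def poly2_kernel(x, y):
    return (x @ y) ** 2

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram(kernel, X):
    """Gram matrix (K(x_i, x_j))_{i,j} of a kernel with respect to the rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def phi2(x):
    """Feature map for the degree-2 homogeneous polynomial kernel on R^d."""
    d = len(x)
    squares = [x[i] ** 2 for i in range(d)]
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.array(squares + cross)

for name, k in [("linear", linear_kernel), ("poly2", poly2_kernel), ("gaussian", gaussian_kernel)]:
    K = gram(k, X)
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())  # ~>= 0 up to rounding

# Kernel trick for the polynomial example: K(x, y) == <Phi(x), Phi(y)>.
x, y = X[0], X[1]
print(np.isclose(poly2_kernel(x, y), phi2(x) @ phi2(y)))  # True
```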

Equivalence between kernel function and reproducing kernel

Let us return to the characterization of reproducing kernels. We will now prove that a function is a kernel function if and only if there is an RKHS for which it is the reproducing kernel. At this point recall Theorem-3 which states that an RKHS admits a unique reproducing kernel.

Theorem 4: Let \(\mathcal{X}\) be a set and let \(\mathcal{H}\) be an RKHS on \(\mathcal{X}\) with reproducing kernel \(K.\) Then \(K\) is a kernel function.
Fix an arbitrary \(n \in \mathbb N\), an arbitrary collection \(x_1, \ldots, x_n \in \mathcal{X}\), and \(\alpha \in \mathbb{C}^n.\) Then if we denote by \(\mathbf{K}\) the Gram matrix of \(K\) with respect to \(x_1, \ldots, x_n\), we have
\[\begin{aligned} \left\langle \mathbf{K} \alpha,\alpha \right\rangle = \sum_{i,j = 1}^n \overline{\alpha_i}\alpha_j K(x_i, x_j) = \sum_{i,j = 1}^n \overline{\alpha_i}\alpha_j \left\langle k_{x_j},k_{x_i} \right\rangle = \left\lVert \sum_{i=1}^n \alpha_i k_{x_i} \right\rVert^2 \geq 0.\end{aligned}\]
And thus \(K\) is a kernel function.

What does it mean in the above proof if we have an equality? That is, if \(\left\langle \mathbf{K} \alpha,\alpha \right\rangle = 0\)? This happens if and only if \(\left\lVert \sum_{i=1}^n \alpha_i k_{x_i} \right\rVert = 0.\) But this means that for every \(f \in \mathcal{H}\) we have \(\sum_{i=1}^n \overline{\alpha_i} f(x_i) = \left\langle f,\sum_i \alpha_i k_{x_i} \right\rangle = 0.\) Thus, in this case there is an equation of linear dependence between the values of every function in \(\mathcal{H}\) at this finite set of points.

Now let us state the converse of Theorem-4. It is a deep result in RKHS theory known as the Moore–Aronszajn theorem.

Theorem 5 [Moore–Aronszajn theorem]: Let \(\mathcal{X}\) be a set and let \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) be a kernel function, then there exists a reproducing kernel Hilbert space \(\mathcal{H}\) of functions on \(\mathcal{X}\) such that \(K\) is the reproducing kernel of \(\mathcal{H}.\)

For a proof see here on Wikipedia.

In light of these two theorems we have the following notation.

Definition 24: Given a kernel function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\), we let \(\mathcal{H}_K\) denote the unique RKHS with the reproducing kernel \(K.\)

It is not an easy problem to start with a kernel function \(K\) on some set \(\mathcal{X}\) and give a concrete description of \(\mathcal{H}_K.\) I will not discuss this here, but there are plenty of resources where this problem is discussed; I recommend Paulsen and Raghupathi's An Introduction to the Theory of Reproducing Kernel Hilbert Spaces.

Mercer's Theorem

Recall the spectral theorem for finite-dimensional vector spaces: a linear operator \(T\colon V \to V\) on a finite-dimensional complex vector space \(V\) is normal, i.e., \(T T^* = T^* T\), if and only if \(V\) has an orthonormal basis consisting of eigenvectors of \(T.\) In matrix terms, if \(\mathbf{U} = [v_1, \ldots, v_n]\) is the unitary matrix whose \(i^{\text{th}}\) column is the \(i^{\text{th}}\) eigenvector and \(\mathbf{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)\) is the diagonal matrix of corresponding eigenvalues, then a normal matrix \(\mathbf{K}\) can be written as

\[\begin{aligned} \mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^* = \sum_{i=1}^n \lambda_i v_i v_i^*.\end{aligned}\]

Mercer's theorem generalizes this decomposition to kernel functions. Let us start by defining a special type of kernel function.

Definition 25: Let \(\mathcal{X}\) be a compact metric space. A function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) is called a Mercer kernel if it is a continuous kernel function.

Recall the space \(L^2(\mu)\) from Definition-8, where we now take \(\mathcal{X}\) (written as \(X\) there) to be a compact metric space.

Definition 26: Given a Mercer kernel \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\), we define a linear operator \(T_{K} : L^2(\mu) \to L^2(\mu)\) as
\[\begin{aligned} T_K(f)(x) := \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y), \quad x \in \mathcal{X}.\end{aligned}\]

We assume that the Mercer kernel satisfies the Hilbert-Schmidt condition, stated as

\[\begin{aligned} \int_{\mathcal{X} \times \mathcal{X}} \left\lvert K(x,y) \right\rvert^2 \,\mathrm{d}\mu(x) \,\mathrm{d}\mu(y) < \infty,\end{aligned}\]
which ensures that \(T_K\) is a bounded linear operator on \(L^2(\mu).\)

Indeed, we have

\[\begin{aligned} \lVert T_K(f) \rVert^{2} = \int_{\mathcal{X}} \left\lvert \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y) \right\rvert^2 \,\mathrm{d}\mu(x) \leq \left\lVert f \right\rVert^2 \int_{\mathcal{X} \times \mathcal{X}} \left\lvert K(x,y) \right\rvert^2 \,\mathrm{d}\mu(x) \,\mathrm{d}\mu(y),\end{aligned}\]
where we have applied the Schwarz inequality (Theorem-1) as follows
\[\begin{aligned} \left\lvert \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y) \right\rvert^2 = \left\lvert T_K(f)(x) \right\rvert^2 = \left\lvert \left\langle K(x, \cdot),f \right\rangle \right\rvert^2 \leq \left\lVert K(x, \cdot) \right\rVert^2 \left\lVert f \right\rVert^2.\end{aligned}\]

Operators of this type are known as Hilbert-Schmidt operators.
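
As a rough illustration of how \(T_K\) acts, the sketch below (my own example, assuming \(\mathcal{X} = [0,1]\) with \(\mu\) the Lebesgue measure and a Gaussian kernel) discretizes the integral by a Riemann sum on a grid, so that applying \(T_K\) becomes a matrix-vector product.

```python
import numpy as np

# Approximate T_K(f)(x) = \int_0^1 K(x, y) f(y) dy on X = [0, 1] by a Riemann sum.
n = 500
grid = np.linspace(0.0, 1.0, n)
w = 1.0 / n  # quadrature weight of each grid point

def K(x, y, sigma=0.2):
    # A Mercer kernel on [0, 1]: the Gaussian kernel evaluated on the grid.
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

f = np.sin(2 * np.pi * grid)            # a test function, sampled on the grid
Kmat = K(grid[:, None], grid[None, :])  # (K(x_i, y_j)) on the grid
Tf = w * Kmat @ f                       # (T_K f)(x_i) ~ sum_j K(x_i, y_j) f(y_j) / n

print(Tf[:5])  # values of T_K(f) at the first few grid points
```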

We are now ready to state Mercer's theorem.

Theorem 6 [Mercer's theorem]: Suppose that \(\mathcal{X}\) is a compact metric space, and \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) is a Mercer kernel that satisfies the Hilbert-Schmidt condition. Then there exists an at most countable set of eigenfunctions \((e_{i})_{i}\) of \(T_K\) that forms an orthonormal basis of \(L^2(\mu)\), and a corresponding set of nonnegative eigenvalues \((\lambda_{i})_{i}\) such that
\[\begin{aligned} T_K(e_i) = \lambda_i e_i, \quad i \in \mathbb N.\end{aligned}\]
Moreover, \(K\) has the expansion
\[\begin{aligned} K(x,y) = \sum_{i} \lambda_i e_i(x) e_i(y), \quad x,y \in \mathcal{X},\end{aligned}\]
where the convergence of the series above holds absolutely and uniformly.

I'll skip the proof.

Among other things, Mercer's theorem provides a framework for embedding an element of \(\mathcal{X}\) into an element of \(\ell^2(\mathbb N)\) (for its definition, see Example-4 in the section on Hilbert spaces in the previous part). More concretely, given the eigenfunctions and eigenvalues guaranteed by Mercer's theorem, we may define a mapping \(\Phi : \mathcal{X} \to \ell^2(\mathbb N)\) as follows

\[\begin{aligned} \mathcal{X} \ni x \mapsto \Phi(x) := \left( \sqrt{\lambda_i} e_i(x) \right)_{i \in \mathbb N} \in \ell^2(\mathbb N).\end{aligned}\]
Therefore, we have
\[\begin{aligned} \left\langle \Phi(x), \Phi(y) \right\rangle = \sum_{i=1}^{\infty} \lambda_i e_i(x) e_i(y) = K(x,y).\end{aligned}\]

This is the well-known "kernel trick". Let us connect Mercer's theorem to the spectral theorem.

Let \(\mathcal{X} = [d] := \{1, 2, \ldots, d\}\) along with the Hamming metric be our compact metric space. Let \(\mu(\{i\}) = 1\) for all \(i \in [d]\) be the counting measure on \(\mathcal{X}.\) Any function \(f : \mathcal{X} \to \mathbb{C}\) is equivalent to the \(d\)-dimensional vector \([f(1), \ldots, f(d)]\), and any kernel function \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) is continuous, satisfies the Hilbert-Schmidt condition and is equivalent to a \(d \times d\) positive (and hence normal) matrix \(\mathbf{K}\) where \(\mathbf{K}_{i,j} = K(i, j).\) The Hilbert-Schmidt operator reduces to

\[\begin{aligned} T_K(f)(x) = \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y) = \sum_{y=1}^{d} K(x,y) f(y).\end{aligned}\]
Mercer's theorem then states that there exists a set of eigenfunctions \(v_1, \ldots, v_d\) (for our \(\mathcal{X}\) they are equivalent to vectors) and the corresponding eigenvalues \(\lambda_1, \ldots, \lambda_d\) such that
\[\begin{aligned} \mathbf{K} = \sum_{i=1}^{d} \lambda_i v_i v_i^*,\end{aligned}\]
which is exactly the spectral theorem.
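
The finite case is easy to play with numerically. The sketch below (Python/NumPy, my own illustration) eigendecomposes the Gram matrix of a Gaussian kernel on six points, reconstructs it as \(\sum_i \lambda_i v_i v_i^*\), and checks the kernel-trick identity \(\left\langle \Phi(x),\Phi(y)\right\rangle = K(x,y)\) with \(\Phi(x) = (\sqrt{\lambda_i}\, v_i(x))_i\).

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.standard_normal((6, 2))  # identify X = [6] with 6 points in R^2

# Gram matrix of the Gaussian kernel on X: a positive (hence normal) matrix.
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# Spectral decomposition K = sum_i lambda_i v_i v_i^* (eigh: Hermitian eigensolver).
lam, V = np.linalg.eigh(K)             # columns of V are orthonormal eigenvectors
K_rebuilt = sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in range(len(lam)))
print(np.allclose(K, K_rebuilt))       # True

# "Mercer feature map" Phi(x) = (sqrt(lambda_i) v_i(x))_i; here x is an index in [6].
lam = np.clip(lam, 0.0, None)          # clip tiny negative rounding errors
Phi = V * np.sqrt(lam)                 # row x of Phi is the feature vector of x
print(np.allclose(Phi @ Phi.T, K))     # <Phi(x), Phi(y)> = K(x, y)
```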

Operations on Kernels

Let us now consider how various algebraic operations on kernels affect the corresponding Hilbert spaces. All this (and a lot more) can be found in the seminal paper by Aronszajn "Theory of Reproducing Kernels".

I state the following theorems without proof to illustrate how operations on kernels are done.

Sums of kernels

Theorem 7: Suppose that \(\mathcal{H}_1\) and \(\mathcal{H}_2\) are both RKHSs of functions on the same set \(\mathcal{X}\), with kernels \(K_1\) and \(K_2\), respectively. Then the space
\[\begin{aligned} \mathcal{H} = \mathcal{H}_1 + \mathcal{H}_2 := \{f_1 + f_2 \, : \, f_1 \in \mathcal{H}_1 \text{ and } f_2 \in \mathcal{H}_2 \}\end{aligned}\]
with the norm
\[\begin{aligned} \lVert f \rVert^{2}_{\mathcal{H}} := \inf \left\{ \lVert f_1 \rVert^{2}_{\mathcal{H}_1} + \lVert f_2 \rVert^{2}_{\mathcal{H}_2} \, : \, f = f_1 + f_2,\, f_1 \in \mathcal{H}_1,\, f_2 \in \mathcal{H}_2 \right\}\end{aligned}\]
is an RKHS with the kernel \(K = K_1 + K_2.\)

Products of kernels

Let us first define the notion of tensor product of two (separable) Hilbert spaces \(\mathcal{H}_1\) and \(\mathcal{H}_2\) of functions, say with domains \(\mathcal{X}_1\) and \(\mathcal{X}_2.\)

Definition 27: Consider the set of functions \(h : \mathcal{X}_1 \times \mathcal{X}_2 \to \mathbb{C}\) of the form
\[\begin{aligned} \mathcal{H} = \left\{ h = \sum_{i=1}^{n} u_i v_i \, : \, n \in \mathbb N \text{ and } u_i \in \mathcal{H}_1,\, v_i \in \mathcal{H}_2 \text{ for all } i \in [n] \right\},\end{aligned}\]
where \(u_i v_i\) denotes the function \((x_1, x_2) \mapsto u_i(x_1) v_i(x_2).\)
We define an inner product on \(\mathcal{H}\) as follows: for \(h = \sum_{i=1}^{n} u_i v_i\) and \(g = \sum_{j=1}^{m} w_j x_j\) in \(\mathcal{H}\) define
\[\begin{aligned} \left\langle h,g \right\rangle := \sum_{i=1}^{n} \sum_{j=1}^{m} \left\langle u_i,w_j \right\rangle_{\mathcal{H}_1} \left\langle v_i,x_j \right\rangle_{\mathcal{H}_2}.\end{aligned}\]
Then \(\mathcal{H}\) (more precisely, its completion with respect to this inner product) is a Hilbert space and is called the tensor product of \(\mathcal{H}_1\) and \(\mathcal{H}_2.\) We denote it by \(\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2.\)

We can now state the theorem for product of kernels.

Theorem 8: Suppose that \(\mathcal{H}_1\) and \(\mathcal{H}_2\) are RKHSs of functions with domains \(\mathcal{X}_1\) and \(\mathcal{X}_2\), and equipped with kernels \(K_1\) and \(K_2\), respectively. Then the tensor product space \(\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2\) is an RKHS of functions with domain \(\mathcal{X}_1 \times \mathcal{X}_2\), and with kernel function \(K : (\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2) \to \mathbb{C}\) defined by
\[\begin{aligned} K((x,s), (y,t)) := K_1(x,y) K_2(s,t).\end{aligned}\]
\(K\) is called the tensor product of the kernels \(K_1\) and \(K_2\), and denoted by \(K = K_1 \otimes K_2.\)

Other operations

We can similarly define more operations on kernels:

  1. If \(K\) is a valid kernel and \(\alpha \geq 0\), then \(\alpha K\) is a valid kernel.

  2. If \(K\) is a valid kernel and \(\alpha \geq 0\), then \(K + \alpha\) is a valid kernel.

  3. Combining these results, if \(K\) is a valid kernel then any linear combination of kernels with nonnegative coefficients is a valid kernel, and more generally so is \(P \circ K\) for any polynomial \(P\) with nonnegative coefficients.

  4. If \(K\) is a valid kernel, then \(\exp(K)\), defined pointwise as \(\exp(K(x,y))\), is a valid kernel. A numerical check of these closure properties is sketched below.
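
Here is a short numerical check of these closure properties (Python/NumPy; illustrative evidence on random points, not a proof). Products of kernels on the same domain correspond to entrywise (Schur) products of Gram matrices, and \(\exp(K)\) corresponds to the entrywise exponential.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((7, 3))

def min_eig(M):
    return np.linalg.eigvalsh(M).min()

# Two Gram matrices on the same points: linear and Gaussian kernels.
K1 = X @ X.T
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq / 2.0)

checks = {
    "sum K1 + K2": K1 + K2,                           # sums of kernels
    "scaling 3 * K1": 3.0 * K1,                       # alpha * K, alpha >= 0
    "shift K1 + 2": K1 + 2.0,                         # K + alpha, alpha >= 0
    "product K1 * K2 (entrywise)": K1 * K2,           # product of kernels (Schur product)
    "polynomial 1 + K1 + K1**2": 1.0 + K1 + K1 ** 2,  # P with nonnegative coefficients
    "exp(K1) (entrywise)": np.exp(K1),                # exponential of a kernel
}
for name, M in checks.items():
    print(f"{name:30s} min eigenvalue = {min_eig(M):+.2e}")  # all ~>= 0
```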

Representer theorem

We are now at a stage where we can put all this theory to use in machine learning. Specifically, we will develop the Representer theorem, which allows many optimization problems over an RKHS to be reduced to relatively simple calculations involving the Gram matrix.

Let us start with a functional analytic viewpoint of supervised learning. Suppose we are given empirical data

\[\begin{aligned} (x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal Y,\end{aligned}\]
where \(\mathcal{X}\) is a nonempty set, and for now \(\mathcal Y = \mathbb R.\) The outputs come from an unknown function \(g : \mathcal{X} \to \mathbb R\), i.e., we assume
\[\begin{aligned} y_i = g(x_i), \quad i \in [n].\end{aligned}\]

We need to find some function \(f^*\) which "best" approximates \(g.\) A natural way to formalize the notion of "best" is to limit ourselves to an RKHS \(\mathcal{H}\) which contains functions of the form \(f : \mathcal{X} \to \mathbb R\) and choose

\[\begin{aligned} f^* = \argmin_{f \in \mathcal{H}} \left\lVert f \right\rVert \text{ such that } f(x_i) = y_i \text{ for } i \in [n].\end{aligned}\]

This optimization problem is feasible whenever there exists at least one function \(f \in \mathcal{H}\) that fits the data exactly. Denote by \(y\) the vector \([y_1, \ldots, y_n]^\top.\) It can be shown that if \(\mathbf{K}\) is the Gram matrix of \(K\) with respect to \(x_1, \ldots, x_n\), then feasibility is equivalent to \(y \in \text{range}(\mathbf{K}).\) As we will see, this is a special case of the Representer theorem.
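
A hedged sketch of this (Python/NumPy, with a Gaussian kernel and data of my own choosing): solving \(\mathbf{K}\alpha = y\) gives the coefficients of an interpolant of the form \(f^* = \sum_{i} \alpha_i k_{x_i}\), and the residual of that linear system is exactly the feasibility check.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = rng.uniform(-3, 3, size=(10, 1))
y_train = np.sin(x_train).ravel()

def gaussian_gram(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_gram(x_train, x_train)

# Feasibility: y must lie in range(K). With a Gaussian kernel on distinct points K is
# typically invertible, so we solve K alpha = y (pseudo-inverse used for safety).
alpha = np.linalg.pinv(K) @ y_train
print("interpolation residual:", np.linalg.norm(K @ alpha - y_train))  # ~0

# The interpolant f*(x) = sum_i alpha_i K(x, x_i), evaluated on a few new points.
x_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_star = gaussian_gram(x_test, x_train) @ alpha
print(f_star)
```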

In a realistic setting we assume that we have noisy observations, i.e.,

\[\begin{aligned} y_i = g(x_i) + \varepsilon_i, \quad i \in [n],\end{aligned}\]
where \(\varepsilon_i\)'s denote the noise. Then the constraint of exact fitting is no longer desirable, and we model the "best" approximation by introducing a loss function which represents how close our approximation is to the observed outputs. More concretely, let \(L_y : \mathbb R^n \to \mathbb R\) be a continuous function. Then we can define our cost function as
\[\begin{aligned} J(f) = \lVert f \rVert^2 + L_y(f(x_1), \ldots, f(x_n)).\end{aligned}\]

Theorem 9 [Representer theorem]: If \(f^*\) is a function such that \[\begin{aligned} J(f^*) = \inf_{f \in \mathcal{H}} J(f),\end{aligned}\] then \(f^*\) lies in the span of the functions \(k_{x_1}, \ldots, k_{x_n}\), i.e.,
\[\begin{aligned} f^*(\cdot) = \sum_{i=1}^{n} \alpha_i k_{x_i}(\cdot) \quad \text{for some } \alpha_1, \ldots, \alpha_n \in \mathbb{C}.\end{aligned}\]

I'll skip the proof which can be found in Paulsen and Raghupathi's book An Introduction to the Theory of Reproducing Kernel Hilbert Spaces or in Schölkopf, Herbrich and Smola's paper A Generalized Representer Theorem.

As an example, \(L_y\) could be the squared loss \(L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^n (y_i - f(x_i))^2.\) If we assume \(L_y\) to be convex, then the minimizer of \(J\) exists and is unique. Recall that \(E \subseteq \mathbb R^k\) is a convex set if \(\lambda x + (1-\lambda) y \in E\) whenever \(x,y \in E\) and \(0 \le \lambda \le 1.\) A real-valued function \(L\) defined on a convex set \(E\) is a convex function if \(L(\lambda x + (1-\lambda) y) \leq \lambda L(x) + (1-\lambda) L(y)\) whenever \(x, y \in E\) and \(0 \le \lambda \le 1.\) Note that a convex function is continuous on the interior of its domain. Also check out the Alexandrov theorem and Rademacher’s theorem.

The Representer theorem allows us to reduce the infinite-dimensional problem of optimizing over an RKHS to an \(n\)-dimensional problem.
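
To see this reduction in action, here is a sketch (Python/NumPy, my own example, not from the references above) for the squared loss: substituting \(f = \sum_i \alpha_i k_{x_i}\) turns \(J\) into \(\alpha^\top \mathbf{K} \alpha + \lVert y - \mathbf{K}\alpha \rVert^2\), which is minimized by \(\alpha = (\mathbf{K} + I)^{-1} y\) (kernel ridge regression with unit regularization weight).

```python
import numpy as np

rng = np.random.default_rng(4)
x_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(x_train).ravel() + 0.1 * rng.standard_normal(30)  # noisy observations

def gaussian_gram(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_gram(x_train, x_train)
n = K.shape[0]

# Representer theorem: the minimizer of J lies in span{k_{x_1}, ..., k_{x_n}}, so J
# reduces to J(alpha) = alpha^T K alpha + ||y - K alpha||^2, minimized by:
alpha = np.linalg.solve(K + np.eye(n), y_train)

# Evaluate f*(x) = sum_i alpha_i K(x, x_i) on new points and compare with sin(x).
x_test = np.linspace(-3, 3, 7).reshape(-1, 1)
f_star = gaussian_gram(x_test, x_train) @ alpha
print(np.column_stack([x_test.ravel(), f_star, np.sin(x_test).ravel()]))
```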

Epilogue

I barely scratched the surface of reproducing kernel Hilbert space theory. Some resources that do a much better job than I do at explaining this theory, and which I used as references, are:

  1. Paulsen V and Raghupathi M: An Introduction to the Theory of Reproducing Kernel Hilbert Spaces

  2. Wainwright M: High-Dimensional Statistics

  3. Schölkopf B, Herbrich R and Smola A.J: A Generalized Representer Theorem (COLT 2001)

  4. Aronszajn N: Theory of Reproducing Kernels (1950)

In the next article I will discuss kernel approximation methods. In particular I will focus on random Fourier features, developed by Rahimi and Recht, which I am using in my work.