Aditya Makkar
Kernels - Part 1
  1. Introduction
  2. Characterization of Reproducing Kernels
    1. Examples
    2. Equivalence between kernel function and reproducing kernel
  3. Mercer's Theorem
  4. Operations on Kernels
    1. Sums of kernels
    2. Products of kernels
    3. Other operations
  5. Representer theorem
  6. Epilogue

Introduction

This is the second blog post in my series on kernels. In the first part I introduced the functional analysis background for kernel theory; I highly recommend reading it before continuing, since I will frequently refer to it and use the same notation. In this post I aim to introduce fundamental results like Mercer's theorem and the Representer theorem.

Characterization of Reproducing Kernels

We already defined the notion of reproducing kernels for an RKHS (Definition 20). We now turn our attention to obtaining necessary and sufficient conditions for a function $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ to be the reproducing kernel for some RKHS. But first we must define kernel functions.

Definition 21: Let $A = (a_{i,j})$ be an $n \times n$ complex matrix. Then $A$ is called positive if for every $\alpha_1, \ldots, \alpha_n \in \mathbb{C}$ we have
\begin{aligned} \sum_{i,j = 1}^{n} \overline{\alpha_i}\alpha_j a_{i,j} \geq 0.\end{aligned}
We denote this by $A \geq 0$.

Note that if we define the vector $x \in \mathbb{C}^n$ whose $i^{\text{th}}$ component is $\alpha_i$, then the condition above can be rewritten as

\begin{aligned} \left\langle Ax,x \right\rangle \ge 0.\end{aligned}

Also note that if $A \geq 0$ then $A = A^*$, where $A^* = \overline{A^\top}$ denotes the conjugate transpose (the Hermitian adjoint) of $A$. Therefore, positivity gives self-adjointness for free if we are dealing with complex matrices. Things aren't so elegant for real matrices: in the real case we need to explicitly require that the matrix $A$ is symmetric in addition to the condition in the definition above. Therefore, we often use the following as the definition of positive matrices:

Definition 21': An $n \times n$ matrix $A$ is positive, in symbols $A \geq 0$, if it is self-adjoint and if $\left\langle Ax,x \right\rangle \ge 0$ for all $x \in \mathbb{C}^n$.

Positive matrices are also alternatively called positive semidefinite or nonnegative matrices.

The following lemma connects the positivity of a matrix to its eigenvalues.

Lemma 2: A matrix $A \geq 0$ if and only if $A = A^*$ and every eigenvalue of $A$ is nonnegative.

Let us first suppose $A \geq 0$; then $A = A^*$ by definition. Now if $\lambda$ is an eigenvalue of $A$ and $v$ is an eigenvector of $A$ corresponding to the eigenvalue $\lambda$, we have

\begin{aligned} 0 \leq \left\langle Av,v \right\rangle = \left\langle \lambda v,v \right\rangle = \lambda \left\langle v,v \right\rangle.\end{aligned}
Since $\left\langle v,v \right\rangle > 0$ (an eigenvector is nonzero by definition), $\lambda$ is nonnegative.

For the other side, find an orthonormal basis $v_1, \ldots, v_n$ consisting of eigenvectors of $A$ (it exists by the spectral theorem). Let $v_i$ be the eigenvector corresponding to the eigenvalue $\lambda_i$. Then for any $x = \sum_{i=1}^n \alpha_i v_i$ we have $\left\langle Ax,x \right\rangle = \sum_{i=1}^n \lambda_i \lvert \alpha_i\rvert^2$. Since $\lambda_i \geq 0$, $\left\langle Ax,x \right\rangle \geq 0$ and $A$ must be positive.
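
As a quick numerical sanity check of Lemma 2 (not a proof; the matrix size, random seed, and tolerance below are arbitrary choices, and NumPy is assumed), we can build a positive matrix as $B^*B$ and confirm that it is self-adjoint with nonnegative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

# Any matrix of the form B*B (conjugate transpose of B times B) is positive.
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B.conj().T @ B

# Self-adjointness: A equals its conjugate transpose.
assert np.allclose(A, A.conj().T)

# <Ax, x> >= 0 for a batch of random vectors x.
for _ in range(100):
    x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
    assert np.vdot(x, A @ x).real >= -1e-10

# Eigenvalues of the self-adjoint matrix A are real and, here, nonnegative.
print("eigenvalues:", np.round(np.linalg.eigvalsh(A), 3))
```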

We are now ready to define a kernel function.

Definition 22: Let $\mathcal{X}$ be a set. Then $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is called a kernel function if for every $n \in \mathbb N$ and for every choice of $\{x_1, \ldots, x_n\} \subseteq \mathcal{X}$, the matrix $(K(x_i, x_j)) \geq 0$. We will use the notation $K \geq 0$ to denote that the function $K$ is a kernel function.

Kernel functions are alternatively called positive definite functions or positive semidefinite functions.

Definition 23: Given a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb C$ and points $x_1, \ldots, x_n \in \mathcal{X}$, the $n \times n$ matrix $(K(x_i, x_j))$, denoted $\mathbf{K}$, is called the Gram matrix of $K$ with respect to $x_1, \ldots, x_n$.

Some examples follow:

Examples

  1. Linear kernels: When $\mathcal{X} = \mathbb R^d$, we can define the linear kernel function as

    \begin{aligned} K(x,y) = \left\langle x,y\right\rangle.\end{aligned}
    It is clearly a symmetric function of its arguments, and hence self-adjoint. To prove positivity, let $x_1, \ldots, x_n \in \mathbb R^d$ be an arbitrary collection of points, and consider their Gram matrix $\mathbf{K}$, i.e., $\mathbf{K}_{i,j} = K(x_i, x_j) = \left\langle x_i,x_j\right\rangle$. Then for any $\alpha \in \mathbb R^n$, we have
    \begin{aligned} \left\langle \mathbf{K} \alpha, \alpha \right\rangle = \alpha^\top \mathbf{K}^\top \alpha = \alpha^\top \mathbf{K} \alpha = \sum_{i,j = 1}^n \alpha_i \alpha_j \left\langle x_i, x_j \right\rangle = \left\lVert \sum_{i=1}^n \alpha_i x_i \right\rVert^2 \geq 0.\end{aligned}

  2. Polynomial kernels: A natural generalization of the linear kernel on $\mathbb R^d$ is the homogeneous polynomial kernel

    \begin{aligned} K(x, y) = (\left\langle x,y\right\rangle)^m\end{aligned}

    of degree $m \geq 2$, also defined on $\mathbb R^d$. It is clearly a symmetric function. To prove positivity, note that

    \begin{aligned} K(x,y) = \left( \sum_{i=1}^d x_i y_i \right)^m.\end{aligned}

    This will have $D = \binom{m+d-1}{m}$ monomials, so to simplify the analysis let's take $m=2$. Then

    \begin{aligned} K(x,y) = \sum_{i=1}^d x_i^2 y_i^2 + 2 \sum_{i < j} x_i x_j y_i y_j.\end{aligned}

    In this case $D = \binom{d+1}{2} = d + \binom{d}{2}$. Define a mapping $\Phi\colon \mathbb R^d \to \mathbb R^D$ such that

    \begin{aligned} \Phi(x) = [x_1^2, \ldots, x_d^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2} x_{d-1} x_d ]^\top.\end{aligned}

    Then

    \begin{aligned} K(x,y) = \left\langle \Phi(x),\Phi(y)\right\rangle.\end{aligned}

    Following the same argument as in the first example, we can verify that the Gram matrix thus formed is indeed positive.

    The mapping $x \mapsto \Phi(x)$ is often referred to as a feature map. We see that dealing with elements in the feature space, i.e., the codomain of $\Phi$, is computationally expensive. The relation $K(x,y) = \left\langle \Phi(x),\Phi(y)\right\rangle$ allows us to compute these inner products using the kernel function instead of actually taking the inner product in a very high-dimensional space. We will see that this "kernel trick" holds for many kernel functions when we discuss Mercer's theorem.

  3. Gaussian kernels: Given some compact subset $\mathcal{X} \subseteq \mathbb R^d$ and $\sigma > 0$, the Gaussian kernel is

    \begin{aligned} K(x,y) = \exp{\left( -\frac{1}{2 \sigma^2} \left\lVert x-y\right\rVert^2 \right)}.\end{aligned}

    It is not obvious why this is a kernel function; we will be better equipped to see this after discussing operations on kernels below. The sketch after this list checks all three examples numerically.
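
Here is that sketch, a numerical illustration only (the sample points are arbitrary and NumPy is assumed): it builds the Gram matrices of the three kernels above on a few random points, checks that their eigenvalues are nonnegative (Lemma 2), and verifies the degree-2 feature map $\Phi$ from the polynomial example.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))          # six arbitrary points in R^3

def linear_kernel(x, y):
    return x @ y

def poly_kernel(x, y, m=2):
    return (x @ y) ** m

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram(kernel, X):
    return np.array([[kernel(x, y) for y in X] for x in X])

for kernel in (linear_kernel, poly_kernel, gaussian_kernel):
    eigvals = np.linalg.eigvalsh(gram(kernel, X))   # Gram matrices are symmetric
    assert np.all(eigvals >= -1e-10), kernel.__name__
    print(f"{kernel.__name__}: smallest eigenvalue {eigvals.min():.2e}")

# Degree-2 feature map Phi(x) = (x_i^2, ..., sqrt(2) x_i x_j for i < j).
def phi(x):
    squares = x ** 2
    cross = np.array([np.sqrt(2) * x[i] * x[j]
                      for i, j in combinations(range(len(x)), 2)])
    return np.concatenate([squares, cross])

x, y = X[0], X[1]
assert np.isclose(phi(x) @ phi(y), poly_kernel(x, y))   # the "kernel trick" for m = 2
```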

Equivalence between kernel function and reproducing kernel

Let us return to the characterization of reproducing kernels. We will now prove that a function is a kernel function if and only if there is an RKHS for which it is the reproducing kernel. At this point recall Theorem-3 which states that an RKHS admits a unique reproducing kernel.

Theorem 4: Let $\mathcal{X}$ be a set and let $\mathcal{H}$ be an RKHS on $\mathcal{X}$ with reproducing kernel $K$. Then $K$ is a kernel function.
Fix an arbitrary $n \in \mathbb N$, a collection $x_1, \ldots, x_n \in \mathcal{X}$, and $\alpha \in \mathbb{C}^n$. If we denote by $\mathbf{K}$ the Gram matrix of $K$ with respect to $x_1, \ldots, x_n$, we have
\begin{aligned} \left\langle \mathbf{K} \alpha,\alpha \right\rangle = \sum_{i,j = 1}^n \overline{\alpha_i}\alpha_j K(x_i, x_j) = \sum_{i,j = 1}^n \overline{\alpha_i}\alpha_j \left\langle k_{x_j},k_{x_i} \right\rangle = \left\lVert \sum_{i=1}^n \alpha_i k_{x_i} \right\rVert^2 \geq 0.\end{aligned}
And thus $K$ is a kernel function.

What does it mean in the above proof if we have an equality? That is, if $\left\langle \mathbf{K} \alpha,\alpha \right\rangle = 0$? This happens if and only if $\left\lVert \sum_{i=1}^n \alpha_i k_{x_i} \right\rVert = 0$. But this means that for every $f \in \mathcal{H}$ we have $\sum_{i=1}^n \overline{\alpha_i} f(x_i) = \left\langle f,\sum_i \alpha_i k_{x_i} \right\rangle = 0$. Thus, in this case there is an equation of linear dependence between the values of every function in $\mathcal{H}$ at this finite set of points.
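
To make this equality case concrete, here is a small sketch (the Gaussian kernel and the deliberately repeated point are arbitrary choices; NumPy is assumed): a vector $\alpha$ in the null space of the Gram matrix produces exactly such a linear dependence for every function in the span of the kernel sections $k_{x_i}$.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Repeat a point so that the Gram matrix is singular.
X = np.array([[0.0], [1.0], [1.0], [2.5]])
K = np.array([[gaussian_kernel(x, y) for y in X] for x in X])

# alpha = e_2 - e_3 lies in the null space of K because columns 2 and 3 coincide.
alpha = np.array([0.0, 1.0, -1.0, 0.0])
assert np.allclose(K @ alpha, 0)

# For any f = sum_j beta_j k_{x_j} we have f(x_i) = (K beta)_i, and therefore
# sum_i alpha_i f(x_i) = alpha^T K beta = 0: a linear dependence among the values.
beta = np.random.default_rng(2).standard_normal(4)
print(np.dot(alpha, K @ beta))   # ~ 0 up to floating point error
```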

Now let us state the converse of Theorem-4. It is a deep result in RKHS theory known as the Moore–Aronszajn theorem.

Theorem 5 [Moore–Aronszajn theorem]: Let $\mathcal{X}$ be a set and let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ be a kernel function, then there exists a reproducing kernel Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ such that $K$ is the reproducing kernel of $\mathcal{H}$.

For a proof, see the Wikipedia article on the Moore–Aronszajn theorem.

In light of these two theorems we have the following notation.

Definition 24: Given a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$, we let $\mathcal{H}_K$ denote the unique RKHS with the reproducing kernel $K$.

It is not an easy problem to start with a kernel function $K$ on some set $\mathcal{X}$ and give a concrete description of $\mathcal{H}_K$. I will not be discussing this here, but there are plenty of resources where this problem is discussed; I recommend Paulsen and Raghupathi's An Introduction to the Theory of Reproducing Kernel Hilbert Spaces.

Mercer's Theorem

Recall the spectral theorem for finite-dimensional spaces: a linear operator $T\colon V \to V$ on a finite-dimensional complex inner product space $V$ is normal, i.e., $T T^* = T^* T$, if and only if $V$ has an orthonormal basis consisting of eigenvectors of $T$. In matrix terms, if $\mathbf{U} = [v_1, \ldots, v_n]$ is the unitary matrix containing the $i^{\text{th}}$ eigenvector in its $i^{\text{th}}$ column and $\mathbf{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix containing the corresponding eigenvalues, then a normal matrix $\mathbf{K}$ can be written as

\begin{aligned} \mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^* = \sum_{i=1}^n \lambda_i v_i v_i^*.\end{aligned}

Mercer's theorem generalizes this decomposition to kernel functions. Let us start by defining a special type of kernel function.

Definition 25: Let $\mathcal{X}$ be a compact metric space. A function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is called a Mercer kernel if it is a continuous kernel function.

Recall the space $L^2(\mu)$ from Definition-8, where we now take $\mathcal{X}$ (written as $X$ there) to be a compact metric space.

Definition 26: Given a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$, we define a linear operator $T_{K} : L^2(\mu) \to L^2(\mu)$ as
\begin{aligned} T_K(f)(x) := \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y), \quad x \in \mathcal{X}.\end{aligned}

We assume that the Mercer kernel satisfies the Hilbert-Schmidt condition, stated as

\begin{aligned} \int_{\mathcal{X} \times \mathcal{X}} \left\lvert K(x,y) \right\rvert^2 \,\mathrm{d}\mu(x) \,\mathrm{d}\mu(y) < \infty,\end{aligned}
which ensures that $T_K$ is a bounded linear operator on $L^2(\mu)$.

Indeed, we have

\begin{aligned} \lVert T_K(f) \rVert^{2} = \int_{\mathcal{X}} \left\lvert \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y) \right\rvert^2 \,\mathrm{d}\mu(x) \leq \left\lVert f \right\rVert^2 \int_{\mathcal{X} \times \mathcal{X}} \left\lvert K(x,y) \right\rvert^2 \,\mathrm{d}\mu(x) \,\mathrm{d}\mu(y),\end{aligned}
where we have applied Schwarz inequality (Theorem-1) as follows
\begin{aligned} \left\lvert \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y) \right\rvert^2 = \left\lvert T_K(f)(x) \right\rvert^2 = \left\lvert \left\langle K(x, \cdot),f \right\rangle \right\rvert^2 \leq \left\lVert K(x, \cdot) \right\rVert^2 \left\lVert f \right\rVert^2.\end{aligned}

Operators of this type are known as Hilbert-Schmidt operators.

We are now ready to state Mercer's theorem.

Theorem 6 [Mercer's theorem]: Suppose that $\mathcal{X}$ is a compact metric space, and $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is a Mercer kernel that satisfies the Hilbert-Schmidt condition. Then there exists an at most countable set of eigenfunctions $(e_{i})_{i}$ for $T_K$ that forms an orthonormal basis of $L^2(\mu)$, and a corresponding set of nonnegative eigenvalues $(\lambda_{i})_{i}$ such that
\begin{aligned} T_K(e_i) = \lambda_i e_i, \quad i \in \mathbb N.\end{aligned}
Moreover, $K$ has the expansion
\begin{aligned} K(x,y) = \sum_{i} \lambda_i e_i(x) e_i(y), \quad x,y \in \mathcal{X},\end{aligned}
where the convergence of the series above holds absolutely and uniformly.

I'll skip the proof.

Among other things, Mercer's theorem provides a framework for embedding an element of $\mathcal{X}$ into an element of $\ell^2(\mathbb N)$ (for its definition, see Example-4 in the section on Hilbert spaces in the previous part). More concretely, given the eigenfunctions and eigenvalues guaranteed by Mercer's theorem, we may define a mapping $\Phi : \mathcal{X} \to \ell^2(\mathbb N)$ as follows

\begin{aligned} \mathcal{X} \ni x \mapsto \Phi(x) := \left( \sqrt{\lambda_i} e_i(x) \right)_{i \in \mathbb N} \in \ell^2(\mathbb N).\end{aligned}
Therefore, we have
\begin{aligned} \left\langle \Phi(x), \Phi(y) \right\rangle = \sum_{i=1}^{\infty} \lambda_i e_i(x) e_i(y) = K(x,y).\end{aligned}

This is the well-known "kernel trick". Let us connect Mercer's theorem to the spectral theorem.

Let $\mathcal{X} = [d] := \{1, 2, \ldots, d\}$ along with the Hamming metric be our compact metric space. Let $\mu(\{i\}) = 1$ for all $i \in [d]$ be the counting measure on $\mathcal{X}$. Any function $f : \mathcal{X} \to \mathbb{C}$ is equivalent to the $d$-dimensional vector $[f(1), \ldots, f(d)]$, and any kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is continuous, satisfies the Hilbert-Schmidt condition and is equivalent to a $d \times d$ normal matrix $\mathbf{K}$ where $\mathbf{K}_{i,j} = K(i, j)$. The Hilbert-Schmidt operator reduces to

\begin{aligned} T_K(f)(x) = \int_{\mathcal{X}} K(x, y) f(y) \,\mathrm{d}\mu(y) = \sum_{y=1}^{d} K(x,y) f(y).\end{aligned}
Mercer's theorem then states that there exists a set of eigenfunctions $v_1, \ldots, v_d$ (for our $\mathcal{X}$ they are equivalent to vectors) and corresponding eigenvalues $\lambda_1, \ldots, \lambda_d$ such that
\begin{aligned} \mathbf{K} = \sum_{i=1}^{d} \lambda_i v_i v_i^*,\end{aligned}
which is exactly the spectral theorem.
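
A small numerical sketch of this finite-dimensional picture (the Gaussian Gram matrix and points are arbitrary; NumPy is assumed): eigendecompose $\mathbf{K}$, reconstruct it from its eigenpairs, and check that the rows of $\mathbf{U}\sqrt{\mathbf{\Lambda}}$ play the role of the Mercer feature map $\Phi$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

K = np.array([[gaussian_kernel(x, y) for y in X] for x in X])

# Spectral decomposition of the real symmetric matrix K.
eigvals, V = np.linalg.eigh(K)                 # columns of V are orthonormal eigenvectors

# K = sum_i lambda_i v_i v_i^T (real symmetric case).
K_reconstructed = sum(lam * np.outer(v, v) for lam, v in zip(eigvals, V.T))
assert np.allclose(K, K_reconstructed)

# Finite-dimensional Mercer feature map: Phi(x_i) = (sqrt(lambda_j) v_j(i))_j.
Phi = V * np.sqrt(np.clip(eigvals, 0, None))   # row i is Phi(x_i)
assert np.allclose(Phi @ Phi.T, K)             # <Phi(x_i), Phi(x_j)> = K(x_i, x_j)
```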

Operations on Kernels

Let us now consider how various algebraic operations on kernels affect the corresponding Hilbert spaces. All this (and a lot more) can be found in the seminal paper by Aronszajn "Theory of Reproducing Kernels".

I state the following theorems without proof to illustrate how operations on kernels are done.

Sums of kernels

Theorem 7: Suppose that $\mathcal{H}_1$ and $\mathcal{H}_2$ are both RKHSs with kernels $K_1$ and $K_2$, respectively. Then the space
\begin{aligned} \mathcal{H} = \mathcal{H}_1 + \mathcal{H}_2 := \{f_1 + f_2 \, : \, f_1 \in \mathcal{H}_1 \text{ and } f_2 \in \mathcal{H}_2 \}\end{aligned}
with the norm
\begin{aligned} \lVert f \rVert^{2}_{\mathcal{H}} := \inf \left\{ \lVert f_1 \rVert^{2}_{\mathcal{H}_1} + \lVert f_2 \rVert^{2}_{\mathcal{H}_2} \, : \, f = f_1 + f_2,\, f_1 \in \mathcal{H}_1,\, f_2 \in \mathcal{H}_2 \right\}\end{aligned}
is an RKHS with the kernel $K = K_1 + K_2$.

Products of kernels

Let us first define the notion of the tensor product of two (separable) Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ of functions, say with domains $\mathcal{X}_1$ and $\mathcal{X}_2$.

Definition 27: Consider the set of functions $h : \mathcal{X}_1 \times \mathcal{X}_2 \to \mathbb{C}$ given by
\begin{aligned} \mathcal{H} = \left\{ h = \sum_{i=1}^{n} u_i v_i \, : \, n \in \mathbb N \text{ and } u_i \in \mathcal{H}_1,\, v_i \in \mathcal{H}_2 \text{ for all } i \in [n] \right\},\end{aligned}
where $(u_i v_i)(x, y) := u_i(x) v_i(y)$. We define an inner product on $\mathcal{H}$ as follows: for $h = \sum_{i=1}^{n} u_i v_i$ and $g = \sum_{j=1}^{m} w_j x_j$ in $\mathcal{H}$ define
\begin{aligned} \left\langle h,g \right\rangle := \sum_{i=1}^{n} \sum_{j=1}^{m} \left\langle u_i,w_j \right\rangle_{\mathcal{H}_1} \left\langle v_i,x_j \right\rangle_{\mathcal{H}_2}.\end{aligned}
The completion of $\mathcal{H}$ with respect to this inner product is a Hilbert space, called the tensor product of $\mathcal{H}_1$ and $\mathcal{H}_2$. We denote it by $\mathcal{H}_1 \otimes \mathcal{H}_2$.

We can now state the theorem for product of kernels.

Theorem 8: Suppose that $\mathcal{H}_1$ and $\mathcal{H}_2$ are RKHSs of real-valued functions with domains $\mathcal{X}_1$ and $\mathcal{X}_2$, and equipped with kernels $K_1$ and $K_2$, respectively. Then the tensor product space $\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2$ is an RKHS of functions with domain $\mathcal{X}_1 \times \mathcal{X}_2$, and with kernel function $K : (\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2) \to \mathbb R$ defined by
\begin{aligned} K((x,s), (y,t)) := K_1(x,y) K_2(s,t).\end{aligned}
$K$ is called the tensor product of the kernels $K_1$ and $K_2$, and is denoted by $K = K_1 \otimes K_2$.
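
As a numerical aside (not taken from the references; the kernels and points below are arbitrary choices, and NumPy is assumed): if we evaluate $K_1 \otimes K_2$ on the grid $\{(x_i, s_j)\}$ with the pairs ordered lexicographically, its Gram matrix is the Kronecker product of the two individual Gram matrices, and in particular it is positive.

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.standard_normal((3, 2))   # points in the first domain
X2 = rng.standard_normal((4, 2))   # points in the second domain

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

K1 = np.array([[gaussian_kernel(x, y) for y in X1] for x in X1])
K2 = np.array([[gaussian_kernel(s, t) for t in X2] for s in X2])

# Tensor-product kernel evaluated on all pairs (x_i, s_j), ordered lexicographically.
pairs = [(i, j) for i in range(len(X1)) for j in range(len(X2))]
K = np.array([[K1[i, k] * K2[j, l] for (k, l) in pairs] for (i, j) in pairs])

assert np.allclose(K, np.kron(K1, K2))          # the Gram matrix is a Kronecker product
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)  # and hence positive
print("Gram matrix of K1 (x) K2 has shape", K.shape)
```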

Other operations

We can similarly define more operations on kernels; a numerical check of all of them follows the list:

  1. If $K$ is a valid kernel and $\alpha \geq 0$, then $\alpha K$ is a valid kernel.

  2. If $K$ is a valid kernel and $\alpha \geq 0$, then $K + \alpha$ is a valid kernel.

  3. Combining these results, any linear combination of valid kernels with nonnegative coefficients is again a valid kernel; more generally, for any polynomial $P$ with nonnegative coefficients, the composition $P \circ K$ is a valid kernel whenever $K$ is.

  4. If $K$ is a valid kernel, then $\exp(K)$ is a valid kernel.
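
Here is the promised check, a sketch only (the base kernels, points, and tolerance are arbitrary; NumPy is assumed): each operation is applied entrywise to Gram matrices, which is how it acts on the corresponding kernels evaluated at the chosen points, and the result is tested for positivity.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 2))

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def linear_kernel(x, y):
    return x @ y

def gram(kernel, X):
    return np.array([[kernel(x, y) for y in X] for x in X])

def is_positive(M, tol=1e-10):
    return np.all(np.linalg.eigvalsh(M) >= -tol)

K1, K2 = gram(gaussian_kernel, X), gram(linear_kernel, X)
alpha = 2.7

candidates = {
    "K1 + K2": K1 + K2,                              # sum of kernels
    "alpha * K1": alpha * K1,                        # nonnegative scaling
    "K1 + alpha": K1 + alpha,                        # adding a nonnegative constant
    "K1 * K2 (entrywise)": K1 * K2,                  # product of kernels
    "3*K1^2 + 2*K1 + 1": 3 * K1 ** 2 + 2 * K1 + 1,   # polynomial with nonnegative coefficients
    "exp(K1)": np.exp(K1),                           # exponential of a kernel
}

for name, M in candidates.items():
    print(f"{name}: positive = {is_positive(M)}")
```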

Representer theorem

We are now at a stage where we can put all this theory to use in machine learning. Specifically, we will develop the Representer theorem, which allows many optimization problems over an RKHS to be reduced to relatively simple calculations involving the Gram matrix.

Let us start with a functional analytic viewpoint of supervised learning. Suppose we are given empirical data

\begin{aligned} (x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal Y,\end{aligned}
where $\mathcal{X}$ is a nonempty set, and for now $\mathcal Y = \mathbb R$. The outputs come from an unknown function $g : \mathcal{X} \to \mathbb R$, i.e., we assume
\begin{aligned} y_i = g(x_i), \quad i \in [n].\end{aligned}

We need to find some function $f^*$ which "best" approximates $g$. A natural way to formalize the notion of "best" is to limit ourselves to an RKHS $\mathcal{H}$ which contains functions of the form $f : \mathcal{X} \to \mathbb R$ and choose

\begin{aligned} f^* = \argmin_{f \in \mathcal{H}} \left\lVert f \right\rVert \text{ such that } f(x_i) = y_i \text{ for } i \in [n].\end{aligned}

This optimization problem is feasible whenever there exists at least one function $f \in \mathcal{H}$ that fits the data exactly. Denote by $y$ the vector $[y_1, \ldots, y_n]^\top$. It can be shown that if $\mathbf{K}$ is the Gram matrix of the kernel $K$ with respect to $x_1, \ldots, x_n$, then feasibility is equivalent to $y \in \text{range}(\mathbf{K})$. As we will see, this is a special case of the Representer theorem.
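
Here is a small sketch of this special case (synthetic data, NumPy assumed, and the pseudoinverse is just one convenient way to obtain a coefficient vector): when $y \in \text{range}(\mathbf{K})$, the coefficients $\alpha = \mathbf{K}^{+} y$ give an interpolant $f^* = \sum_i \alpha_i k_{x_i}$ that fits the data exactly.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((5, 2))
y = rng.standard_normal(5)

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

K = np.array([[gaussian_kernel(x, z) for z in X] for x in X])

# Here K is (numerically) invertible, so y lies in range(K) and the problem is feasible.
alpha = np.linalg.pinv(K) @ y

def f_star(x):
    """Interpolant f* = sum_i alpha_i k_{x_i}."""
    return sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X))

fitted = np.array([f_star(x) for x in X])
print(np.max(np.abs(fitted - y)))   # ~ 0: the constraints f*(x_i) = y_i hold
```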

In a realistic setting we assume that we have noisy observations, i.e.,

\begin{aligned} y_i = g(x_i) + \varepsilon_i, \quad i \in [n],\end{aligned}
where the $\varepsilon_i$'s denote the noise. Then the constraint of exact fitting is no longer desirable, and we model the "best" approximation by introducing a loss function which represents how close our approximation is to the observed outputs. More concretely, let $L_y : \mathbb R^n \to \mathbb R$ be a continuous function. Then we can define our cost function as
\begin{aligned} J(f) = \lVert f \rVert^2 + L_y(f(x_1), \ldots, f(x_n)).\end{aligned}

Theorem 9 (Representer theorem): If $f^*$ is a function such that \begin{aligned} J(f^*) = \inf_{f \in \mathcal{H}} J(f),\end{aligned} then $f^*$ is in the span of the functions $k_{x_1}, \ldots, k_{x_n}$, i.e.,
\begin{aligned} f^*(\cdot) = \sum_{i=1}^{n} \alpha_i k_{x_i}(\cdot) \quad \text{for some } \alpha_1, \ldots, \alpha_n \in \mathbb{C}.\end{aligned}

I'll skip the proof which can be found in Paulsen and Raghupathi's book An Introduction to the Theory of Reproducing Kernel Hilbert Spaces or in Schölkopf, Herbrich and Smola's paper A Generalized Representer Theorem.

As an example, $L_y$ could be the squared loss $L_y(f(x_1), \ldots, f(x_n)) = \sum_{i=1}^n (y_i - f(x_i))^2$. If we assume $L_y$ to be convex then the minimizer of $J$ exists and is unique. Recall that $E \subseteq \mathbb R^k$ is a convex set if $\lambda x + (1-\lambda) y \in E$ whenever $x,y \in E$ and $0 \le \lambda \le 1$. A real-valued function $L$ defined on a convex set $E$ is a convex function if $L(\lambda x + (1-\lambda) y) \leq \lambda L(x) + (1-\lambda) L(y)$ whenever $x, y \in E$ and $0 \le \lambda \le 1$. Note that a convex function is continuous on the interior of its domain. Also check out the Alexandrov theorem and Rademacher's theorem.

The Representer theorem allows us to reduce the infinite-dimensional problem of optimizing over an RKHS to an $n$-dimensional problem.
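
As one concrete application (a minimal sketch, not the only formulation: I add a weight $\lambda$ on $\lVert f \rVert^2$, with $\lambda = 1$ corresponding to the cost $J$ above; NumPy is assumed), the Representer theorem with the squared loss reduces the problem to the $n \times n$ linear system $(\mathbf{K} + \lambda I)\alpha = y$, i.e., kernel ridge regression.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)   # noisy observations of g = sin

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

K = np.array([[gaussian_kernel(x, z) for z in X] for x in X])
lam = 1.0   # weight on ||f||^2; lam = 1 matches the cost J above

# Representer theorem: f* = sum_i alpha_i k_{x_i}. With the squared loss, substituting
# this form into J and minimizing over alpha leads to (K + lam * I) alpha = y.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f_star(x):
    return sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X))

print(f_star(np.array([0.5])), np.sin(0.5))   # the fit roughly tracks g at a test point
```

Everything here is expressed through Gram-matrix algebra; the RKHS itself never has to be represented explicitly.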

Epilogue

I barely scratched the surface of reproducing kernel Hilbert space theory. Some resources I recommend that do a much better job than me in explaining this theory and which I used as reference are:

  1. Paulsen V and Raghupathi M: An Introduction to the Theory of Reproducing Kernel Hilbert Spaces

  2. Wainwright M: High-Dimensional Statistics

  3. Schölkopf B, Herbrich R and Smola A.J: A Generalized Representer Theorem (COLT 2001)

  4. Aronszajn N: Theory of Reproducing Kernels (1950)

In the next article I will discuss kernel approximation methods. In particular I will focus on random Fourier features, developed by Rahimi and Recht, which I am using in my work.