This is the second post in my series on kernels. In the first part I introduced the functional analysis background for kernel theory. I highly recommend you read it before continuing; I will frequently refer to it and use the same notation. In this post I aim to introduce the fundamental results like Mercer's theorem and the representer theorem.
We already defined the notion of reproducing kernels for an RKHS (Definition 20). We now turn our attention to obtaining necessary and sufficient conditions for a function K:X×X→C to be the reproducing kernel for some RKHS. But first we must define kernel functions.
Definition 21: Let A=(ai,j) be an n×n complex matrix. Then A is called positive if for every α1,…,αn∈C we have
$$\sum_{i,j=1}^{n} \overline{\alpha_i}\,\alpha_j\, a_{i,j} \ge 0.$$
We denote this by A≥0.
Note that if we define a vector x∈Cn to be such that its ith component is α_i, then the condition above can be rewritten as
⟨Ax,x⟩≥0.
Also note that if A≥0 then A=A^*, where A^* denotes the conjugate transpose (Hermitian adjoint) of A; a matrix with A=A^* is called self-adjoint. Therefore positivity gives us self-adjointness for free when dealing with complex matrices. Things aren't so elegant for real matrices: in the real case we need to explicitly require that A is symmetric, in addition to the condition stated in the definition above. Therefore, we often use the following as the definition of positive matrices:
Definition 21': An n×n matrix A is positive, in symbols A≥0, if it is self-adjoint and if ⟨Ax,x⟩≥0 for all x∈Cn.
Positive matrices are also alternatively called positive semidefinite or nonnegative matrices.
The following lemma connects the positivity of a matrix to its eigenvalues.
Lemma 2: A matrix A≥0 if and only if A=A∗ and every eigenvalue of A is nonnegative.
Let us first suppose A≥0; then A=A^* by definition. Now if λ is an eigenvalue of A and v is a corresponding eigenvector, we have
0≤⟨Av,v⟩=⟨λv,v⟩=λ⟨v,v⟩.
Since ⟨v,v⟩>0 (an eigenvector is nonzero), λ must be a nonnegative number.
For the other direction, find an orthonormal basis v_1,…,v_n consisting of eigenvectors of A (it exists by the spectral theorem), where v_i corresponds to the eigenvalue λ_i. Then for any x = ∑_{i=1}^n α_i v_i we have ⟨Ax,x⟩ = ∑_{i=1}^n λ_i |α_i|^2. Since every λ_i ≥ 0, we get ⟨Ax,x⟩ ≥ 0 and A must be positive.
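To make Lemma 2 concrete, here is a small numpy sanity check; the matrix A = C^*C below is an arbitrary example of my own, chosen because matrices of this form are always positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# A = C^* C is always positive, since <Ax, x> = ||Cx||^2 >= 0 for every x.
C = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
A = C.conj().T @ C

# Lemma 2: A >= 0 iff A = A^* and every eigenvalue of A is nonnegative.
print(np.allclose(A, A.conj().T))   # True: A is self-adjoint
print(np.linalg.eigvalsh(A))        # all eigenvalues are >= 0 (up to round-off)

# The quadratic form <Ax, x> = x^* A x is nonnegative for any choice of x.
x = rng.standard_normal(3) + 1j * rng.standard_normal(3)
print(np.vdot(x, A @ x).real)       # >= 0
```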
We are now ready to define a kernel function.
Definition 22: Let X be a set, then K:X×X→C is called a kernel function if for every n∈N and for every choice of {x1,…,xn}⊆X, the matrix (K(xi,xj))≥0. We will use the notation K≥0 to denote that the function K is a kernel function.
Kernel functions are alternatively called positive definite functions or positive semidefinite functions.
Definition 23: Given a kernel function K:X×X→C and points x1,…,xn∈X, the n×n matrix (K(xi,xj)), denoted K, is called the Gram matrix of K with respect to x1,…,xn.
Linear kernels: When X=Rd, we can define the linear kernel function as
K(x,y)=⟨x,y⟩.
It is clearly a symmetric function of its arguments, and hence self-adjoint. To prove positivity, let x_1,…,x_n ∈ R^d be an arbitrary collection of points, and consider its Gram matrix K, i.e., K_{i,j} = K(x_i,x_j) = ⟨x_i,x_j⟩. Then for any α ∈ R^n, we have
$$\langle K\alpha, \alpha\rangle = \sum_{i,j=1}^{n} \alpha_i \alpha_j \langle x_i, x_j\rangle = \Big\lVert \sum_{i=1}^{n} \alpha_i x_i \Big\rVert^2 \ge 0.$$
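Here is a quick numerical illustration of the computation above, a minimal numpy sketch on randomly generated points (the sample sizes are arbitrary choices of mine).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
X = rng.standard_normal((n, d))             # rows are the points x_1, ..., x_n in R^d

# Gram matrix of the linear kernel: K_{i,j} = <x_i, x_j>.
K = X @ X.T

alpha = rng.standard_normal(n)
quad_form = alpha @ K @ alpha               # sum_{i,j} alpha_i alpha_j <x_i, x_j>
norm_sq = np.linalg.norm(X.T @ alpha) ** 2  # || sum_i alpha_i x_i ||^2

print(np.isclose(quad_form, norm_sq))       # True: the two quantities agree
print(np.linalg.eigvalsh(K) >= -1e-10)      # all eigenvalues are nonnegative, so K >= 0
```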
Polynomial kernels: A natural generalization of the linear kernel on Rd is the homogeneous polynomial kernel
K(x,y)=(⟨x,y⟩)m
of degree m≥2, also defined on Rd. It is clearly a symmetric function. To prove positivity, note that
$$K(x,y) = \Big(\sum_{i=1}^{d} x_i y_i\Big)^{m}.$$
This will have $D = \binom{m+d-1}{m}$ monomials, so to simplify the analysis let's take m=2. Then
$$K(x,y) = \sum_{i=1}^{d} x_i^2 y_i^2 + 2 \sum_{i<j} x_i x_j y_i y_j.$$
In this case $D = \binom{d+1}{2} = d + \binom{d}{2}$. Define a mapping Φ: R^d → R^D such that
$$\Phi(x) = \big[x_1^2, \dots, x_d^2, \sqrt{2}\, x_1 x_2, \dots, \sqrt{2}\, x_{d-1} x_d\big]^{\top}.$$
Then
K(x,y)=⟨Φ(x),Φ(y)⟩.
Following the same argument as in the first example, we can verify that the Gram matrix thus formed is indeed positive.
The mapping x↦Φ(x) is often referred to as a feature map. We see that dealing with elements in the feature space, i.e., the codomain of Φ, is computationally expensive. The relation K(x,y)=⟨Φ(x),Φ(y)⟩ allows us to compute the inner products using the kernel function instead of actually taking the inner product in a very high dimensional space. We will see that this "kernel trick" holds for many kernel functions when we discuss Mercer's theorem.
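To see the kernel trick in action for the degree-2 polynomial kernel, here is a small numpy sketch; the helper phi below is my own explicit implementation of the feature map Φ described above.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d = 4
x = rng.standard_normal(d)
y = rng.standard_normal(d)

def phi(v):
    """Degree-2 feature map: the squares first, then the sqrt(2)-scaled cross terms."""
    squares = v ** 2
    cross = np.array([np.sqrt(2) * v[i] * v[j]
                      for i, j in combinations(range(len(v)), 2)])
    return np.concatenate([squares, cross])

lhs = (x @ y) ** 2        # kernel evaluation (<x, y>)^2: an O(d) computation
rhs = phi(x) @ phi(y)     # explicit inner product in R^D with D = d(d+1)/2

print(np.isclose(lhs, rhs))   # True: the kernel computes the feature-space inner product
```

For m = 2 and small d the difference is negligible, but for larger m and d the feature space dimension D blows up while the kernel evaluation stays O(d).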
Gaussian kernels: Given some compact subset X⊆R^d and a parameter σ > 0, the Gaussian kernel is
$$K(x,y) = \exp\left(-\frac{\lVert x - y\rVert^2}{2\sigma^2}\right), \qquad x, y \in X.$$
It is clearly a symmetric function of its arguments; proving that it is a kernel function takes a bit more work than in the previous two examples, and I will not do that here.
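The following snippet builds the Gaussian Gram matrix for a handful of random points and checks its eigenvalues; this is only an empirical spot check for one point set, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 6, 2, 1.5
X = rng.standard_normal((n, d))

# Pairwise squared distances ||x_i - x_j||^2, then the Gaussian Gram matrix.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Empirically, all eigenvalues are nonnegative (up to round-off), so K >= 0.
print(np.linalg.eigvalsh(K))
```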
Let us return to the characterization of reproducing kernels. We will now prove that a function is a kernel function if and only if there is an RKHS for which it is the reproducing kernel. At this point recall Theorem-3 which states that an RKHS admits a unique reproducing kernel.
Theorem 4: Let X be a set and let H be an RKHS on X with reproducing kernel K. Then K is a kernel function.
For some arbitrary n∈N, fix an arbitrary collection x_1,…,x_n ∈ X and α∈C^n. Then, denoting by K the Gram matrix of K with respect to x_1,…,x_n, we have
$$\langle K\alpha, \alpha\rangle = \sum_{i,j=1}^{n} \overline{\alpha_i}\,\alpha_j\, K(x_i, x_j) = \sum_{i,j=1}^{n} \overline{\alpha_i}\,\alpha_j\, \langle k_{x_j}, k_{x_i}\rangle = \Big\langle \sum_{j=1}^{n} \alpha_j k_{x_j},\ \sum_{i=1}^{n} \alpha_i k_{x_i} \Big\rangle = \Big\lVert \sum_{i=1}^{n} \alpha_i k_{x_i} \Big\rVert^2 \ge 0.$$
What does it mean in the above proof if we have equality, that is, if ⟨Kα,α⟩=0? This happens if and only if ‖∑_{i=1}^n α_i k_{x_i}‖ = 0. But then for every f∈H we have ∑_{i=1}^n \overline{α_i} f(x_i) = ⟨f, ∑_i α_i k_{x_i}⟩ = 0. Thus, in this case there is an equation of linear dependence between the values of every function in H at this finite set of points.
Now let us state the converse of Theorem-4. It is a deep result in RKHS theory known as the Moore–Aronszajn theorem.
Theorem 5 [Moore–Aronszajn theorem]: Let X be a set and let K:X×X→C be a kernel function. Then there exists a unique reproducing kernel Hilbert space H of functions on X such that K is the reproducing kernel of H.
In light of these two theorems we have the following notation.
Definition 24: Given a kernel function K:X×X→C, we let HK denote the unique RKHS with the reproducing kernel K.
It is not an easy problem to start with a kernel function K on some set X and give a concrete description of HK. I will not discuss this here, but there are plenty of resources available where this problem is treated in detail. I recommend Paulsen and Raghupathi's An Introduction to the Theory of Reproducing Kernel Hilbert Spaces.
Recall the spectral theorem for finite-dimensional vector spaces: a linear operator T:V→V on a finite-dimensional complex vector space V is normal, i.e., TT^*=T^*T, if and only if V has an orthonormal basis consisting of eigenvectors of T. This implies that if U=[v_1,…,v_n] is a unitary matrix containing the ith eigenvector in its ith column and Λ=diag(λ_1,…,λ_n) is a diagonal matrix containing the corresponding eigenvalues, then a normal matrix K can be written as
$$K = U \Lambda U^{*} = \sum_{i=1}^{n} \lambda_i v_i v_i^{*}.$$
Mercer's theorem generalizes this decomposition to kernel functions. Let us start by defining a special type of kernel function.
Definition 25: Let X be a compact metric space. A function K:X×X→C is called a Mercer kernel if it is a continuous kernel function.
Recall the space L^2(μ) from Definition-8; here we take the underlying set to be our compact metric space X.
Definition 26: Given a Mercer kernel K:X×X→C, we define a linear operator TK:L2(μ)→L2(μ) as
$$T_K(f)(x) := \int_X K(x,y)\, f(y)\, d\mu(y), \qquad x \in X.$$
We assume that the Mercer kernel satisfies the Hilbert-Schmidt condition, stated as
$$\int_{X \times X} \lvert K(x,y)\rvert^2 \, d\mu(x)\, d\mu(y) < \infty,$$
which ensures that TK is a bounded linear operator on L2(μ).
Operators of this type are known as Hilbert-Schmidt operators.
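To build some intuition for T_K, here is a rough numerical sketch of my own (not taken from any of the references): take X = [0, 1] with the Lebesgue measure, a Gaussian-type Mercer kernel, and approximate the integral by a Riemann sum on a grid.

```python
import numpy as np

# Toy setup: X = [0, 1] with the Lebesgue measure, discretized on a uniform grid.
m = 200
grid = np.linspace(0.0, 1.0, m)
w = 1.0 / m                                                # quadrature weight for d_mu(y)

K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / 0.1)    # a continuous (Mercer) kernel
f = np.sin(2 * np.pi * grid)                               # some f in L^2(mu)

# (T_K f)(x) = integral of K(x, y) f(y) dmu(y), approximated by a Riemann sum.
TKf = K @ f * w

print(TKf[:5])   # values of T_K f at the first few grid points
```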
We are now ready to state Mercer's theorem.
Theorem 6 [Mercer's theorem]: Suppose that X is a compact metric space and K:X×X→C is a Mercer kernel that satisfies the Hilbert-Schmidt condition. Then there exists an at most countable set of eigenfunctions (e_i)_i of T_K that forms an orthonormal basis of L^2(μ), and a corresponding set of nonnegative eigenvalues (λ_i)_i such that
$$T_K(e_i) = \lambda_i e_i, \qquad i \in \mathbb{N}.$$
Moreover, K has the expansion
$$K(x,y) = \sum_{i} \lambda_i\, e_i(x)\, e_i(y), \qquad x, y \in X,$$
where the convergence of the series above holds absolutely and uniformly.
I'll skip the proof.
Among other things, Mercer's theorem provides a framework for embedding the elements of X into the space ℓ^2(N) (for its definition, see Example-4 in the section on Hilbert spaces in the previous part). More concretely, given the eigenfunctions and eigenvalues guaranteed by Mercer's theorem, we may define a mapping Φ: X → ℓ^2(N) as follows
$$X \ni x \mapsto \Phi(x) := \big(\sqrt{\lambda_i}\, e_i(x)\big)_{i \in \mathbb{N}} \in \ell^2(\mathbb{N}).$$
Therefore, we have
$$\langle \Phi(x), \Phi(y)\rangle = \sum_{i=1}^{\infty} \lambda_i\, e_i(x)\, e_i(y) = K(x,y).$$
This is the well-known "kernel trick". Let us connect Mercer's theorem to the spectral theorem.
Let X=[d]:={1,2,…,d} along with the Hamming metric be our compact metric space. Let μ({i})=1 for all i∈[d] be the counting measure on X. Any function f:X→C is equivalent to the d-dimensional vector [f(1),…,f(d)], and any kernel function K:X×X→C is continuous, satisfies the Hilbert-Schmidt condition and is equivalent to a d×d normal matrix K where Ki,j=K(i,j). The Hilbert-Schmidt operator reduces to
$$T_K(f)(x) = \int_X K(x,y)\, f(y)\, d\mu(y) = \sum_{y=1}^{d} K(x,y)\, f(y).$$
Mercer's theorem then states that there exists a set of eigenfunctions v_1,…,v_d (for our X they are equivalent to vectors) and corresponding eigenvalues λ_1,…,λ_d such that
$$K(x,y) = \sum_{i=1}^{d} \lambda_i\, v_i(x)\, v_i(y), \qquad \text{i.e.,} \qquad K = \sum_{i=1}^{d} \lambda_i v_i v_i^{*},$$
which is exactly the spectral decomposition we recalled above.
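Numerically, this finite-dimensional case is just an eigendecomposition. A minimal numpy sketch, with an arbitrary positive matrix of my own standing in for the kernel:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5

# A kernel function on X = [d] is just a d x d positive matrix; build one as C^T C.
C = rng.standard_normal((d, d))
K = C.T @ C

# Eigendecomposition: the columns of V play the role of the eigenfunctions v_1, ..., v_d.
lam, V = np.linalg.eigh(K)

# Mercer expansion K(x, y) = sum_i lam_i v_i(x) v_i(y), i.e., K = V diag(lam) V^T.
K_reconstructed = sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in range(d))

print(np.allclose(K, K_reconstructed))   # True
print(lam >= -1e-10)                     # the eigenvalues are nonnegative, as promised
```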
Let us now consider how various algebraic operations on kernels affect the corresponding Hilbert spaces. All this (and a lot more) can be found in the seminal paper by Aronszajn "Theory of Reproducing Kernels".
I state the following theorems without proof to illustrate how operations on kernels are done.
Let us first define the notion of tensor product of two (separable) Hilbert spaces H1 and H2 of functions, say with domains X1 and X2.
Definition 27: Consider the set of functions h:X_1×X_2→C of the form
$$H_0 = \Big\{ h = \sum_{i=1}^{n} u_i v_i : n \in \mathbb{N} \text{ and } u_i \in H_1,\ v_i \in H_2 \text{ for all } i \in [n] \Big\},$$
where u_i v_i denotes the function (x_1, x_2) ↦ u_i(x_1) v_i(x_2).
We define an inner product on H_0 as follows: for h = ∑_{i=1}^n u_i v_i and g = ∑_{j=1}^m w_j z_j in H_0, define
$$\langle h, g\rangle := \sum_{i=1}^{n} \sum_{j=1}^{m} \langle u_i, w_j\rangle_{H_1}\, \langle v_i, z_j\rangle_{H_2}.$$
The completion H of H_0 with respect to this inner product is a Hilbert space, called the tensor product of H_1 and H_2. We denote it by H = H_1 ⊗ H_2.
We can now state the theorem for product of kernels.
Theorem 7: Suppose that H_1 and H_2 are RKHSs of real-valued functions with domains X_1 and X_2, equipped with kernels K_1 and K_2, respectively. Then the tensor product space H = H_1 ⊗ H_2 is an RKHS of functions with domain X_1 × X_2, and with kernel function K: (X_1 × X_2) × (X_1 × X_2) → R defined by
K((x,s),(y,t)):=K1(x,y)K2(s,t).
K is called the tensor product of the kernels K1 and K2, and denoted by K=K1⊗K2.
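A quick numerical check of Theorem 7 on a toy setup of my own: if we evaluate K = K_1 ⊗ K_2 on pairs (x_i, s_i), its Gram matrix is the entrywise product of the two individual Gram matrices, and that product is again positive.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X1 = rng.standard_normal((n, 2))   # points x_i in X_1 = R^2
X2 = rng.standard_normal((n, 3))   # points s_i in X_2 = R^3

K1 = X1 @ X1.T                                                   # linear kernel on X_1
K2 = np.exp(-np.sum((X2[:, None] - X2[None, :]) ** 2, axis=-1))  # Gaussian kernel on X_2

# Gram matrix of the product kernel on the pairs (x_i, s_i):
# K((x_i, s_i), (x_j, s_j)) = K1(x_i, x_j) * K2(s_i, s_j).
K = K1 * K2                        # entrywise (Schur) product of the two Gram matrices

print(np.linalg.eigvalsh(K) >= -1e-10)   # the product Gram matrix is again positive
```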
We can similarly define more operations on kernels:
If K is a valid kernel and α≥0, then αK is a valid kernel.
If K is a valid kernel and α≥0, then K+α is a valid kernel.
From these results we can easily see that a linear combination of valid kernels with nonnegative coefficients is again a valid kernel, and more generally, for any polynomial P with nonnegative coefficients, the composition P∘K is a valid kernel whenever K is a valid kernel.
If K is a valid kernel, then exp(K) is a valid kernel.
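These closure properties are easy to spot-check numerically, since each operation acts entrywise on the Gram matrix. The sketch below only verifies positivity for one randomly generated point set, so it is an illustration rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((6, 3))
K = X @ X.T                                  # Gram matrix of a valid (linear) kernel

def is_psd(M, tol=1e-10):
    """Empirical check that a symmetric matrix is positive."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

alpha = 2.5
print(is_psd(alpha * K))                     # alpha * K for alpha >= 0
print(is_psd(K + alpha))                     # K + alpha (added to every entry)
print(is_psd(1.0 + 2.0 * K + 3.0 * K ** 2))  # P(K) entrywise, P with nonnegative coefficients
print(is_psd(np.exp(K)))                     # exp(K) entrywise
```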
We are now at a stage where we can put all this theory to use in machine learning. Specifically, we will develop the representer theorem, which allows many optimization problems over an RKHS to be reduced to relatively simple calculations involving the Gram matrix.
Let us start with a functional analytic viewpoint of supervised learning. Suppose we are given empirical data
(x1,y1),…,(xn,yn)∈X×Y,
where X is a nonempty set and, for now, Y = R. The outputs come from an unknown function g:X→R, i.e., we assume
yi=g(xi),i∈[n].
We need to find some function f∗ which "best" approximates g. A natural way to formalize the notion of "best" is to limit ourselves to an RKHS H which contains functions of the form f:X→R and choose
$$f^* = \underset{f \in H}{\arg\min}\ \lVert f\rVert \quad \text{such that } f(x_i) = y_i \text{ for all } i \in [n].$$
This optimization problem is feasible whenever there exists at least one function f∈H that fits the data exactly. Denote by y the vector [y_1,…,y_n]^⊤. It can be shown that if K is the Gram matrix of K with respect to x_1,…,x_n, then feasibility is equivalent to y ∈ range(K). As we will see, this is a special case of the representer theorem.
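Here is a minimal sketch of this exact-interpolation special case, using a Gaussian kernel on made-up data; the helper gaussian_kernel and all the sizes below are my own choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, sigma = 8, 2, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all rows a of A and b of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_kernel(X, X, sigma)

# Interpolation is feasible iff y is in range(K); here K is (numerically) invertible,
# so we solve K alpha = y and set f = sum_i alpha_i k_{x_i}.
alpha, *_ = np.linalg.lstsq(K, y, rcond=None)

def f(x_new):
    return gaussian_kernel(np.atleast_2d(x_new), X, sigma) @ alpha

print(np.max(np.abs(f(X) - y)))   # essentially zero: the interpolant fits the data exactly
```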
In a realistic setting we assume that we have noisy observations, i.e.,
yi=g(xi)+εi,i∈[n],
where εi's denote the noise. Then the constraint of exact fitting is no longer desirable, and we model the "best" approximation by introducing a loss function which represents how close our approximation is to the observed outputs. More concretely, let Ly:Rn→R be a continuous function. Then we can define our cost function as
$$J(f) = \lVert f\rVert^2 + L_y\big(f(x_1), \dots, f(x_n)\big).$$
Theorem 8 (Representer theorem): If f^* is a function such that $J(f^*) = \inf_{f \in H} J(f)$, then f^* is in the span of the functions k_{x_1},…,k_{x_n}, i.e.,
$$f^*(\cdot) = \sum_{i=1}^{n} \alpha_i\, k_{x_i}(\cdot) \qquad \text{for some } \alpha_1, \dots, \alpha_n \in \mathbb{C}.$$
I'll skip the proof which can be found in Paulsen and Raghupathi's book An Introduction to the Theory of Reproducing Kernel Hilbert Spaces or in Schölkopf, Herbrich and Smola's paper A Generalized Representer Theorem.
As an example, L_y could be the squared loss $L_y(f(x_1), \dots, f(x_n)) = \sum_{i=1}^{n} (y_i - f(x_i))^2$. If we assume L_y to be convex, then the minimizer of J exists and is unique. Recall that E ⊆ R^k is a convex set if λx+(1−λ)y ∈ E whenever x,y ∈ E and 0 ≤ λ ≤ 1, and a real-valued function L defined on a convex set E is a convex function if L(λx+(1−λ)y) ≤ λL(x)+(1−λ)L(y) whenever x,y ∈ E and 0 ≤ λ ≤ 1. Note that a convex function is continuous on the interior of its domain. Also check out the Alexandrov and Rademacher theorems.
The Representer theorem allows us to reduce the infinite-dimensional problem of optimizing over an RKHS to an n-dimensional problem.
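To make the reduction explicit for the squared loss: substituting f = ∑_j α_j k_{x_j} into J gives the finite-dimensional problem of minimizing α^⊤Kα + ‖y − Kα‖^2 over α ∈ R^n, and one minimizer is α = (K + I)^{−1} y. Below is a toy numpy sketch of this; the data, the kernel choice and the helper gaussian_kernel are my own and not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, sigma = 20, 1, 0.5
X = rng.uniform(-2, 2, size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # noisy observations of g = sin

def gaussian_kernel(A, B, sigma=0.5):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = gaussian_kernel(X, X, sigma)

# Representer theorem: the minimizer lives in span{k_{x_1}, ..., k_{x_n}}.
# For J(alpha) = alpha^T K alpha + ||y - K alpha||^2, one minimizer is alpha = (K + I)^{-1} y.
alpha = np.linalg.solve(K + np.eye(n), y)

def f_star(x_new):
    return gaussian_kernel(np.atleast_2d(x_new), X, sigma) @ alpha

print(f_star(np.array([[0.0]])), np.sin(0.0))   # prediction at x = 0 vs. the true value
```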
I barely scratched the surface of reproducing kernel Hilbert space theory. Some resources I recommend that do a much better job than me in explaining this theory and which I used as reference are:
Paulsen V and Raghupathi M: An Introduction to the Theory of Reproducing Kernel Hilbert Spaces
Wainwright M: High-Dimensional Statistics
Schölkopf B, Herbrich R and Smola A.J: A Generalized Representer Theorem (COLT 2001)
Aronszajn N: Theory of Reproducing Kernels (1950)
In the next article I will discuss kernel approximation methods. In particular I will focus on random Fourier features, developed by Rahimi and Recht, which I am using in my work.