Aditya Makkar
Kernels - Part 0
  1. Introduction
  2. Background
    1. Linear Spaces
      1. Examples
    2. Inner Products and Norms
    3. Hilbert Spaces
      1. Examples
    4. Orthonormal Bases
    5. Separable Hilbert Spaces
    6. Riesz Representation Theorem
  3. Reproducing Kernel Hilbert Spaces (RKHS)
    1. Example
  4. Epilogue

Introduction

The word "kernel" is heavily overloaded, but for our purposes it is, intuitively, a similarity measure that can be thought of as an inner product in some feature space. Kernel methods provide an elegant, theoretically well-founded, and powerful approach to solving many learning problems.

We usually have the following framework: The input space \(\mathcal{X}\), which contains our observations/inputs/features, is either not rich enough (for example, if there is no linear boundary separating the two classes in a binary classification problem) or not convenient to work with (for example, if our inputs are strings), and therefore we want to work in some other space, called the feature space \(\mathcal{H}.\) Suppose we have a map which takes our inputs from \(\mathcal{X}\) to \(\mathcal{H}\)

\[\begin{aligned} \Phi \colon \mathcal{X} \to \mathcal{H}.\end{aligned}\]

Then the class of kernels \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) we are interested in consists of those for which it is possible to write

\[\begin{aligned} K(x,y) = \left\langle \Phi(x), \Phi(y) \right\rangle, \quad x,y \in \mathcal{X}.\end{aligned}\]

What kind of functions \(K\) admit such a representation? We aim to be able to answer this question and many others in this series of articles on kernels.
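To make this concrete, here is a minimal numerical sketch (in Python, assuming NumPy; the feature map phi and the degree-2 polynomial kernel are illustrative choices, not anything canonical) showing a kernel computed directly in the input space agreeing with the inner product of explicit features:

```python
import numpy as np

def phi(x):
    """An explicit feature map Phi: R^2 -> R^3 for the homogeneous
    degree-2 polynomial kernel: Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def K(x, y):
    """The same kernel, K(x, y) = (x . y)^2, computed in the input space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(K(x, y))                 # 1.0
print(np.dot(phi(x), phi(y)))  # 1.0 -- agrees with <Phi(x), Phi(y)>
```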

In this blog post, I aim to introduce the necessary concepts from analysis, linear algebra, and functional analysis so as to understand Reproducing Kernel Hilbert Spaces (RKHS) and a few of their basic properties. In the next blog post I will discuss the kernel trick and some important theorems like Mercer's theorem and the representer theorem.

Background

A topic like this requires quite a bit of background before we can get to interesting results like the one above. It is always tempting to assume all the necessary background and get straight to the meat, but I want to write the article I would have found useful when I started learning about kernel theory. That said, I am under no illusion: if you have the time, a much better way to learn this background is to read a functional analysis text.

Linear Spaces

Definition 1: A linear space, or alternatively a vector space, over a field \(\mathbb{F}\) (\(\mathbb{F}\) is \(\mathbb R\) or \(\mathbb{C}\) for our purposes) is a set \(V\) of elements called vectors (the elements of \(\mathbb{F}\) are called scalars) satisfying:

  1. To every pair, \(x\) and \(y\), of vectors in \(V\) there corresponds a vector \(x+y\), called the sum of \(x\) and \(y\), in such a way that

    1. addition is commutative, \(x+y = y+x\),

    2. addition is associative, \(x+(y+z) = (x+y)+z\),

    3. there exists in \(V\) a unique vector \(0\) such that \(x+0=x\) for every vector \(x \in V\), and

    4. to every vector \(x \in V\) there corresponds a unique vector \(-x\) such that \(x+(-x)=0.\)

  2. To every pair, \(\alpha \in \mathbb{F}\) and \(x \in V\), there corresponds a vector \(\alpha x \in V\), called the product of \(\alpha\) and \(x\), in such a way that

    1. multiplication by scalars is associative, i.e., if \(\beta \in \mathbb{F},\) then \(\alpha(\beta x) = (\alpha \beta)x\), and

    2. \(1x = x\) for every vector \(x\), where \(1\) denotes the multiplicative identity of the field \(\mathbb{F}.\)

  3. Finally, the distributive properties hold:

    1. \(\alpha(x+y) = \alpha x + \alpha y\), and

    2. \((\alpha + \beta) x = \alpha x + \beta x\).

Examples

  1. A vector space must contain at least one element, namely \(0.\) In fact, the set \(\{0\}\) is a vector space over any field \(\mathbb{F}.\) This is called the trivial vector space.

  2. For any \(n \in \mathbb N\), the Euclidean space \(\mathbb R^n\) is a vector space over \(\mathbb R\).

  3. For any \(n \in \mathbb N\), \(\mathbb{C}^n\) is a vector space over \(\mathbb R\) or \(\mathbb{C}.\)

    An interesting question: Is there a way of defining multiplication of real numbers by complex numbers so as to make the additive group \(\mathbb{R}\) a vector space over \(\mathbb{C}\)? (See here for an answer)

  4. The set of all polynomials, with complex coefficients, in one variable is a vector space over \(\mathbb{C}.\)

  5. The set \(C[0,1]\) of all continuous complex-valued functions on the unit interval \([0,1]\) is a vector space over \(\mathbb{C}.\)

Definition 2: A linear transformation of a linear space \(V\) into a linear space \(W\) is a mapping \(T: V \to W\) such that
\[\begin{aligned} T(\alpha x + \beta y) = \alpha T(x) + \beta T(y), \quad (x,y \in V;\; \alpha, \beta \in \mathbb{F}).\end{aligned}\]
Definition 3: In the special case in which \(W\) above is a field, \(T\) is called a linear functional.

Note that we often write \(Tx\) instead of \(T(x)\) when \(T\) is linear, hinting that a linear transformation between finite-dimensional vector spaces is equivalent to a matrix-vector product.

Definition 4: Let \(\mu\) be a positive measure on an arbitrary measurable space \((X, \mathcal{F}).\) We define \(L^1(\mu)\) to be the collection of all complex measurable functions \(f\) on \(X\) for which
\[\begin{aligned} \int_X |f| \,\mathrm{d}\mu < \infty.\end{aligned}\]

It can be shown that for every \(f, g \in L^1(\mu)\) and for every \(\alpha, \beta \in \mathbb{C}\), we have \(\alpha f + \beta g \in L^1(\mu)\), and

\[\begin{aligned} \int_X (\alpha f + \beta g) \,\mathrm{d}\mu = \alpha \int_X f \,\mathrm{d}\mu + \beta \int_X g \,\mathrm{d}\mu.\end{aligned}\]

Thus, \(L^1(\mu)\) is a vector space, and the mapping \(F\colon L^1(\mu) \to \mathbb{C}\) defined by

\[\begin{aligned} F(f) = \int_X f \,\mathrm{d}\mu, \quad f \in L^1(\mu)\end{aligned}\]
is a linear functional. (Note that \(f \mapsto \int_X |f| \,\mathrm{d}\mu\) is not linear, because of the absolute value.)
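As a sanity check, here is a small sketch (assuming NumPy) of this linearity for a toy discrete measure, where the integral reduces to a weighted sum; the weights are an arbitrary illustrative choice:

```python
import numpy as np

# A toy discrete measure on X = {0, 1, 2, 3}: mu({i}) = w[i] (weights arbitrary).
w = np.array([0.5, 1.0, 2.0, 0.25])

def integral(f_vals):
    """Integral of f against the discrete measure: a weighted sum."""
    return np.sum(f_vals * w)

f = np.array([1.0, -2.0, 0.5, 3.0])
g = np.array([0.0, 1.0, 1.0, -1.0])
alpha, beta = 2.0, -3.0

# Linearity of the functional F(f) = integral of f d(mu):
print(np.isclose(integral(alpha * f + beta * g),
                 alpha * integral(f) + beta * integral(g)))  # True
```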

Inner Products and Norms

Definition 5: Let \(V\) be a linear space over \(\mathbb{C}\). An inner product on \(V\) is a function \(\langle \cdot , \cdot \rangle \colon V \times V \to \mathbb{C}\) such that for all \(\alpha, \beta \in \mathbb{C}\) and all \(x,y,z \in V\), the following are satisfied:

  1. Linearity in the first argument: \(\langle \alpha x + \beta y, z \rangle = \alpha \langle x,z \rangle + \beta \langle y,z \rangle\),

  2. Conjugate symmetry: \(\left\langle x,y \right\rangle = \overline{\left\langle y,x \right\rangle}\),

  3. Positivity: \(\left\langle x,x \right\rangle \geq 0\),

  4. If \(\left\langle x,x \right\rangle = 0\), then \(x=0\).

A function satisfying only the first three properties is called a semi-inner product on \(V.\)

An immediate consequence of this definition: for any \(y \in V\), the mapping \(F\colon V \to \mathbb{C}\) defined by

\[\begin{aligned} F(x) = \left\langle x,y \right\rangle, \quad x \in V\end{aligned}\]

is a linear functional on \(V\).

Definition 6: If \(V\) is a linear space over \(\mathbb{C}\), a norm on \(V\) is a non-negative function \(\left\lVert \cdot \right\rVert : V \to \mathbb R\) such that for all \(\alpha \in \mathbb{C}\) and all \(x,y \in V\), the following are satisfied:

  1. Subadditivity: \(\left\lVert x+y \right\rVert \leq \left\lVert x \right\rVert + \left\lVert y \right\rVert\),

  2. Absolute homogeneity: \(\left\lVert \alpha x \right\rVert = |\alpha| \left\lVert x \right\rVert\),

  3. Positive definite: \(\left\lVert x \right\rVert = 0 \implies x = 0\).

Given an inner product, we can define a norm as follows:

\[\begin{aligned} \left\lVert x\right\rVert = \sqrt{\left\langle x,x \right\rangle}.\end{aligned}\]

A classic result extremely useful in many proofs:

Theorem 1 (Cauchy-Schwarz inequality): In an inner product space \(V\),
\[\begin{aligned} |\left\langle x,y \right\rangle| \leq \left\lVert x \right\rVert \lVert y \rVert, \quad x,y \in V.\end{aligned}\]
Equality holds if and only if \(x\) and \(y\) are linearly dependent (for example, \(y = \alpha x\) or \(y = 0\)).

The proof is not too difficult.
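Here is a quick numerical check of the inequality (assuming NumPy; the vectors are random, and the convention is linearity in the first argument, matching Definition 5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors in C^5, with <x, y> = sum_i x_i conj(y_i) (Definition 5).
x = rng.standard_normal(5) + 1j * rng.standard_normal(5)
y = rng.standard_normal(5) + 1j * rng.standard_normal(5)

inner = np.vdot(y, x)  # np.vdot(a, b) = sum_i conj(a_i) b_i, so this is <x, y>
norm = lambda v: np.sqrt(np.vdot(v, v).real)

print(abs(inner) <= norm(x) * norm(y))  # True: |<x, y>| <= ||x|| ||y||
# Equality for linearly dependent vectors, e.g. y = 2i * x:
print(np.isclose(abs(np.vdot(2j * x, x)), norm(x) * norm(2j * x)))  # True
```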

Definition 7: The virtue of a norm on a vector space \(V\) is that
\[\begin{aligned} d(x,y) := \left\lVert x-y \right\rVert, \quad x,y \in V\end{aligned}\]
defines a metric on \(V\) so that \((V,d)\) becomes a metric space.

I will often write just \(V\) instead of \((V, d)\) when it's clear from context what the metric \(d\) is.

Definition 8: If \(0 < p < \infty\), \(f\) is a complex measurable function on \(X\), and \(\mu\) is a nonnegative measure on \(X\), define
\[\begin{aligned} \left\lVert f \right\rVert_p := \left( \int_X |f|^p \,\mathrm{d}\mu \right)^{1/p}\end{aligned}\]
and let \(L^p(\mu)\) consist of all \(f\) for which \(\left\lVert f \right\rVert_p < \infty.\) We call \(\left\lVert f \right\rVert_p\) the \(L^p\)-norm of \(f.\)
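For instance, here is a quick numerical sketch (assuming NumPy and Lebesgue measure on \([0,1]\); the uniform-grid average is a crude quadrature) of the \(L^p\)-norm of \(f(x) = x\):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 100_001)

def lp_norm(f_vals, p):
    """||f||_p = (integral of |f|^p over [0,1])^(1/p), via a grid average."""
    return np.mean(np.abs(f_vals) ** p) ** (1.0 / p)

# For f(x) = x we have ||f||_p = (1/(p+1))^(1/p); e.g. p = 2 gives 1/sqrt(3).
print(lp_norm(xs, 2.0))    # ~0.57735
print(1.0 / np.sqrt(3.0))  # 0.57735...
```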

Hilbert Spaces

Definition 9: An inner product space, or alternatively a pre-Hilbert space, is a linear space with an inner product defined on it.

We need the concept of completeness to define a Hilbert space. But before that, let me define Cauchy sequences.

Definition 10: Given a metric space \((M, d)\), a sequence \((x_n)_{n \in \mathbb N}\) of elements in \(M\) is called a Cauchy sequence if for every positive real number \(\varepsilon > 0\) there exists a positive integer \(N \in \mathbb N\) such that \(m, n > N\) implies that \(d(x_m, x_n) < \varepsilon.\)

Recall that we say a sequence \((x_n)_{n \in \mathbb N}\) in a metric space \((M, d)\) converges if there exists a point \(x \in M\) with the following property: for every \(\varepsilon > 0\) there exists a positive integer \(N \in \mathbb N\) such that \(n > N\) implies that \(d(x_n, x) < \varepsilon.\)

It can be shown that every convergent sequence is a Cauchy sequence: Let the sequence \((x_n)_{n \in \mathbb N}\) in a metric space \((M, d)\) converge to \(x \in M.\) If \(\varepsilon > 0\), there is an integer \(N \in \mathbb N\) such that \(d(x_n, x) < \varepsilon\) for all \(n > N.\) Hence

\[\begin{aligned} d(x_n, x_m) \leq d(x_n, x) + d(x, x_m) < 2 \varepsilon\end{aligned}\]
for \(n,m > N.\) Thus \((x_n)_{n \in \mathbb N}\) is a Cauchy sequence.

The converse is not necessarily true. But if it holds in some space, we anoint the space with a special name.

Definition 11: A metric space \((M, d)\) is called complete if every Cauchy sequence of points in \(M\) has a limit that is also in \(M\) or, equivalently, if every Cauchy sequence in \(M\) converges in \(M.\)
Definition 12: A pre-Hilbert space is called a Hilbert space if it is complete in the metric induced by the norm induced by the inner product.

Examples

  1. For any \(n \in \mathbb N\) the sets \(\mathbb R^n\) and \(\mathbb{C}^n\) are Hilbert spaces if we define the inner product to be the usual inner product \(\left\langle x, y \right\rangle := \sum_{i=1}^n x_i \overline{y_i}.\)

  2. The set \(C[0,1]\) of all continuous complex functions on the unit interval \([0,1]\) defined above is an inner product space if we define

    \[\begin{aligned} \left\langle f,g \right\rangle := \int_0^1 f(x) \overline{g(x)} \,\mathrm{d} x,\end{aligned}\]

    but is not a Hilbert space. To see the last claim, consider the sequence of continuous functions \(\{f_n\}\) defined by

    \[\begin{aligned} f_n(x) = \min \{(2x)^n, 1\}.\end{aligned}\]

    It is a Cauchy sequence that does not converge to any point in \(C[0,1]\); see the numerical sketch after this list.

  3. \(L^2(\mu)\) is a Hilbert space, with inner product

    \[\begin{aligned} \left\langle f,g \right\rangle := \int_X f \, \overline{g} \,\mathrm{d}\mu.\end{aligned}\]
  4. The space of square-summable real-valued sequences, namely

    \[\begin{aligned} \ell^2(\mathbb N) := \left\{ (x_n)_{n \in \mathbb N} \; : \; x_n \in \mathbb R,\, \sum_n x_n^2 < \infty \right\}.\end{aligned}\]

    This set, when endowed with the inner product \(\left\langle x,y \right\rangle := \sum_{n \in \mathbb N} x_n y_n\), defines a Hilbert space. It will play an important role in our discussion of eigenfunctions for Reproducing Kernel Hilbert spaces.
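To make example 2 concrete, here is a rough numerical sketch (assuming NumPy; the grid resolution is an arbitrary choice) suggesting that \((f_n)\) is Cauchy in the metric induced by the inner product, while its pointwise limit is the discontinuous step function \(x \mapsto 1\{x \geq 1/2\}\), which is not in \(C[0,1]\):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 100_001)

def f(n):
    """f_n(x) = min((2x)^n, 1): continuous on [0, 1] for every n."""
    return np.minimum((2.0 * xs) ** n, 1.0)

def l2_dist(u, v):
    """Distance induced by <f, g> = integral of f * conj(g) over [0, 1],
    approximated by a uniform-grid average."""
    return np.sqrt(np.mean(np.abs(u - v) ** 2))

# Tail distances shrink, suggesting the sequence is Cauchy ...
for n in [5, 10, 20, 40, 80]:
    print(n, l2_dist(f(n), f(2 * n)))

# ... but the pointwise limit is the step function 1{x >= 1/2},
# which is discontinuous and hence not in C[0,1].
```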

Definition 13: Consider a linear space \(\mathcal{F}\) of functions each of which is a mapping from a set \(X\) into \(\mathbb{F}.\) For \(x \in X\), a linear evaluation functional is a linear functional \(E_x\) that is defined as
\[\begin{aligned} E_x(f) = f(x), \quad f \in \mathcal{F}.\end{aligned}\]

In other words, a linear evaluation functional with respect to \(x \in X\) evaluates each function at \(x.\)

In general, an evaluation functional need not be continuous: we can have \(f_n \to f\) in the norm of the space while \(E_x(f_n)\) does not converge to \(E_x(f).\) Intuitively, this is because a Hilbert space of functions can contain very rough functions. We will later consider a special type of Hilbert space, the Reproducing Kernel Hilbert Space, in which all evaluation functionals are continuous.

A lemma that will be useful later on:

Lemma 1: Let \(\mathcal{H}\) be a Hilbert space and \(L\colon \mathcal{H} \to \mathbb{F}\) be a linear functional. The following statements are equivalent:

  1. \(L\) is continuous.

  2. \(L\) is continuous at \(0.\)

  3. \(L\) is continuous at some point.

  4. \(L\) is bounded, i.e., there is a constant \(c > 0\) such that \(|L(f)| \leq c \left\lVert f\right\rVert\) for every \(f \in \mathcal{H}.\)

It is clear that \((1) \implies (2) \implies (3)\), and \((4) \implies (2).\) Let's show that \((3) \implies (1)\), and \((2) \implies (4).\)

\((3) \implies (1)\): Suppose \(L\) is continuous at \(f \in \mathcal{H}\) and \(g\) is any point in \(\mathcal{H}.\) If \(g_n \to g\) in \(\mathcal{H}\), then \(g_n - g + f \to f.\) By assumption

\[\begin{aligned} L(f) = \lim_{n \to \infty} L(g_n - g + f) = \lim_{n \to \infty} L(g_n) - L(g) + L(f).\end{aligned}\]
Hence \(L(g) = \lim_{n \to \infty} L(g_n).\)

\((2) \implies (4)\): The definition of continuity at \(0\) implies that \(L^{-1}(\{\alpha \in \mathbb{F} : |\alpha| < 1\})\) contains an open ball centered at \(0.\) Let \(\delta > 0\) be the radius of that open ball centered at \(0.\) Then for \(f \in \mathcal{H}\) and \(\left\lVert f\right\rVert < \delta\) we have \(|L(f)| < 1.\) If \(f\) is an arbitrary element of \(\mathcal{H}\) and \(\varepsilon > 0\), then

\[\begin{aligned} \left\lVert \frac{\delta f}{\left\lVert f \right\rVert + \varepsilon} \right\rVert < \delta.\end{aligned}\]
Hence,
\[\begin{aligned} 1 > \left\lvert L\left( \frac{\delta f}{\left\lVert f\right\rVert + \varepsilon} \right) \right\rvert = \frac{\delta }{\left\lVert f \right\rVert + \varepsilon} |L(f)|.\end{aligned}\]
Letting \(\varepsilon \to 0\) we see that \((4)\) holds with \(c = 1/\delta.\)

Orthonormal Bases

We generalize the idea of an orthonormal basis, familiar from linear algebra, to the infinite-dimensional case. This will be needed when we discuss Mercer's theorem.

Definition 14: A collection of vectors \(\{v_{\alpha} \, : \, \alpha \in A \}\) in a Hilbert space \(\mathcal{H}\) for some index set \(A\) is called orthonormal if it satisfies \(\left\langle v_{\alpha},v_{\beta}\right\rangle = \delta_{\alpha \beta}\) where \(\delta_{\alpha \beta}\) is the Kronecker delta, which equals \(1\) if \(\alpha = \beta\) and \(0\) otherwise.
Definition 15: A collection of vectors \(\{v_{\alpha} \, : \, \alpha \in A \}\) in a Hilbert space \(\mathcal{H}\) is called complete if for any \(u \in \mathcal{H}\), \(\left\langle u, v_{\alpha} \right\rangle = 0\) for all \(\alpha \in A\) implies that \(u = 0.\)
Definition 16: An orthonormal basis is a complete orthonormal system.

Note, we can also define an orthonormal basis as a maximal orthonormal set in \(\mathcal{H}.\) To say \(\{v_{\alpha}\}_{\alpha \in A}\) is maximal means that no vector of \(\mathcal{H}\) can be added to \(\{v_{\alpha}\}_{\alpha \in A}\) in such a way that the resulting set is still orthonormal. This happens precisely when there is no \(u \neq 0\) in \(\mathcal{H}\) that is orthogonal to every element of \(\{v_{\alpha}\}_{\alpha \in A}.\)
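A classical example of a countable orthonormal system is the Fourier basis \(e_n(x) = e^{2\pi i n x}\), \(n \in \mathbb{Z}\), in \(L^2[0,1]\). Here is a small numerical sketch of its orthonormality (assuming NumPy; the grid-average approximation of the integral is crude but adequate):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 100_001)

def e(n):
    """Fourier basis element e_n(x) = exp(2*pi*i*n*x) on [0, 1]."""
    return np.exp(2j * np.pi * n * xs)

def inner(f_vals, g_vals):
    """<f, g> = integral of f * conj(g) over [0, 1], via a grid average."""
    return np.mean(f_vals * np.conj(g_vals))

print(inner(e(3), e(3)).real)  # ~1.0
print(abs(inner(e(3), e(5))))  # ~0.0 (up to grid error)
```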

Separable Hilbert Spaces

Another key idea we need is that of separability. Let us define that now.

Definition 17: A topological space is called separable if it contains a countable, dense subset; that is, there exists a sequence \((x_{n})_{n \in \mathbb N}\) of elements of the space such that every nonempty open subset of the space contains at least one element of the sequence.

The notion of separability is closely related to the second-countability of a topological space. Recall that a space is second-countable if it has a countable basis for its topology. It is easy to see that a second-countable space is separable. For a metric space, the converse also holds, and thus the two notions are equivalent.

Definition 18: A Hilbert space is called separable if it is separable as a topological space, i.e., if it contains a countable dense subset. It can be shown that a Hilbert space is separable if and only if it has a countable orthonormal basis, and that every separable, infinite-dimensional Hilbert space is isometrically isomorphic to the space \(\ell^2(\mathbb N)\) of square-summable sequences.

We will be dealing with separable Hilbert spaces in our discussion.

Riesz Representation Theorem

We now come to a very important theorem called the Riesz representation theorem. The name Riesz has many theorems attached to it, but the one relevant to us is the following:

Theorem 2: For each continuous linear functional \(L\) on a Hilbert space \(\mathcal{H}\), there exists a unique \(g \in \mathcal{H}\) such that
\[\begin{aligned} L(f) = \left\langle f,g \right\rangle, \quad f \in \mathcal{H}.\end{aligned}\]

I'll skip the proof as it's not easy and will unnecessarily make this article abstruse.
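In finite dimensions, however, the theorem is easy to see in action: evaluating \(L\) on an orthonormal basis recovers the representer \(g\). A minimal sketch (assuming NumPy; the functional is a random illustrative choice):

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)

# A "black box" continuous linear functional on C^n; here L(f) = sum_i a_i f_i,
# with a chosen at random purely for illustration.
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
L = lambda f: np.sum(a * f)

# Riesz: the representer g must satisfy L(e_i) = <e_i, g> = conj(g_i),
# so g is recovered by evaluating L on the standard basis.
g = np.conj(np.array([L(np.eye(n)[i]) for i in range(n)]))

f = rng.standard_normal(n) + 1j * rng.standard_normal(n)
print(np.isclose(L(f), np.sum(f * np.conj(g))))  # L(f) = <f, g>: True
```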

Side-note: In the mathematical treatment of quantum mechanics, this theorem can be seen as a justification for the popular bra–ket notation.

Reproducing Kernel Hilbert Spaces (RKHS)

Definition 19: Let \(\mathcal{X}\) be a set. We will call a set \(\mathcal{H}\) of functions from \(\mathcal{X}\) to \(\mathbb{F}\) a Reproducing Kernel Hilbert Space (RKHS) on \(\mathcal{X}\) if

  1. \(\mathcal{H}\) is a vector space,

  2. \(\mathcal{H}\) is endowed with an inner product, \(\langle\cdot,\cdot\rangle\), with respect to which \(\mathcal{H}\) is a Hilbert space,

  3. for every \(x \in \mathcal{X}\), the linear evaluation functional \(E_x \colon \mathcal{H} \to \mathbb{F}\) is bounded (or, equivalently, continuous, by Lemma 1).

If \(\mathcal{H}\) is an RKHS on \(\mathcal{X}\), then an application of the Riesz representation theorem shows that the linear evaluation functional is given by the inner product with a unique vector in \(\mathcal{H}.\) Therefore, for each \(x \in \mathcal{X}\), there exists a unique vector \(k_x \in \mathcal{H}\), such that for every \(f \in \mathcal{H}\),

\[\begin{aligned} f(x) = E_x(f) = \left\langle f,k_x \right\rangle.\end{aligned}\]
Definition 20: The function \(k_x\) just defined is called the reproducing kernel for the point \(x.\) The function \(K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{F}\) defined by
\[\begin{aligned} K(x,y) = k_y(x)\end{aligned}\]
is called the reproducing kernel for \(\mathcal{H}.\)

Note that we have

\[\begin{aligned} K(x,y) = k_y(x) = \left\langle k_y, k_x\right\rangle = \overline{\left\langle k_x, k_y\right\rangle} = \overline{K(y,x)}.\end{aligned}\]

Also,

\[\begin{aligned} \left\lVert E_y \right\rVert^2 = \left\lVert k_y \right\rVert^2 = \left\langle k_y, k_y \right\rangle = K(y,y).\end{aligned}\]

Example

The first question that comes to mind is whether any reproducing kernel Hilbert spaces exist at all. The following example answers this question in the affirmative.

We saw before that \(\mathbb{C}^n\) is a Hilbert space. We can show that \(\mathbb{C}^n\) is in fact an RKHS. Let \(\mathcal{X} = \{1, 2, \ldots, n\}\); then we can view \(v \in \mathbb{C}^n\) as a function \(V \colon \mathcal{X} \to \mathbb{C}\), where \(V(j) = v_j.\) The linear evaluation functionals are of course bounded for every \(x \in \mathcal{X}\), and we have

\[\begin{aligned} V(j) = v_j = \left\langle V,e_j \right\rangle , \quad j \in \mathcal{X},\end{aligned}\]
where \(e_j\) is the vector with \(1\) in the \(j^{\text{th}}\) position and \(0\) everywhere else. Therefore, the reproducing kernel for the point \(x \in \mathcal{X}\) is \(e_x\), and the reproducing kernel \(K\) can be thought of as the identity matrix.
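A small sketch of this example (assuming NumPy, and using 0-based indices \(\mathcal{X} = \{0, \ldots, n-1\}\) for convenience):

```python
import numpy as np

n = 5
rng = np.random.default_rng(2)

# A function V on X = {0, ..., n-1}, stored as the vector v in C^n.
v = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# The reproducing kernel of C^n: k_x = e_x, so K(x, y) = <k_y, k_x> = I[x, y].
K = np.eye(n)

for j in range(n):
    k_j = K[:, j]                             # k_j = e_j (real-valued)
    assert np.isclose(v[j], np.vdot(k_j, v))  # V(j) = <V, k_j>, since k_j is real
print("reproducing property holds; the kernel is the identity matrix")
```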

Can there be multiple reproducing kernels for an RKHS? The following theorem answers this question.

Theorem 3: If an RKHS \(\mathcal{H}\) of functions on a set \(\mathcal{X}\) admits a reproducing kernel, \(K\), then \(K\) is uniquely determined by \(\mathcal{H}.\)
Suppose that there exists another reproducing kernel \(K'\) for \(\mathcal{H}.\) Then
\[\begin{aligned} \left\lVert k_y - k'_y \right\rVert^2 = \left\langle k_y - k'_y, k_y - k'_y \right\rangle = \left\langle k_y - k'_y, k_y \right\rangle - \left\langle k_y - k'_y, k'_y \right\rangle = (k_y - k'_y)(y) - (k_y - k'_y)(y) = 0 \end{aligned}\]
for any \(y \in \mathcal{X}.\) By the positive definiteness of the norm, \(k_y = k'_y\) for every \(y \in \mathcal{X}\), and hence the kernel is unique.

Epilogue

We covered quite a lot of ground in this blog post, but I didn't even define a kernel as it is commonly used in machine learning! In the next post I will do that and cover its fundamental properties.

Some resources I recommend to go into more depth on what's covered here are:

  1. Halmos, P.: Finite-Dimensional Vector Spaces.

  2. Rudin, W.: Real and Complex Analysis (Chapter 4).

  3. Rudin, W.: Functional Analysis.

  4. Conway, J.: A Course in Functional Analysis.