The word "kernel" is heavily overloaded, but for our purposes it is, intuitively, a similarity measure that can be thought of as an inner product in some feature space. Kernel methods provide an elegant, theoretically well-founded, and powerful approach to solving many learning problems.
We usually have the following framework: the input space \(\mathcal{X}\), which contains our observations/inputs/features, is either not rich enough (for example, if there is no linear boundary separating the two classes in a binary classification problem) or not convenient (for example, if our inputs are strings), and therefore we want to work in some other space, which we call the feature space \(\mathcal{H}.\) Suppose we have a map \(\phi \colon \mathcal{X} \to \mathcal{H}\) which takes our inputs from \(\mathcal{X}\) to \(\mathcal{H}.\)
Then the kernels \(K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{C}\) we are interested in are those for which it is possible to write
\[
K(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}} \quad \text{for all } x, y \in \mathcal{X}.
\]
What kind of functions \(K\) admit such a representation? We aim to be able to answer this question and many others in this series of articles on kernels.
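To make this representation concrete, here is a minimal sketch in Python/NumPy. The feature map \(\phi\) below (degree-2 monomial features on \(\mathbb{R}^2\)) is my own illustrative choice, not anything fixed by this article.

```python
import numpy as np

# A hypothetical feature map phi: R^2 -> R^3 (an illustrative choice).
# It maps x = (x1, x2) to (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

# The kernel is the inner product in feature space: K(x, y) = <phi(x), phi(y)>.
def K(x, y):
    return phi(x) @ phi(y)

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(K(x, y))       # 16.0
print((x @ y) ** 2)  # 16.0 -- same value, computed without forming phi
```

Note how the last two lines agree: for this particular \(\phi\), the inner product in \(\mathcal{H}\) equals \(\langle x, y \rangle^2\), so we never need to form \(\phi(x)\) explicitly. This observation, in general form, is the kernel trick discussed in the next post.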
In this blog post, I aim to introduce the necessary concepts from analysis, linear algebra, and functional analysis to understand Reproducing Kernel Hilbert Spaces (RKHS) and a few of their basic properties. In the next blog post I will discuss the kernel trick and some important theorems such as Mercer's theorem and the representer theorem.
A topic like this requires quite a bit of background before we can get to interesting results like the one above. It is always tempting to assume all the necessary background and get straight to the meat, but I want to write the article I would have found useful when I started learning about kernel theory. That said, I am under no illusion: if you have the time, a much better way to learn this background is to read a functional analysis text.
Definition 1: A linear space, or alternatively a vector space, over a field \(\mathbb{F}\) (\(\mathbb{F}\) is \(\mathbb R\) or \(\mathbb{C}\) for our purposes) is a set \(V\) of elements called vectors (the elements of \(\mathbb{F}\) are called scalars) satisfying:
To every pair, \(x\) and \(y\), of vectors in \(V\) there corresponds a vector \(x+y\), called the sum of \(x\) and \(y\), in such a way that
addition is commutative, \(x+y = y+x\),
addition is associative, \(x+(y+z) = (x+y)+z\),
there exists in \(V\) a unique vector \(0\) such that \(x+0=x\) for every vector \(x \in V\), and
to every vector \(x \in V\) there corresponds a unique vector \(-x\) such that \(x+(-x)=0.\)
To every pair, \(\alpha \in \mathbb{F}\) and \(x \in V\), there corresponds a vector \(\alpha x \in V\), called the product of \(\alpha\) and \(x\), in such a way that
multiplication by scalars is associative, i.e., if \(\beta \in \mathbb{F},\) then \(\alpha(\beta x) = (\alpha \beta)x\), and
\(1x = x\) for every vector \(x\), where \(1\) denotes the multiplicative identity of the field \(\mathbb{F}.\)
Finally, the distributive properties hold:
\(\alpha(x+y) = \alpha x + \alpha y\), and
\((\alpha + \beta) x = \alpha x + \beta x\).
A vector space must contain at least one element, namely \(0.\) In fact, the set \(\{0\}\) is a vector space over any field \(\mathbb{F}.\) This is called the trivial vector space.
For any \(n \in \mathbb N\), the Euclidean space \(\mathbb R^n\) is a vector space over \(\mathbb R\).
For any \(n \in \mathbb N\), \(\mathbb{C}^n\) is a vector space over \(\mathbb R\) or \(\mathbb{C}.\)
An interesting question: Is there a way of defining multiplication of real numbers by complex numbers so as to make the additive group \(\mathbb{R}\) a vector space over \(\mathbb{C}\)? (See here for an answer)
The set of all polynomials, with complex coefficients, in one variable is a vector space over \(\mathbb{C}.\)
The set \(C[0,1]\) of all continuous complex-valued functions on the unit interval \([0,1]\) is a vector space over \(\mathbb{C}.\)
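As a small aside for readers who think in code, here is a sketch of why such function spaces are vector spaces: addition and scalar multiplication are defined pointwise, and the result is again a function on \([0,1]\). The helper names `add` and `scale` are my own.

```python
import math

# Vectors here are functions on [0, 1]; the operations are pointwise.
def add(f, g):
    return lambda t: f(t) + g(t)

def scale(alpha, f):
    return lambda t: alpha * f(t)

f = math.sin
g = math.cos

# The "vector" 2f + g is again a continuous function on [0, 1].
h = add(scale(2.0, f), g)
print(h(0.5))  # 2*sin(0.5) + cos(0.5) ~= 1.836
```

Sums and scalar multiples of continuous functions are continuous, which is what keeps these operations inside \(C[0,1]\).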
Note that we often write \(Tx\) instead of \(T(x)\) when \(T\) is linear, hinting that a linear transformation mapping one finite-dimensional vector space to another is equivalent to a matrix-vector product.
It can be shown that for every \(f, g \in L^1(\mu)\) and every \(\alpha, \beta \in \mathbb{C}\), we have \(\alpha f + \beta g \in L^1(\mu)\), and
\[
\int_X (\alpha f + \beta g) \, d\mu = \alpha \int_X f \, d\mu + \beta \int_X g \, d\mu.
\]
Thus, \(L^1(\mu)\) is a vector space, and the mapping \(F\colon L^1(\mu) \to \mathbb{C}\) defined by
\[
F(f) = \int_X f \, d\mu
\]
is a linear functional.
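The linearity claim is easy to sanity-check numerically. Below is a crude sketch using Riemann sums on \([0,1]\) in place of the abstract integral; the functions and coefficients are arbitrary choices of mine.

```python
import numpy as np

# Approximate F(h) = integral of h over [0, 1] by a Riemann sum.
t = np.linspace(0.0, 1.0, 100_000, endpoint=False)
dt = 1.0 / len(t)
integral = lambda h: float(np.sum(h) * dt)

f = np.exp(t)
g = np.sin(t)
alpha, beta = 2.0, -3.0

# Linearity: F(alpha*f + beta*g) = alpha*F(f) + beta*F(g).
lhs = integral(alpha * f + beta * g)
rhs = alpha * integral(f) + beta * integral(g)
print(abs(lhs - rhs) < 1e-9)  # True
```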
Definition 5: Let \(V\) be a linear space over \(\mathbb{C}.\) An inner product on \(V\) is a function \(\langle \cdot , \cdot \rangle \colon V \times V \to \mathbb{C}\) such that for all \(\alpha, \beta \in \mathbb{C}\) and all \(x,y,z \in V\), the following are satisfied:
Linearity in the first argument: \(\langle \alpha x + \beta y, z \rangle = \alpha \langle x,z \rangle + \beta \langle y,z \rangle\),
Conjugate symmetry: \(\left\langle x,y \right\rangle = \overline{\left\langle y,x \right\rangle}\),
Positivity: \(\left\langle x,x \right\rangle \geq 0\),
If \(\left\langle x,x \right\rangle = 0\), then \(x=0\).
A function satisfying only the first three properties is called a semi-inner product on \(V.\)
An immediate consequence of this definition: for any \(y \in V\), the mapping \(F\colon V \to \mathbb{C}\) defined by
\[
F(x) = \langle x, y \rangle
\]
is a linear functional on \(V\).
Definition 6: Let \(V\) be a linear space over \(\mathbb{C}.\) A norm on \(V\) is a non-negative function \(\left\lVert \cdot \right\rVert \colon V \to \mathbb R\) such that for all \(\alpha \in \mathbb{C}\) and all \(x,y \in V\), the following are satisfied:
Subadditivity: \(\left\lVert x+y \right\rVert \leq \left\lVert x \right\rVert + \left\lVert y \right\rVert\),
Absolute homogeneity: \(\left\lVert \alpha x \right\rVert = |\alpha| \left\lVert x \right\rVert\),
Positive definite: \(\left\lVert x \right\rVert = 0 \implies x = 0\).
Given an inner product, we can define a norm as follows:
\[
\lVert x \rVert := \sqrt{\langle x, x \rangle}.
\]
A classic result, extremely useful in many proofs, is the Cauchy-Schwarz inequality: for all \(x, y \in V\),
\[
|\langle x, y \rangle| \leq \lVert x \rVert \, \lVert y \rVert.
\]
The proof is not too difficult.
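If you want to convince yourself empirically before reading a proof, the inequality is easy to test at random, here in \(\mathbb R^5\) with the Euclidean inner product and its induced norm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random check of Cauchy-Schwarz: |<x, y>| <= ||x|| * ||y||.
for _ in range(1000):
    x = rng.normal(size=5)
    y = rng.normal(size=5)
    assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y) + 1e-12

print("Cauchy-Schwarz held in all 1000 trials")
```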
I will often write just \(V\) instead of \((V, d)\) when it's clear from context what the metric \(d\) is.
We need the concept of completeness to define Hilbert spaces. But before that, let me define Cauchy sequences.
Recall that we say a sequence \((x_n)_{n \in \mathbb N}\) in a metric space \((M, d)\) converges if there exists a point \(x \in M\) with the following property: for every \(\varepsilon > 0\) there exists a positive integer \(N \in \mathbb N\) such that \(n > N\) implies that \(d(x_n, x) < \varepsilon.\)
It can be shown that every convergent sequence is a Cauchy sequence: let the sequence \((x_n)_{n \in \mathbb N}\) in a metric space \((M, d)\) converge to \(x \in M.\) If \(\varepsilon > 0\), there is an integer \(N \in \mathbb N\) such that \(d(x_n, x) < \varepsilon\) for all \(n > N.\) Hence
\[
d(x_n, x_m) \leq d(x_n, x) + d(x, x_m) < 2\varepsilon
\]
for all \(n, m > N\), so \((x_n)_{n \in \mathbb N}\) is Cauchy.
The converse is not necessarily true. But if it holds in some space, we anoint the space with a special name.
For any \(n \in \mathbb N\) the sets \(\mathbb R^n\) and \(\mathbb{C}^n\) are Hilbert spaces if we define the inner product to be the usual inner product \(\left\langle x, y \right\rangle := \sum_{i=1}^n x_i \overline{y_i}.\)
The set \(C[0,1]\) of all continuous complex-valued functions on the unit interval \([0,1]\) defined above is an inner product space if we define
\[
\langle f, g \rangle := \int_0^1 f(t) \overline{g(t)} \, dt,
\]
but it is not a Hilbert space. To see the last claim, consider the sequence of continuous functions \(\{f_n\}\) defined by
\[
f_n(t) = \begin{cases} 1 & \text{if } 0 \leq t \leq \tfrac{1}{2}, \\ 1 - n\left(t - \tfrac{1}{2}\right) & \text{if } \tfrac{1}{2} < t < \tfrac{1}{2} + \tfrac{1}{n}, \\ 0 & \text{if } \tfrac{1}{2} + \tfrac{1}{n} \leq t \leq 1. \end{cases}
\]
It is a Cauchy sequence with respect to the induced norm, but its pointwise limit is the (discontinuous) indicator function of \([0, \tfrac{1}{2}]\), so it does not converge to any point in \(C[0,1]\).
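A quick numerical illustration of this failure of completeness, assuming the ramp sequence written above: the pairwise distances \(\lVert f_n - f_m \rVert\) shrink (Cauchy behavior), yet the pointwise limit is a step function that leaves \(C[0,1]\).

```python
import numpy as np

t = np.linspace(0.0, 1.0, 100_001)
dt = t[1] - t[0]

# The continuous ramps: 1 on [0, 1/2], linear down to 0 on [1/2, 1/2 + 1/n].
def f(n):
    return np.clip(1.0 - n * (t - 0.5), 0.0, 1.0)

# The norm induced by the inner product, approximated on the grid.
def norm(h):
    return np.sqrt(np.sum(h ** 2) * dt)

# Distances shrink as n, m grow -- the sequence is Cauchy...
for n in (10, 100, 1000):
    print(n, norm(f(n) - f(2 * n)))

# ...but the pointwise limit is the indicator of [0, 1/2], which is
# discontinuous and hence not an element of C[0, 1].
```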
\(L^2(\mu)\) is a Hilbert space, with inner product
\[
\langle f, g \rangle := \int_X f \overline{g} \, d\mu.
\]
The space of square-summable real-valued sequences, namely
\[
\ell^2 := \Big\{ (x_n)_{n \in \mathbb N} : x_n \in \mathbb R \text{ for all } n, \ \sum_{n \in \mathbb N} x_n^2 < \infty \Big\}.
\]
This set, when endowed with the inner product \(\left\langle x,y \right\rangle := \sum_{n \in \mathbb N} x_n y_n\), is a Hilbert space. It will play an important role in our discussion of eigenfunctions for Reproducing Kernel Hilbert Spaces.
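To get a feel for \(\ell^2\): the sequence \(x_n = 1/n\) is square-summable, and truncating the sum \(\langle x, x \rangle = \sum_n x_n^2\) recovers the familiar value \(\pi^2/6\). A small sketch:

```python
import numpy as np

# x_n = 1/n is a point of ell^2 since sum 1/n^2 converges.
n = np.arange(1, 1_000_001)
x = 1.0 / n

# Truncated inner product <x, x>; the full sum is pi^2 / 6.
print(np.sum(x * x))   # ~1.644933...
print(np.pi ** 2 / 6)  # 1.6449340668...
```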
In other words, a linear evaluation functional with respect to \(x \in X\) evaluates each function at \(x.\)
In general, an evaluation functional need not be continuous: we can have \(f_n \to f\) in \(\mathcal{H}\) while \(E_x(f_n)\) does not converge to \(E_x(f).\) Intuitively, this is because a Hilbert space of functions can contain very rough elements; in \(L^2(\mu)\), for instance, functions are only defined up to sets of measure zero, so the value at a single point carries essentially no information about the norm. We will later consider a special type of Hilbert space, the Reproducing Kernel Hilbert Space, where all evaluation functionals are continuous.
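Here is a concrete instance of this discontinuity in \(L^2[0,1]\) (the spike functions below are my own illustrative choice): \(f_n(t) = \max(0, 1 - nt)\) satisfies \(\lVert f_n \rVert^2 = \tfrac{1}{3n} \to 0\), so \(f_n \to 0\) in norm, yet \(E_0(f_n) = f_n(0) = 1\) for every \(n.\)

```python
import numpy as np

t = np.linspace(0.0, 1.0, 100_001)
dt = t[1] - t[0]

# Spikes converging to 0 in the L^2 norm while f_n(0) = 1 stays fixed,
# so the evaluation functional E_0 is not continuous on L^2[0, 1].
for n in (10, 100, 1000):
    fn = np.maximum(0.0, 1.0 - n * t)
    l2_norm = np.sqrt(np.sum(fn ** 2) * dt)  # ~ sqrt(1/(3n)) -> 0
    print(n, l2_norm, fn[0])                 # norm shrinks, value at 0 stays 1
```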
A lemma that will be useful later on:
Lemma 1: Let \(\mathcal{H}\) be a Hilbert space and \(L\colon \mathcal{H} \to \mathbb{F}\) be a linear functional. The following statements are equivalent:
\(L\) is continuous.
\(L\) is continuous at \(0.\)
\(L\) is continuous at some point.
\(L\) is bounded, i.e., there is a constant \(c > 0\) such that \(|L(f)| \leq c \left\lVert f\right\rVert\) for every \(f \in \mathcal{H}.\)
It is clear that \((1) \implies (2) \implies (3)\), and \((4) \implies (2).\) Let's show that \((3) \implies (1)\), and \((2) \implies (4).\)
\((3) \implies (1)\): Suppose \(L\) is continuous at \(f \in \mathcal{H}\) and \(g\) is any point in \(\mathcal{H}.\) If \(g_n \to g\) in \(\mathcal{H}\), then \(g_n - g + f \to f.\) By assumption, \(L(g_n - g + f) \to L(f)\), and by linearity \(L(g_n) = L(g_n - g + f) - L(f) + L(g) \to L(g).\) Hence \(L\) is continuous at \(g.\)
\((2) \implies (4)\): The definition of continuity at \(0\) implies that \(L^{-1}(\{\alpha \in \mathbb{F} : |\alpha| < 1\})\) contains an open ball centered at \(0.\) Let \(\delta > 0\) be the radius of that ball. Then \(\left\lVert f\right\rVert < \delta\) implies \(|L(f)| < 1.\) If \(f\) is an arbitrary element of \(\mathcal{H}\) and \(\varepsilon > 0\), then
\[
\left\lVert \frac{\delta}{\lVert f \rVert + \varepsilon} \, f \right\rVert < \delta, \quad \text{so} \quad |L(f)| < \frac{\lVert f \rVert + \varepsilon}{\delta}.
\]
Letting \(\varepsilon \to 0\) gives \(|L(f)| \leq \frac{1}{\delta} \lVert f \rVert\), so \((4)\) holds with \(c = 1/\delta.\)
We generalize the notion of an orthonormal basis, familiar from linear algebra, to the infinite-dimensional case. This will be needed when we discuss Mercer's theorem.
Note, we can also define an orthonormal basis as a maximal orthonormal set in \(\mathcal{H}.\) To say \(\{v_{\alpha}\}_{\alpha \in A}\) is maximal means that no vector of \(\mathcal{H}\) can be added to \(\{v_{\alpha}\}_{\alpha \in A}\) in such a way that the resulting set is still orthonormal. This happens precisely when there is no \(u \neq 0\) in \(\mathcal{H}\) that is orthogonal to every element of \(\{v_{\alpha}\}_{\alpha \in A}.\)
Another key idea we need is that of separability. Let us define that now.
The notion of separability is closely related to the second-countability of a topological space. Recall that a space is second-countable if it has a countable basis for its topology. It is easy to see that a second-countable space is separable. For a metric space, the converse also holds, and thus the two notions are equivalent.
We will be dealing with separable Hilbert spaces in our discussion.
We now come to a very important theorem called the Riesz representation theorem. The name Riesz is attached to several theorems, but the one relevant to us is the following: for every continuous linear functional \(L\) on a Hilbert space \(\mathcal{H}\), there exists a unique \(g \in \mathcal{H}\) such that \(L(f) = \langle f, g \rangle\) for every \(f \in \mathcal{H}.\) I'll skip the proof, as it is not easy and would unnecessarily make this article abstruse.
Side-note: In the mathematical treatment of quantum mechanics, this theorem can be seen as a justification for the popular bra–ket notation.
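In finite dimensions the theorem is easy to see in action. A sketch in \(\mathbb R^n\) (where every linear functional is automatically continuous): the representing vector \(g\) has coordinates \(g_i = L(e_i)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# An arbitrary linear functional on R^n.
coeffs = rng.normal(size=n)
L = lambda f: float(coeffs @ f)

# Riesz representer: g_i = L(e_i).
g = np.array([L(e) for e in np.eye(n)])

# Check L(f) = <f, g> on a random vector.
f = rng.normal(size=n)
print(np.isclose(L(f), float(f @ g)))  # True
```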
Definition 19: Let \(\mathcal{X}\) be a set. We will call a set \(\mathcal{H}\) of functions from \(\mathcal{X}\) to \(\mathbb{F}\) a Reproducing Kernel Hilbert Space (RKHS) on \(\mathcal{X}\) if
\(\mathcal{H}\) is a vector space,
\(\mathcal{H}\) is endowed with an inner product, \(\langle\cdot,\cdot\rangle\), with respect to which \(\mathcal{H}\) is a Hilbert space,
for every \(x \in \mathcal{X}\), the linear evaluation functional \(E_x \colon \mathcal{H} \to \mathbb{F}\) is bounded (or, equivalently, continuous, by Lemma 1).
If \(\mathcal{H}\) is an RKHS on \(\mathcal{X}\), then an application of the Riesz representation theorem shows that each linear evaluation functional is given by the inner product with a unique vector in \(\mathcal{H}.\) Therefore, for each \(x \in \mathcal{X}\), there exists a unique vector \(k_x \in \mathcal{H}\) such that for every \(f \in \mathcal{H}\),
\[
f(x) = E_x(f) = \langle f, k_x \rangle.
\]
The function \(k_x\) is called the reproducing kernel for the point \(x\), and the function \(K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{F}\) defined by \(K(x,y) := k_y(x)\) is called the reproducing kernel of \(\mathcal{H}.\) Note that we have
\[
K(x,y) = k_y(x) = \langle k_y, k_x \rangle.
\]
Also,
\[
\lVert E_y \rVert^2 = \lVert k_y \rVert^2 = \langle k_y, k_y \rangle = K(y,y).
\]
The first question that comes to mind is whether reproducing kernel Hilbert spaces exist at all. The following example answers this question in the affirmative.
We saw before that \(\mathbb{C}^n\) is a Hilbert space. We can show that \(\mathbb{C}^n\) is in fact an RKHS. Let \(\mathcal{X} = \{1, 2, \ldots, n\}\); then we can view \(v \in \mathbb{C}^n\) as a function \(V \colon \mathcal{X} \to \mathbb{C}\), where \(V(j) = v_j.\) The linear evaluation functionals are of course bounded for every \(x \in \mathcal{X}\), and we have
\[
E_j(V) = V(j) = v_j = \langle v, e_j \rangle,
\]
so \(k_j = e_j\), the \(j\)-th standard basis vector, and the reproducing kernel is \(K(i,j) = k_j(i) = \langle e_j, e_i \rangle = \delta_{ij}.\)
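A short verification of the reproducing property in this example (NumPy's `vdot` computes the conjugate-linear inner product):

```python
import numpy as np

n = 3
v = np.array([1 + 2j, 3 - 1j, 0.5j])

# k_j = e_j, and the reproducing property reads V(j) = <v, k_j>.
for j in range(n):
    k_j = np.eye(n)[j]
    assert np.isclose(v[j], np.vdot(k_j, v))  # <v, e_j> = v_j

print("reproducing property verified")
```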
Can there be multiple reproducing kernels for an RKHS? The following theorem answers this question.
We covered quite a lot of ground in this blog post, but I didn't even define a kernel as it is commonly used in machine learning! In the next post I will do that and cover its fundamental properties.
Some resources I recommend to go into more depth on what's covered here are:
Halmos, P.: Finite-Dimensional Vector Spaces.
Rudin, W.: Real and Complex Analysis (Chapter 4).
Rudin, W.: Functional Analysis.
Conway, J.: A Course in Functional Analysis.