The concept of conditional probability is central to probability theory, and excellent treatments of it can be found in many books. My aim with this blog post is to consolidate in one place some ideas around it which helped me form a better intuition. These ideas will be useful if you have already been exposed to this concept in a textbook and just want one more person's ramblings about it.
I will start by defining conditional expectation and stating some of its properties. It would be a grave injustice to claim that my discussion of it is complete since I don't even prove its existence; this section exists solely to establish notation. I will then spend some time discussing conditional probability, relating it to the traditional notion of conditioning on an event.
These discussions will naturally lead to the notions of regular conditional probability and regular conditional distribution which I discuss next.
Recall the concept of conditional expectation.
Theorem 1: Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space, and \(X\) a random variable with \(\mathbb{E}(|X|) < \infty.\) Let \(\mathcal{G}\) be a sub-\(\sigma\)-algebra of \(\mathcal{F}.\) Then there exists a random variable \(Y\) such that
\(Y\) is \(\mathcal{G}\)-measurable,
\(\mathbb{E}(|Y|) < \infty\), and
\(\int_G Y \,\mathrm{d}\mathbb{P} = \int_G X \,\mathrm{d}\mathbb{P}\) for every \(G \in \mathcal{G}.\)
Remarks:
It is easy to see from the \(\pi\)-\(\lambda\) theorem that the last condition can be relaxed: it suffices that \(\int_G Y \,\mathrm{d}\mathbb{P} = \int_G X \,\mathrm{d}\mathbb{P}\) for every \(G\) in some \(\pi\)-system which contains \(\Omega\) and generates \(\mathcal{G}.\)
If \(Y'\) is another random variable with the three properties above then \(Y' = Y\) a.s.. Therefore, \(Y\) in the theorem above is called a version of the conditional expectation. The notation \(\mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right)\) is used to denote this unique (up to a.e. equivalence) random variable.
The proof of this standard theorem can be found in any probability textbook; see (Williams, 1991) or (Kallenberg, 2021) for example.
The fact that conditional expectation is defined as a random variable might come as a surprise, but the correspondence with the traditional usage of conditional expectation as a number becomes clear once you realize that here we are conditioning on a \(\sigma\)-algebra instead of a single event (we haven’t defined what conditioning on an event means, but think of the intuitive meaning for now). For example, consider the life expectancy of a newborn baby conditioned on sex. This is a random variable that takes one value for males and another value for females.
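To make this concrete, here is a minimal numerical sketch in Python (the sample space, the measure, and all values below are invented for illustration): on a finite probability space, a version of \(\mathbb{E}\!\left(\left. X\,\right\vert\, \sigma(T)\right)\) is the random variable that, on each cell \(\{T = t\}\), equals the probability-weighted average of \(X\) over that cell.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                # a finite sample space Omega
p = rng.dirichlet(np.ones(n))         # P({omega}) for each outcome
lifespan = rng.normal(75, 10, size=n) # X: "life expectancy", invented numbers
sex = rng.integers(0, 2, size=n)      # T: the conditioning map, two cells

# E(X | sigma(T)): one value per cell {T = t}, the weighted average there.
cond_exp = np.empty(n)
for t in np.unique(sex):
    cell = sex == t
    cond_exp[cell] = np.sum(p[cell] * lifespan[cell]) / np.sum(p[cell])

# Defining property from Theorem 1: integrals over every G in sigma(T) agree.
for t in np.unique(sex):
    cell = sex == t
    assert np.isclose(np.sum(p[cell] * cond_exp[cell]),
                      np.sum(p[cell] * lifespan[cell]))
```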
For completeness I state some useful properties of conditional expectation. You can find the proofs in (Williams, 1991) or (Kallenberg, 2021), for example. Most of them are parallels to the well-known properties of (unconditional) expectation. Assume that all the \(X\)'s satisfy \(\mathbb{E}(|X|) < \infty\) and let \(\mathcal{G}, \mathcal{H}\) be sub-\(\sigma\)-algebras of \(\mathcal{F}.\)
[Linearity] \(\mathbb{E}\!\left(\left. a_1 X_1 + a_2 X_2\,\right\vert\, \mathcal{G}\right) = a_1 \mathbb{E}\!\left(\left. X_1\,\right\vert\, \mathcal{G}\right) + a_2 \mathbb{E}\!\left(\left. X_2\,\right\vert\, \mathcal{G}\right)\) a.s. for real numbers \(a_1\) and \(a_2.\)
[Positivity] If \(X \ge 0\), then \(\mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right) \ge 0\) a.s..
[Monotone convergence theorem for conditional expectation] If \(\mathbb{E}(|Y|) < \infty\) and \(Y \le X_n \uparrow X\) a.s., then \(\mathbb{E}\!\left(\left. X_n\,\right\vert\, \mathcal{G}\right) \uparrow \mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right)\) a.s..
[Fatou's lemma for conditional expectation] If \(\mathbb{E}(|Y|) < \infty\) and \(Y \le X_n\) for all \(n \ge 1\) a.s., then \(\mathbb{E}\!\left(\left. \liminf_{n \to \infty} X_n\,\right\vert\, \mathcal{G}\right) \le \liminf_{n \to \infty}\mathbb{E}\!\left(\left. X_n\,\right\vert\, \mathcal{G}\right)\) a.s..
[Dominated convergence theorem for conditional expectation] If \(|X_n| \le |Y|\) for all \(n \ge 1\), \(\mathbb{E}(|Y|) < \infty\), and \(X_n \to X\) a.s., then \(\mathbb{E}\!\left(\left. X_n\,\right\vert\, \mathcal{G}\right) \to \mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right)\) a.s..
[Tower property] If \(\mathcal{H} \subseteq \mathcal{G}\), then \(\mathbb{E}\!\left(\left. \mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right)\,\right\vert\, \mathcal{H}\right) = \mathbb{E}\!\left(\left. \mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{H}\right)\,\right\vert\, \mathcal{G}\right)=\mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{H}\right)\) a.s. (see the numerical sanity check after this list).
[Taking out what's known] If \(Y\) is \(\mathcal{G}\)-measurable and bounded, then \[\begin{aligned} \mathbb{E}\!\left(\left. YX\,\right\vert\, \mathcal{G}\right) = Y \mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right) \text{ a.s.}\end{aligned} \tag{1}\] If \(p > 1\), \(1/p + 1/q = 1\), \(X \in L^p(\Omega, \mathcal{F}, \mathbb{P})\) and \(Y \in L^q(\Omega, \mathcal{G}, \mathbb{P})\), then (1) again holds. If \(X\) is a nonnegative \(\mathcal{F}\)-measurable random variable, \(Y\) is a nonnegative \(\mathcal{G}\)-measurable random variable, \(\mathbb{E}(X) < \infty\) and \(\mathbb{E}(XY) < \infty\), then (1) holds as well.
[Role of independence] If \(\mathcal{H}\) is independent of \(\sigma(\sigma(X) \cup \mathcal{G})\), then \(\mathbb{E}\!\left(\left. X\,\right\vert\, \sigma(\mathcal{G} \cup \mathcal H)\right) = \mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right)\) a.s.. In particular, if \(X\) is independent of \(\mathcal{H}\), then \(\mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{H}\right) = \mathbb{E}(X)\) a.s..
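Here is the promised numerical sanity check of the tower property and of taking out what's known: a finite toy example in Python, where the measure and the maps \(T\) (generating \(\mathcal{G}\)) and \(S\) (generating \(\mathcal{H} \subseteq \mathcal{G}\)) are all invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
p = rng.dirichlet(np.ones(n))    # P({omega}) on a finite Omega
X = rng.normal(size=n)           # an integrable random variable
T = rng.integers(0, 4, size=n)   # generates G = sigma(T)
S = T // 2                       # coarser map: H = sigma(S) is contained in G

def cond_exp(Z, labels):
    """A version of E(Z | sigma(labels)): cell-wise weighted average."""
    out = np.empty(n)
    for t in np.unique(labels):
        cell = labels == t
        out[cell] = np.sum(p[cell] * Z[cell]) / np.sum(p[cell])
    return out

# Tower property: E( E(X|G) | H ) = E(X|H) ...
assert np.allclose(cond_exp(cond_exp(X, T), S), cond_exp(X, S))
# ... and E( E(X|H) | G ) = E(X|H), since E(X|H) is already G-measurable.
assert np.allclose(cond_exp(cond_exp(X, S), T), cond_exp(X, S))
```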
Definition 2: Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space, \(\mathcal{G}\) a sub-\(\sigma\)-algebra of \(\mathcal{F}\), and \(A \in \mathcal{F}.\) The conditional probability of \(A\) given \(\mathcal{G}\) is defined as \(\mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right) := \mathbb{E}\!\left(\left. \mathbf{1}_A\,\right\vert\, \mathcal{G}\right).\) When conditioning on a random variable \(Z\) or on an event \(B\), we write \(\mathbb{P}(A \mid Z)\) and \(\mathbb{P}(A \mid B)\) for \(\mathbb{P}\!\left(\left. A\,\right\vert\, \sigma(Z)\right)\) and \(\mathbb{P}\!\left(\left. A\,\right\vert\, \sigma(\mathbf{1}_B)\right)\) respectively.
Just like conditional expectation, conditional probability, as defined above, is a random variable! Unlike conditional expectation this isn't very palpable and deserves more rumination (Halmos, 1950). We have our probability space \((\Omega, \mathcal{F}, \mathbb{P})\), and let \(A, B \in \mathcal{F}\) be such that \(\mathbb{P}(B) \neq 0\) and \(\mathbb{P}(B^\mathsf{c}) \neq 0.\) Then our traditional notion of conditional probability tells us that the conditional probability of \(A\) given \(B\) is defined by \[\begin{aligned} \mathbb{P}_B(A) := \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.\end{aligned}\]
Let us investigate how \(\mathbb{P}_B(A)\) depends on \(B.\) To this end, introduce the discrete measurable space \((\Lambda, 2^\Lambda)\) with \(\Lambda = \{\lambda_1, \lambda_2\}\), and a measurable mapping \(T \colon \Omega \to \Lambda\) such that \[\begin{aligned} T(\omega) = \begin{cases} \lambda_1 & \text{if } \omega \in B, \\ \lambda_2 & \text{if } \omega \in B^\mathsf{c}. \end{cases}\end{aligned}\]
Define the two measures \(\nu_A\) and \(\nu\) on \((\Lambda, 2^\Lambda)\) as follows: for any \(E \subseteq \Lambda\), \[\begin{aligned} \nu_A(E) = \mathbb{P}(A \cap T^{-1}(E)), \qquad \nu(E) = \mathbb{P}(T^{-1}(E)).\end{aligned}\]
Then it is easy to see that the function \(p_A \colon \Lambda \to [0,1]\) given by \[\begin{aligned} p_A(\lambda_1) = \frac{\nu_A(\{\lambda_1\})}{\nu(\{\lambda_1\})} = \mathbb{P}_B(A), \qquad p_A(\lambda_2) = \frac{\nu_A(\{\lambda_2\})}{\nu(\{\lambda_2\})} = \mathbb{P}_{B^\mathsf{c}}(A)\end{aligned}\] satisfies \(\nu_A(E) = \int_E p_A \,\mathrm{d}\nu\) for every \(E \subseteq \Lambda.\)
In other words, conditional probability may be viewed as a measurable function on \(\Lambda\), namely the Radon-Nikodym derivative \(p_A = \mathrm{d}\nu_A / \mathrm{d}\nu.\)
This can easily be generalized to any finite setting as follows. Let \(\{A_1, \ldots, A_n\} \subseteq \mathcal{F}\) be a partition of \(\Omega\) with \(\mathbb{P}(A_i) \neq 0\) for each \(i\), i.e., \(A_i \cap A_j = \varnothing\) for \(i \neq j\) and \(\bigcup_i A_i = \Omega.\) Introduce the discrete measurable space \((\Lambda, 2^\Lambda)\) with \(\Lambda = \{\lambda_1, \ldots, \lambda_n\}.\) Define a measurable mapping \(T \colon \Omega \to \Lambda\) such that \(T(\omega) = \lambda_i\) whenever \(\omega \in A_i.\) For a fixed \(A \in \mathcal{F}\), define the measures \(\nu_A\) and \(\nu\) on \((\Lambda, 2^\Lambda)\) exactly as before: for any \(E \subseteq \Lambda\), \[\begin{aligned} \nu_A(E) = \mathbb{P}(A \cap T^{-1}(E)), \qquad \nu(E) = \mathbb{P}(T^{-1}(E)).\end{aligned}\] Then \[\begin{aligned} p_A(\lambda_i) = \frac{\nu_A(\{\lambda_i\})}{\nu(\{\lambda_i\})} = \mathbb{P}_{A_i}(A), \qquad i = 1, \ldots, n.\end{aligned}\]
These considerations are what motivated the definition of conditional probability in the general case, which you saw in Definition 2. If \(T\) is any measurable mapping from \((\Omega, \mathcal{F}, \mathbb{P})\) into an arbitrary measurable space \((\Lambda, \mathcal{L})\), and if we write \(\nu_A(E) = \mathbb{P}(A \cap T^{-1}(E))\) where \(A \in \mathcal{F}\) and \(E \in \mathcal{L}\), then it is clear that \(\nu_A\) and \(\mathbb{P} \circ T^{-1}\) are measures on \(\mathcal{L}\) such that \(\nu_A \ll \mathbb{P} \circ T^{-1}.\) The Radon-Nikodym theorem now implies that there exists a \(\mathbb{P} \circ T^{-1}\)-integrable function \(p_A\), unique up to \(\mathbb{P} \circ T^{-1}\)-a.e. equivalence, such that \[\begin{aligned} \nu_A(E) = \int_E p_A \,\mathrm{d}(\mathbb{P} \circ T^{-1}) \quad \text{for every } E \in \mathcal{L}.\end{aligned}\]
We anoint \(p_A(\lambda)\) as the conditional probability of \(A\) given \(\lambda \in \Lambda\) or the conditional probability of \(A\) given that \(T(\omega) = \lambda.\) Note that here we are conditioning on a measurable mapping \(T\) instead of a sub-\(\sigma\)-algebra, but this notion is related to conditioning on \(\sigma(T)\) as will become clear ahead. Keep this "rumination" in mind when we discuss regular conditional distribution later.
Let's look at our definition of conditional probability from the other direction and show that \(\mathbb{P}(A \mid B)\) as defined in Definition 2 conforms to our traditional usage. To start, note that \(\sigma(\mathbf{1}_B) = \{\varnothing, B, B^\mathsf{c}, \Omega\}\), and since \(\mathbb{P}(A \mid B)\) is \(\sigma(\mathbf{1}_B)\)-measurable, it must be constant on each of the sets \(B, B^\mathsf{c}\), thereby necessitating \[\begin{aligned} \mathbb{P}(A \mid B)(\omega) = \begin{cases} \dfrac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} & \text{if } \omega \in B, \\ \dfrac{\mathbb{P}(A \cap B^\mathsf{c})}{\mathbb{P}(B^\mathsf{c})} & \text{if } \omega \in B^\mathsf{c}, \end{cases}\end{aligned}\] since the defining property \(\int_G \mathbb{P}(A \mid B) \,\mathrm{d}\mathbb{P} = \mathbb{P}(A \cap G)\) applied with \(G = B\) and \(G = B^\mathsf{c}\) pins down the two constant values. On \(B\), this is exactly the traditional \(\mathbb{P}_B(A).\)
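Both pictures, the Radon-Nikodym ratio on the finite \(\Lambda\) and the two-valued random variable above, are easy to verify numerically on a finite space. Here is a toy sketch in Python; the partition and the event \(A\) are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
p = rng.dirichlet(np.ones(n))        # P({omega}) on a finite Omega
labels = rng.integers(0, 3, size=n)  # T(omega) = lambda_i; cells A_i = {T = lambda_i}
A = rng.random(n) < 0.5              # an arbitrary event A (boolean indicator)

total = 0.0
for i in np.unique(labels):
    cell = labels == i
    nu_A = np.sum(p[cell & A])   # nu_A({lambda_i}) = P(A intersect A_i)
    nu = np.sum(p[cell])         # nu({lambda_i})   = P(A_i)
    p_A = nu_A / nu              # Radon-Nikodym derivative at lambda_i = P(A | A_i)
    total += p_A * nu            # integrating p_A against nu ...
    print(f"p_A(lambda_{i}) = {p_A:.4f}")

assert np.isclose(total, np.sum(p[A]))   # ... recovers nu_A(Lambda) = P(A)
```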
The positivity property (property 2 above) and the monotone convergence property (property 3 above) imply that \(0 \le \mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right) \le 1\) a.s. for any \(A \in \mathcal{F}\), that \(\mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right) = 0\) a.s. if and only if \(\mathbb{P}(A) = 0\), and that \(\mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right) = 1\) a.s. if and only if \(\mathbb{P}(A) = 1.\)
Let \(A_1, A_2, \ldots \in \mathcal{F}\) be a sequence of disjoint sets. By linearity (property 1 above) and the monotone convergence theorem for conditional expectation (property 3 above), we see that
\[\begin{aligned} \mathbb{P}\!\left(\left. \bigcup_n A_n\,\right\vert\, \mathcal{G}\right) = \sum_n \mathbb{P}\!\left(\left. A_n\,\right\vert\, \mathcal{G}\right) \quad \text{a.s.}\end{aligned} \tag{2}\] If \(A_n \in \mathcal{F}\) for \(n \ge 1\) and \(\lim_{n \to \infty} A_n = A\), then, by the dominated convergence theorem for conditional expectation (property 5 above), we also have \(\lim_{n \to \infty} \mathbb{P}\!\left(\left. A_n\,\right\vert\, \mathcal{G}\right) = \mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right)\) a.s..
It seems very tempting from the foregoing discussion to claim that \(\mathbb{P}\!\left(\left. \cdot\,\right\vert\, \mathcal{G}\right)(\omega)\) is a probability measure on \(\mathcal{F}\) for almost all \(\omega \in \Omega\), but except for some nice spaces, which we will discuss below, this isn't true. Let us first try to see this intuitively (Chow and Teicher, 1997). Equation (2) holds for all \(\omega \in \Omega\) EXCEPT for some null set which may well depend on the particular sequence \(\{A_n\}_{n \in \mathbb N}.\) It does NOT stipulate that there exists a fixed null set \(N \in \mathcal{F}\) such that \[\begin{aligned} \mathbb{P}\!\left(\left. \bigcup_n A_n\,\right\vert\, \mathcal{G}\right)(\omega) = \sum_n \mathbb{P}\!\left(\left. A_n\,\right\vert\, \mathcal{G}\right)(\omega) \quad \text{for all } \omega \in \Omega \setminus N \text{ and all disjoint } \{A_n\}_{n \in \mathbb N} \subseteq \mathcal{F}.\end{aligned}\] Since there are typically uncountably many such sequences, the union of the sequence-dependent null sets need not be a null set.
Motivated by our discussion above, we define regular conditional probability as follows.
Definition 3: Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space and \(\mathcal{G}, \mathcal{H}\) be sub-\(\sigma\)-algebras of \(\mathcal{F}.\) A regular conditional probability on \(\mathcal{H}\) given \(\mathcal{G}\) is a function \(\mathbb{P}(\cdot, \cdot) \colon \Omega \times \mathcal{H} \to [0,1]\) such that
for a.e. \(\omega \in \Omega\), \(\mathbb{P}(\omega, \cdot)\) is a probability measure on \(\mathcal{H}\),
for each \(A \in \mathcal{H}\), \(\mathbb{P}(\cdot, A)\) is a \(\mathcal{G}\)-measurable function on \(\Omega\) coinciding with the conditional probability of \(A\) given \(\mathcal{G}\), i.e., \(\mathbb{P}(\cdot, A) = \mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right)\) a.s..
Let's show that this definition is not outrageous by showing that it agrees with our traditional notions of conditional pdf and conditional expectation. To that end, suppose that \(X\) and \(Y\) are random variables which have a joint probability density function \(f_{X,Y}(x,y).\) This means that we are considering the probability space \((\mathbb R^2, \mathcal{B}(\mathbb R^2), \mathbb{P})\) with \(X\) and \(Y\) being the coordinate random variables, i.e. \((x,y) \mapsto x\) and \((x,y) \mapsto y\) respectively, and having an absolutely continuous distribution function \(F_{X,Y}(x,y)\) such that \[\begin{aligned} F_{X,Y}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u,v) \,\mathrm{d}v \,\mathrm{d}u.\end{aligned}\]
Let \(\mathcal{G} = \sigma(X)\) and \(\mathcal{H} = \mathcal{B}(\mathbb R^2).\) Write \(f_X(x) = \int_{\mathbb R} f_{X,Y}(x,v) \,\mathrm{d}v\) for the marginal density, and let \(f_{Y \mid X}(y \mid x) = f_{X,Y}(x,y) / f_X(x)\) be the traditional conditional pdf, defined wherever \(f_X(x) > 0.\) For \(\omega = (x,y) \in \mathbb R^2\) and \(A \in \mathcal{H}\), set \[\begin{aligned} \mathbb{P}(\omega, A) = \int_{\mathbb R} \mathbf{1}_A(x, v)\, f_{Y \mid X}(v \mid x) \,\mathrm{d}v.\end{aligned}\] For a.e. fixed \(\omega\) (namely those with \(f_X(x) > 0\)) this is a probability measure in \(A\); for fixed \(A\) it is a function of \(x\) alone and hence \(\mathcal{G}\)-measurable; and Fubini's theorem gives \(\int_G \mathbb{P}(\cdot, A) \,\mathrm{d}\mathbb{P} = \mathbb{P}(A \cap G)\) for every \(G \in \mathcal{G}\), and so \(\mathbb{P}(\cdot, A) = \mathbb{E}\!\left(\left. \mathbf{1}_A\,\right\vert\, \mathcal{G}\right)= \mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right)\) a.s.. Hence, \(\mathbb{P}(\omega, A)\) is a regular conditional probability on \(\mathcal{H}\) given \(\mathcal{G}\), and it is precisely the one induced by the traditional conditional pdf.
For the corresponding analysis for conditional expectation, let \(h\) be a Borel function on \(\mathbb R^2\) such that \(\mathbb{E}(|h(X,Y)|) < \infty.\) Then the same computation yields the familiar formula \[\begin{aligned} \mathbb{E}\!\left(\left. h(X,Y)\,\right\vert\, \sigma(X)\right) = \int_{\mathbb R} h(X, v)\, f_{Y \mid X}(v \mid X) \,\mathrm{d}v \quad \text{a.s.}\end{aligned}\]
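As a numerical illustration of this formula (the density below is my own toy choice, not from the text): take \((X, Y)\) standard bivariate normal with correlation \(\rho\), so that \(\mathbb{E}(Y \mid X = x) = \rho x\) in closed form, and compare that against a direct quadrature of \(\int h(x,v)\, f_{Y \mid X}(v \mid x) \,\mathrm{d}v\) with \(h(x,y) = y.\)

```python
import numpy as np

rho = 0.6

def f_joint(x, y):
    """Standard bivariate normal density with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

ys = np.linspace(-10, 10, 4001)   # quadrature grid for the y-integral

def cond_exp_h(h, x):
    """E(h(X,Y) | X = x): integrate h(x, .) against f_{Y|X}(. | x)."""
    fy = f_joint(x, ys)
    return np.trapz(h(x, ys) * fy, ys) / np.trapz(fy, ys)  # divide by f_X(x)

for x in (-1.0, 0.0, 2.0):
    est = cond_exp_h(lambda x, y: y, x)
    print(f"x={x:+.1f}: numeric {est:+.4f}, closed form {rho * x:+.4f}")
```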
In general, we have the following useful theorem (taken from (Chow and Teicher, 1997)) which allows us to view conditional expectations as ordinary expectations relative to the measure induced by regular conditional probability.
Theorem 2: Let \(\mathbb{P}(\cdot, \cdot)\) be a regular conditional probability on \(\mathcal{H}\) given \(\mathcal{G}\), and let \(X\) be an \(\mathcal{H}\)-measurable random variable with \(\mathbb{E}(|X|) < \infty.\) Then \[\begin{aligned} \mathbb{E}\!\left(\left. X\,\right\vert\, \mathcal{G}\right)(\omega) = \int_\Omega X(\omega')\, \mathbb{P}(\omega, \mathrm{d}\omega') \quad \text{a.s.}\end{aligned} \tag{3}\]
Recall the monotone class theorem for functions: if \(\mathscr{H}\) is a collection of nonnegative \(\mathcal{H}\)-measurable functions on \(\Omega\) which contains \(\mathbf{1}_A\) for every \(A \in \mathcal{H}\), is closed under nonnegative linear combinations, and is closed under monotone increasing limits, then \(\mathscr{H}\) contains every nonnegative \(\mathcal{H}\)-measurable function.
By separate considerations of \(X^+\) and \(X^-\), it may be supposed that \(X \ge 0.\) Define \[\begin{aligned} \mathscr{H} := \{ X \ge 0 \colon X \text{ is } \mathcal{H}\text{-measurable and Equation (3) holds for } X \}.\end{aligned}\] For every \(A \in \mathcal{H}\) we have \(\mathbf{1}_A \in \mathscr{H}\), since \(\int_\Omega \mathbf{1}_A(\omega')\, \mathbb{P}(\omega, \mathrm{d}\omega') = \mathbb{P}(\omega, A) = \mathbb{P}\!\left(\left. A\,\right\vert\, \mathcal{G}\right)(\omega) = \mathbb{E}\!\left(\left. \mathbf{1}_A\,\right\vert\, \mathcal{G}\right)(\omega)\) a.s. by the very definition of regular conditional probability.
If \(X_1, X_2 \in \mathscr{H}\) and \(c_1, c_2 \ge 0\), then \(c_1 X_1 + c_2 X_2 \ge 0\), \(c_1 X_1 + c_2 X_2\) is \(\mathcal{H}\)-measurable, and Equation (3) holds for it by linearity of the integral and of conditional expectation; thus \(c_1 X_1 + c_2 X_2 \in \mathscr{H}.\) If \(\{X_n\}_{n \in \mathbb N} \subseteq \mathscr{H}\) and \(X_n \uparrow X\), then \(X \ge 0\), \(X\) is \(\mathcal{H}\)-measurable, and Equation (3) holds for \(X\) by the monotone convergence theorem for integrals and for conditional expectation; thus \(X \in \mathscr{H}.\) Therefore, by the monotone class theorem, \(\mathscr{H}\) contains all nonnegative \(\mathcal{H}\)-measurable functions, in particular our \(X\), completing the proof.
In some cases even the concept of regular conditional probability is inadequate, and that motivates the concept of regular conditional distributions.
Definition 4: Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space, \(\mathcal{G} \subseteq \mathcal{F}\) a \(\sigma\)-algebra, \((\Lambda, \mathcal{L})\) a measurable space, and \(T \colon \Omega \to \Lambda\) a measurable mapping. A regular conditional distribution for \(T\) given \(\mathcal{G}\) is a function \(\mathbb{P}_T \colon \Omega \times \mathcal{L} \to [0,1]\) such that
for a.e. \(\omega \in \Omega\), \(\mathbb{P}_T(\omega, \cdot)\) is a probability measure on \(\mathcal{L}\),
for each \(A \in \mathcal{L}\), \(\mathbb{P}_T(\cdot, A)\) is a \(\mathcal{G}\)-measurable function on \(\Omega\) coinciding with the conditional probability of \(T^{-1}(A)\) given \(\mathcal{G}\), i.e., \(\mathbb{P}_T(\cdot, A) = \mathbb{P}\!\left(\left. T^{-1}(A)\,\right\vert\, \mathcal{G}\right)\) a.s..
It is clear that when \(\Lambda = \Omega\), \(\mathcal{L} = \mathcal{H} \subseteq \mathcal{F}\) and \(T\) is the identity map, \(\mathbb{P}_T\) is exactly the regular conditional probability as defined in Definition 3.
Now would be a good time to reread the first “rumination” in the section Conditional Probability and realize that the definition of regular conditional distribution is in fact well motivated.
A corresponding version of Theorem 2 exists, the proof of which I'll leave as an easy exercise: if \(\mathbb{P}_T\) is a regular conditional distribution for \(T\) given \(\mathcal{G}\) and \(h\) is a measurable function on \((\Lambda, \mathcal{L})\) with \(\mathbb{E}(|h(T)|) < \infty\), then \[\begin{aligned} \mathbb{E}\!\left(\left. h(T)\,\right\vert\, \mathcal{G}\right)(\omega) = \int_\Lambda h(\lambda)\, \mathbb{P}_T(\omega, \mathrm{d}\lambda) \quad \text{a.s.}\end{aligned}\]
To see the power of thinking about conditional probabilities like this, let's give an unbelievably short proof of the conditional Hölder inequality that I took from (Chow and Teicher, 1997); contrast it with other proofs. Let \(p, q > 1\) with \(1/p + 1/q = 1\), and let \(X, Y\) be random variables with \(\mathbb{E}(|X|^p) < \infty\) and \(\mathbb{E}(|Y|^q) < \infty.\) Set \(T = (X, Y) \colon \Omega \to \mathbb R^2\) and let \(\mathbb{P}_T\) be a regular conditional distribution for \(T\) given \(\mathcal{G}\); it exists by Theorem 5 below since \(\mathbb R^2\) is a nice space. By the exercise above, for a.e. \(\omega\), \[\begin{aligned} \mathbb{E}\!\left(\left. |XY|\,\right\vert\, \mathcal{G}\right)(\omega) &= \int_{\mathbb R^2} |xy|\, \mathbb{P}_T(\omega, \mathrm{d}(x,y)) \\ &\le \left( \int_{\mathbb R^2} |x|^p\, \mathbb{P}_T(\omega, \mathrm{d}(x,y)) \right)^{1/p} \left( \int_{\mathbb R^2} |y|^q\, \mathbb{P}_T(\omega, \mathrm{d}(x,y)) \right)^{1/q} \\ &= \left(\mathbb{E}\!\left(\left. |X|^p\,\right\vert\, \mathcal{G}\right)(\omega)\right)^{1/p} \left(\mathbb{E}\!\left(\left. |Y|^q\,\right\vert\, \mathcal{G}\right)(\omega)\right)^{1/q},\end{aligned}\] where the middle inequality is just the ordinary Hölder inequality applied to the probability measure \(\mathbb{P}_T(\omega, \cdot).\)
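Trusting a slick proof is easier after a numerical poke, so here is a finite toy check (invented numbers) of the conditional Hölder inequality with \(p = q = 2\), i.e., the conditional Cauchy-Schwarz inequality.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 16
p = rng.dirichlet(np.ones(n))         # P({omega}) on a finite Omega
X = rng.normal(size=n)
Y = rng.normal(size=n)
labels = rng.integers(0, 4, size=n)   # G = sigma(labels)

def cond_exp(Z):
    """A version of E(Z | G): cell-wise weighted average."""
    out = np.empty(n)
    for t in np.unique(labels):
        cell = labels == t
        out[cell] = np.sum(p[cell] * Z[cell]) / np.sum(p[cell])
    return out

lhs = cond_exp(np.abs(X * Y))                  # E(|XY| | G)
rhs = np.sqrt(cond_exp(X**2) * cond_exp(Y**2)) # E(X^2|G)^{1/2} E(Y^2|G)^{1/2}
assert np.all(lhs <= rhs + 1e-12)              # holds pointwise on Omega
```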
Before we discuss their existence, let us define the concept of a standard Borel space (Encyclopedia of Mathematics). Below, two measurable spaces are called isomorphic if there is a bijection between them which is measurable and has a measurable inverse.
Definition 6: A measurable space \((X, \mathcal{X})\) is called a standard Borel space if it satisfies any of the following equivalent conditions:
\((X, \mathcal{X})\) is isomorphic to some compact metric space with the Borel \(\sigma\)-algebra.
\((X, \mathcal{X})\) is isomorphic to some Polish space (i.e., a separable complete metric space) with the Borel \(\sigma\)-algebra.
\((X, \mathcal{X})\) is isomorphic to some Borel subset of some Polish space with the Borel \(\sigma\)-algebra.
As you can guess, most spaces we deal with are standard Borel spaces. (Durrett, 2019) calls these spaces nice since we already have too many things named after Borel. I am not sure I agree with his reasoning, but I like Durrett's terminology.
The next two theorems show the existence of regular conditional distributions and are taken from [(Durrett, 2019), Section 4.1.3]. See also [(Parthasarathy, 1967), Section V.8].
Theorem 4: Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space such that \((\Omega, \mathcal{F})\) is a nice space, and let \(\mathcal{G}\) be a sub-\(\sigma\)-algebra of \(\mathcal{F}.\) Then a regular conditional probability on \(\mathcal{F}\) given \(\mathcal{G}\) exists.
Theorem 5: Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space, \(\mathcal{G}\) a sub-\(\sigma\)-algebra of \(\mathcal{F}\), and \(T\) a measurable mapping from \(\Omega\) into a nice space \((\Lambda, \mathcal{L}).\) Then a regular conditional distribution for \(T\) given \(\mathcal{G}\) exists.
A generalization of the last theorem:
Theorem 6: Suppose \((\Lambda, \mathcal{L})\) is a nice space, \(T\) and \(S\) are measurable mappings from \(\Omega\) to \(\Lambda\), and \(\mathcal{G} = \sigma(S).\) Then there exists a function \(\mu \colon \Lambda \times \mathcal{L} \to [0,1]\) such that
for a.e. \(\omega \in \Omega\), \(\mu(S(\omega), \cdot)\) is a probability measure on \(\mathcal{L}\), and
for each \(A \in \mathcal{L}\), \(\mu(S(\cdot), A) = \mathbb{P}(T^{-1}(A)\mid\mathcal{G})\) a.s..
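A rough empirical rendering of Theorem 6 (the distributions below are my own choice, purely for illustration): since \(\mathcal{G} = \sigma(S)\), the kernel \(\mu\) depends on \(\omega\) only through \(s = S(\omega)\), so it can be estimated by binning samples of \((S, T)\) on the value of \(S.\)

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)
S = rng.integers(0, 3, size=200_000)  # the conditioning variable
T = rng.normal(loc=S - 1.0)           # T | S = s  ~  N(s - 1, 1), by construction

for s in (0, 1, 2):
    est = np.mean(T[S == s] <= 0.0)                # estimate of mu(s, (-inf, 0])
    exact = 0.5 * (1 + erf((1 - s) / sqrt(2)))     # Phi(0 - (s - 1)) = Phi(1 - s)
    print(f"mu({s}, (-inf, 0]) ~ {est:.3f}   (exact {exact:.3f})")
```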
It is instructive to prove Theorem 5 in the special case when \((\Lambda, \mathcal{L}) = (\mathbb R^n, \mathcal{B}(\mathbb R^n)).\) The theorem and the proof are taken from (Chow and Teicher, 1997).
Let's recall the definition of an \(n\)-dimensional distribution function on \(\mathbb R^n\) (we use the Russian convention of left-continuous distribution functions, i.e., \(F(x_1, \ldots, x_n) = \mu\!\left(\prod_{i=1}^n (-\infty, x_i)\right)\) for the associated probability measure \(\mu\)).
We will try to construct such a distribution function on \(\mathbb R^n\) for each \(\omega.\) To this end, for rational numbers \(r_1, \ldots, r_n\) and \(\omega \in \Omega\), define \[\begin{aligned} F_n^\omega(r_1, \ldots, r_n) := \mathbb{P}\!\left(\left. \bigcap_{i=1}^n \{T_i < r_i\} \,\right\vert\, \mathcal{G}\right)(\omega),\end{aligned} \tag{4}\] where \(T = (T_1, \ldots, T_n).\)
It's evident from the properties of conditional probability discussed above that there is a null set \(N \in \mathcal{G}\) such that for every \(\omega \in \Omega \setminus N\) and all rational numbers \(r_i, r_i', q_{i,m}\) the following hold:
the increments of \(F_n^\omega\) over rational rectangles are nonnegative; in particular \(F_n^\omega(r_1, \ldots, r_n) \le F_n^\omega(r_1', \ldots, r_n')\) whenever \(r_i \le r_i'\) for all \(i\),
\(F_n^\omega(r_1, \ldots, r_n) \uparrow 1\) as all \(r_i \uparrow \infty\), and \(F_n^\omega(r_1, \ldots, r_n) \downarrow 0\) as any \(r_i \downarrow -\infty\),
\(F_n^\omega(r_1, \ldots, q_{i,m}, \ldots, r_n) \uparrow F_n^\omega(r_1, \ldots, r_i, \ldots, r_n)\) whenever \(q_{i,m} \uparrow r_i\) (left-continuity along the rationals).
Each of these statements involves only countably many tuples of rationals, so the countably many exceptional null sets furnished by the properties of conditional probability can be collected into a single null set \(N.\) For real numbers \(x_1, \ldots, x_n\), now define \[\begin{aligned} F_n^\omega(x_1, \ldots, x_n) := \lim_{\substack{r_i \uparrow x_i \\ r_i \in \mathbb{Q}}} F_n^\omega(r_1, \ldots, r_n)\end{aligned} \tag{5}\] for \(\omega \in \Omega \setminus N\), and let \(F_n^\omega\) be some fixed \(n\)-dimensional distribution function for \(\omega \in N.\)
Then for each \(\omega \in \Omega\), \(F_n^\omega(x_1, \ldots, x_n)\) is an \(n\)-dimensional distribution function and hence determines a Lebesgue-Stieltjes measure \(\mu_\omega\) on \(\mathcal{B}(\mathbb R^n)\) with \(\mu_\omega(\mathbb R^n) = 1.\) For \(B \in \mathcal{B}(\mathbb R^n)\) and \(\omega \in \Omega\) define \[\begin{aligned} \mathbb{P}_T(\omega, B) := \mu_\omega(B).\end{aligned}\] When \(B\) is a rational rectangle, \(\mathbb{P}_T(\cdot, B)\) is \(\mathcal{G}\)-measurable and equals \(\mathbb{P}\!\left(\left. T^{-1}(B)\,\right\vert\, \mathcal{G}\right)\) a.s. by construction, and the \(\pi\)-\(\lambda\) theorem extends both assertions to every \(B \in \mathcal{B}(\mathbb R^n)\); thus \(\mathbb{P}_T\) is a regular conditional distribution for \(T\) given \(\mathcal{G}.\)
In fact, this theorem is easily extended to \((\mathbb R^\infty, \mathcal{B}(\mathbb R^\infty))\) as follows: For all \(n \ge 1\), define \(F_n^\omega\) as in Equation (4). Select the null set \(N \in \mathcal{G}\) such that in addition to the conditions it satisfies above we also have the consistency condition \[\begin{aligned} \lim_{r_{n+1} \to \infty} F_{n+1}^\omega(r_1, \ldots, r_n, r_{n+1}) = F_n^\omega(r_1, \ldots, r_n) \quad \text{for all } n \ge 1 \text{ and all rational } r_1, \ldots, r_n.\end{aligned}\]
For reals \(x_1, \ldots, x_n\), define \(F_n^\omega\) just like Equation (5). Then for each \(\omega \in \Omega\), \(\{F_n^\omega, \, n \ge 1\}\) is a consistent family of distribution functions, and hence by the Kolmogorov extension theorem there exists a unique measure \(\mu_\omega\) on \((\mathbb R^\infty, \mathcal{B}(\mathbb R^\infty))\) whose finite-dimensional distributions are \(\{F_n^\omega, \, n \ge 1\}.\) Define \(\mathbb{P}_T(\omega, B) = \mu_\omega(B)\) for \(B \in \mathcal{B}(\mathbb R^\infty).\) If \(B\) is a finite-dimensional rectangle with rational coordinates, then \(\mathbb{P}_T(\cdot, B)\) is \(\mathcal{G}\)-measurable and coincides with \(\mathbb{P}\!\left(\left. T^{-1}(B)\,\right\vert\, \mathcal{G}\right)\) a.s. by construction, and, exactly as before, the \(\pi\)-\(\lambda\) theorem extends this to all of \(\mathcal{B}(\mathbb R^\infty).\)
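To make the construction less abstract, here is a one-dimensional finite toy version in Python. Everything below (the space, the measure, the grid standing in for the rationals) is my own invention to mirror the proof; on a finite space the exceptional null set \(N\) is empty, so the distribution-function properties can be checked on every \(\mathcal{G}\)-atom.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 12
p = rng.dirichlet(np.ones(n))        # P({omega}) on a finite Omega
T = rng.normal(size=n)               # a real-valued T
labels = rng.integers(0, 2, size=n)  # G = sigma(labels)

grid = np.linspace(-3.0, 3.0, 25)    # stand-in for the rationals

def F_omega(cell):
    """F^omega(r) = P(T < r | G)(omega) for omega in the given G-atom."""
    pc = np.sum(p[cell])
    return np.array([np.sum(p[cell & (T < r)]) / pc for r in grid])

for t in np.unique(labels):
    F = F_omega(labels == t)
    # Nondecreasing and [0, 1]-valued: the properties collected into the
    # null set N in the proof hold everywhere here.
    assert np.all(np.diff(F) >= -1e-12) and 0.0 <= F[0] and F[-1] <= 1.0
```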
Chow, Yuan Shih and Teicher, Henry. Probability Theory: Independence, Interchangeability, Martingales, 3rd edn. Springer-Verlag, New York, 1997.
Durrett, Rick. Probability: Theory and Examples, 5th edn. Cambridge University Press, 2019.
Halmos, Paul R. Measure Theory. Van Nostrand, Princeton, N.J., 1950; Springer-Verlag, Berlin and New York, 1974.
Kallenberg, Olav. Foundations of Modern Probability, 3rd edn. Springer, 2021.
Parthasarathy, K. R. Probability Measures on Metric Spaces. AMS Chelsea Publishing, 1967.
Williams, David. Probability with Martingales. Cambridge University Press, 1991.