Z-HatA (we)blog devoted to finding better representations
danabo.github.io/zhat/
Mon, 24 May 2021 10:46:45 -0700
<h1>Bayesian Inference On 1st Order Logic</h1>
<p>David Chapman’s blog post titled <a href="https://meaningness.com/probability-and-logic">Probability theory does not extend logic</a> has stirred up some controversy. In it, Chapman argues that so-called Bayesian logic, as it is currently understood, is limited to <a href="https://en.wikipedia.org/wiki/Propositional_calculus">propositional logic</a> (0th order logic), but cannot generalize to <a href="https://en.wikipedia.org/wiki/Higher-order_logic">higher order logics</a> (e.g. <a href="https://en.wikipedia.org/wiki/First-order_logic">predicate logic</a>, a.k.a. 1st order logic), and thus cannot be a general foundation for inference from data under uncertainty.</p>
<p>Chapman provides a few counter-examples that supposedly demonstrate that doing Bayesian inference on statements in 1st order logic is incoherent. I think there is a lot of confusion surrounding this point because Chapman does not use proper probability notation. In the following article I show how Chapman’s examples can be properly written and made sense of using random variables. Hopefully this clarifies some things.</p>
<!--more-->
<p>It’s a good exercise to see what would happen if you tried to do Bayesian inference on statements in 1st order logic, but I make no endorsements on what you <em>should</em> do. Such constructions, though syntactically valid, are often undecidable (or non-computable), and otherwise intractable to even approximate. It’s also not clear that you gain more power from doing so (see my <a href="#interlude-a-tale-of-infinite-coin-tosses">section on infinite coin tossing</a>).</p>
<p>Admittedly, this article is obnoxiously long. It’s not necessary to read all of it. In summary, I …</p>
<ol>
<li><a href="#the-formalism">Explain how probability notation is supposed to work.</a></li>
<li><a href="#review-of-bayesian-propositional-0-th-order-logic">Review Bayesian 0th-order logic</a> and <a href="#philosophy-of-bayesian-probability">explain my understanding of Bayesian probability.</a></li>
<li><a href="#motivating-examples">Go through some examples that hopefully demonstrate natural ways that probability and 1st-order logic can interact.</a></li>
<li>(then finally) <a href="#david-chapmans-challenge-problems">reinterpret Chapman’s “challenge” problems</a>.</li>
</ol>
<p>I also posted a shorter version of this on <a href="https://www.lesswrong.com/posts/W8YscokXMiDnLKJ96/bayesian-inference-on-1st-order-logic">LessWrong</a>.</p>
<ul class="toc" id="markdown-toc">
<li><a href="#what-is-bayesian-anyway" id="markdown-toc-what-is-bayesian-anyway">What is “Bayesian” anyway?</a></li>
<li><a href="#the-formalism" id="markdown-toc-the-formalism">The Formalism</a> <ul>
<li><a href="#quick-review-of-probability-theory" id="markdown-toc-quick-review-of-probability-theory">Quick review of probability theory</a> <ul>
<li><a href="#general-random-variable-notation" id="markdown-toc-general-random-variable-notation">General random variable notation</a></li>
</ul>
</li>
<li><a href="#conditional-probability" id="markdown-toc-conditional-probability">Conditional probability</a></li>
<li><a href="#quantifiers-inside-probability" id="markdown-toc-quantifiers-inside-probability">Quantifiers inside probability</a></li>
<li><a href="#on-structured-possibility-spaces" id="markdown-toc-on-structured-possibility-spaces">On structured possibility spaces</a></li>
</ul>
</li>
<li><a href="#review-of-bayesian-propositional-0-th-order-logic" id="markdown-toc-review-of-bayesian-propositional-0-th-order-logic">Review of Bayesian Propositional (0-th order) Logic</a></li>
<li><a href="#philosophy-of-bayesian-probability" id="markdown-toc-philosophy-of-bayesian-probability">Philosophy of Bayesian Probability</a></li>
<li><a href="#motivating-examples" id="markdown-toc-motivating-examples">Motivating examples</a> <ul>
<li><a href="#motivating-example-1-unicorns" id="markdown-toc-motivating-example-1-unicorns">Motivating Example 1: Unicorns</a> <ul>
<li><a href="#definitional-uncertainty" id="markdown-toc-definitional-uncertainty">Definitional uncertainty</a></li>
</ul>
</li>
<li><a href="#motivating-example-2-rigged-poker" id="markdown-toc-motivating-example-2-rigged-poker">Motivating Example 2: Rigged Poker</a> <ul>
<li><a href="#probabilities-of-probabilities" id="markdown-toc-probabilities-of-probabilities">Probabilities of probabilities</a></li>
<li><a href="#probability-precision-part-i" id="markdown-toc-probability-precision-part-i">Probability precision (part I)</a></li>
<li><a href="#time-series" id="markdown-toc-time-series">Time-series</a></li>
<li><a href="#the-reality-of-probability" id="markdown-toc-the-reality-of-probability">The reality of probability</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#david-chapmans-challenge-problems" id="markdown-toc-david-chapmans-challenge-problems">David Chapman’s Challenge Problems</a> <ul>
<li><a href="#boojums-and-snarks" id="markdown-toc-boojums-and-snarks">Boojums and snarks</a></li>
<li><a href="#inferring-a-generality-from-instances" id="markdown-toc-inferring-a-generality-from-instances">Inferring a generality from instances</a></li>
<li><a href="#interlude-a-tale-of-infinite-coin-tosses" id="markdown-toc-interlude-a-tale-of-infinite-coin-tosses">Interlude: A tale of infinite coin tosses</a></li>
<li><a href="#probability-inside-a-quantifier-inside-a-probability" id="markdown-toc-probability-inside-a-quantifier-inside-a-probability">Probability inside a quantifier inside a probability</a> <ul>
<li><a href="#probability-precision-part-ii" id="markdown-toc-probability-precision-part-ii">Probability precision (part II)</a></li>
</ul>
</li>
<li><a href="#final-boss" id="markdown-toc-final-boss">Final boss</a></li>
</ul>
</li>
</ul>
<h2 id="what-is-bayesian-anyway"><a class="header-anchor" href="#what-is-bayesian-anyway">What is “Bayesian” anyway?</a></h2>
<p>Just to say a bit more about the confusions in this conversation …</p>
<p>For a while I had misunderstood what position Chapman is arguing against (I’m not sure I understand it now). Chapman’s writing is a continuation of the “great conversation” that took place throughout the 20th century, about the interplay between probability, epistemology and AI. Some camps believe that formal logic is the path to AI. An offshoot of that camp believes that formal logic combined with probability is the path to AI. The latter are often called “Bayesians”. If I understand correctly, Chapman is perhaps part of a third camp that believes formal methods alone cannot lead to AI.</p>
<p>In the aforementioned post, Chapman is addressing a particular link in the chain of justifications for the Bayesian camp: that probability extends formal methods in general. I’m not sure I fully get what that means, but the gist of it seems to be that, supposing you wanted to do logic under uncertainty, then there is exactly one formalism that you should use: probability. This has the benefit of simplifying the great epistemological debate in AI to: “The solution is either our method (the Bayesian method) or something totally unknown.”</p>
<p>In the case of propositional (0-th order) logic, “extending logic to include uncertainty” and “putting a probability distribution over logical propositions” turn out to be equivalent ideas. For higher-order logics (1st-order and above), there does not appear to be such an equivalence. I’ll just point out that Chapman isn’t saying that probabilistic 1st-order logic is impossible, but that it has not been figured out yet. He cites the Stanford Encyclopedia of Philosophy’s entry on <a href="https://plato.stanford.edu/entries/logic-probability/">logic and probability</a> on this point.</p>
<p>The problem is that I don’t think “Bayesianism” means the same thing today. For instance, those who advocate for Bayesian methods in machine learning are advocating for Bayesian models of any kind (these days Bayesian neural networks). Formal logic in particular has gone out of vogue (from what I can tell) as a path towards AI (though I think there is still ongoing <a href="https://en.wikipedia.org/wiki/Bayesian_network">Bayesian networks</a> research, which could count as Bayesian logic).</p>
<p>For someone like me who sees the Bayesian position in this much broader sense, I don’t know why you would want to formalize mathematics and/or logic inside probability theory. That just sounds like an unnecessary headache. The standard formalizations of mathematics (e.g. <a href="https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory">Zermelo–Fraenkel set theory</a> (ZF) or <a href="https://en.wikipedia.org/wiki/Type_theory">type theory</a>) allow you to define probability theory within them (that’s what the <a href="https://en.wikipedia.org/wiki/Probability_axioms">Kolmogorov axioms</a> do). Since any formalization of mathematics at least as powerful as ZF set theory is a higher-order logic, that should mean you can freely mix higher order logic and probability. Why do we need to make probability a foundation? We get what we want for free using the standard definitions of things.</p>
<h1 id="the-formalism"><a class="header-anchor" href="#the-formalism">The Formalism</a></h1>
<div class="kdmath">$$
\newcommand{\ex}{\exists}
\newcommand{\fa}{\forall}
\newcommand{\es}{\emptyset}
\newcommand{\and}{\wedge}
\newcommand{\or}{\vee}
\newcommand{\xor}{\veebar}
\newcommand{\subs}{\subseteq}
\newcommand{\mc}{\mathcal}
\newcommand{\mf}{\mathfrak}
\newcommand{\mb}{\mathbb}
\newcommand{\bs}{\boldsymbol}
\newcommand{\ol}{\overline}
\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\N}{\mb{N}}
\newcommand{\B}{\mb{B}}
\newcommand{\Z}{\mb{Z}}
\newcommand{\R}{\mb{R}}
\newcommand{\t}{\tau}
\newcommand{\th}{\theta}
\newcommand{\ep}{\epsilon}
\newcommand{\vep}{\varepsilon}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\tup}[1]{\left(#1\right)}
\newcommand{\atup}[1]{\left\langle#1\right\rangle}
\newcommand{\r}{\bs}
\newcommand{\bool}{\mathrm{Bool}}
\newcommand{\0}{\mathrm{false}}
\newcommand{\1}{\mathrm{true}}
\newcommand{\Mid}{\,\middle|\,}
\newcommand{\O}{\Omega}
\newcommand{\o}{\omega}
$$</div>
<!-- \r{X} is a random variable -->
<p>I’m going to use standard measure-theoretic probability (i.e. the <a href="https://en.wikipedia.org/wiki/Probability_axioms">Kolmogorov axioms</a>) defined within whatever standard formalization of mathematics you like (e.g. set theory or type theory). I won’t address undecidability. Just know that undecidability can be handled the same way that Solomonoff handles it, i.e. by using semimeasures instead of measures (see <a href="http://www.hutter1.net/ai/uaibook.htm">Marcus Hutter’s book</a> for an explanation of this). Meaning, probability may sum to less than 1, where the missing probability is placed on statements that are undecidable.</p>
<h2 id="quick-review-of-probability-theory"><a class="header-anchor" href="#quick-review-of-probability-theory">Quick review of probability theory</a></h2>
<p>For an in-depth review of probability theory, see my post: <a href="/articles/primer-probability-theory">Primer to Probability Theory and Its Philosophy</a>.</p>
<p>First, we need a set of possibilities, $\Omega$, called the <strong>sample set</strong> (or sample space). A random process chooses a particular sample $\omega \in \Omega$. The word “sample” here is not intended to mean a collection of datapoints. I prefer to think of $\o$ as a possible world state.</p>
<p>The triple $(\Omega, E, P)$ (or $(\Omega, E, \mu)$) is called a <a href="https://en.wikipedia.org/wiki/Probability_space">probability space</a>, which is a type of <a href="https://en.wikipedia.org/wiki/Measure_space">measure space</a>.<br />
$P$ (or sometimes $\mu$) is a measure, which is a function mapping subsets of $\Omega$ to real numbers between 0 and 1 (inclusive).<br />
$E \subseteq 2^\Omega$ is the event set, i.e. set of measurable sets of samples $\omega$. By default assume $E = 2^\Omega$. Don’t worry about this detail.<br />
Then $P: E \to [0,1]$.</p>
<p>A probability measure is different from a probability mass (or density) function. $P$ is a function of sets. A probability mass function, e.g. $F : \Omega \to [0,1]$, is a function of samples.</p>
<p>$P(\set{\omega_1, \omega_2, \omega_3})$ is valid.<br />
$P(\set{\omega})$ is valid.<br />
$P(\omega)$ is invalid. You should write $F(\omega)$ where $F$ is the probability mass function for $P$, i.e. $F(\omega) = P(\set{\omega})$.</p>
<p>Having two kinds of probability functions is annoying, so we can pull some notational tricks. For example, leave out function call parentheses:</p>
<div class="kdmath">$$
P\set{\omega} = P(\set{\omega})\,.
$$</div>
<p>Another trick is to use random variables:</p>
<div class="kdmath">$$
P(\r{X} = x)
$$</div>
<p>expands out to</p>
<div class="kdmath">$$
P\set{\omega \in \Omega \mid \r{X}(\omega) = x}
$$</div>
<p>where the contents of $P(\ldots)$ are a boolean expression (i.e. a logical proposition), and bold variables become functions of $\omega$ in the set-constructor version.</p>
<p>The bold variables are called random variables, which are (measurable; but don’t worry about this) functions from $\Omega$ to any type you wish (whatever is useful for expressing the idea you want to convey).</p>
<p>We can have the identity random variable $\r{\o} : \Omega \to \Omega : \omega \mapsto \omega$, and write</p>
<div class="kdmath">$$
P(\r{\o} = \omega^*) = P\set{\omega \in\Omega \mid \r{\o}(\omega) = \omega^*}
$$</div>
<p>Note that the expression $P(\r{\o})$ is invalid because the contents of $P(\ldots)$ are not a logical proposition. The contents must either be a subset of $\Omega$ (specifically an element of $E$ called an event), or a boolean expression that can be used to construct a subset of $\Omega$ (event in $E$).</p>
<p>Note that while I am bolding random variables to remove ambiguity, this is not notationally required, so long as it’s clear which symbols in $P(\ldots)$ are the random variables.</p>
<h3 id="general-random-variable-notation"><a class="header-anchor" href="#general-random-variable-notation">General random variable notation</a></h3>
<p>My definition may be a bit idiosyncratic. For any boolean-valued random variable, i.e. a (measurable) function $\r{Z}: \O \to \bool$, we define</p>
<div class="kdmath">$$
P(\r{Z}) := P\set{\o\in\O \mid \r{Z}(\o)}
$$</div>
<p>which is the $P$-measure of the set of all $\o\in\O$ s.t. $\r{Z}(\o)$ is true. $\set{\o\in\O \mid \mathrm{predicate}(\o)}$ is called <a href="https://en.wikipedia.org/wiki/Set-builder_notation#Sets_defined_by_a_predicate">set-builder notation</a>. Here, our boolean-valued function $\r{Z}$ acts as a logical predicate.</p>
<p>Combining that with the shorthand convention that arbitrary mathematical operations on random variables produces random variables, and we get the usual notation that</p>
<div class="kdmath">$$
P(\mathrm{expr}(\r{X}_1, \r{X}_2, \r{X}_3, \ldots)) = P\set{\omega \in \Omega \mid \mathrm{expr}(\r{X}_1(\omega), \r{X}_2(\omega), \r{X}_3(\omega), \ldots)}\,,
$$</div>
<p>where $\r{X}_1, \r{X}_2, \r{X}_3, \ldots$ are random variables of any type (not necessarily Boolean-valued) and $\mathrm{expr}(\r{X}_1, \r{X}_2, \r{X}_3, \ldots)$ is any boolean-valued expression involving arithmetic operators, functions, and logical expressions.</p>
<p>For example, if $\r{X}_1, \r{X}_2$ are real-valued, then $\r{X}_1 + \r{X}_2 = 5$ is a shorthand for the boolean-valued function $\o \mapsto (\r{X}_1(\o) + \r{X}_2(\o) = 5)$. Another example is $\r{X}_1^2 > 4 \and \sin(\r{X}_2) \in [0,1]$, which is a shorthand for the boolean-valued function $\o\mapsto(\r{X}_1(\o)^2 > 4 \and \sin(\r{X}_2(\o)) \in [0,1])$. In general, anything that you can symbolically do to regular variables $x_1, x_2, x_3 \ldots$ you can do to random variables $\r{X}_1, \r{X}_2, \r{X}_3, \ldots$.</p>
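<p>As a minimal sketch of this notation (the two-dice sample space and the helper names here are my own illustration, not anything from the formalism above): a finite sample set, a probability mass function, and $P(\mathrm{expr}(\r{X}_1,\r{X}_2))$ computed as the total mass of the set of samples where the expression holds.</p>

```python
# Finite sample set Omega: all outcomes of two fair dice.
Omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
F = {w: 1 / 36 for w in Omega}  # probability mass function (uniform)

def P(event):
    """Measure of the event {w in Omega | event(w) is True}."""
    return sum(F[w] for w in Omega if event(w))

# Random variables are just functions of the sample w.
X1 = lambda w: w[0]
X2 = lambda w: w[1]

# P(X1 + X2 = 5): the boolean expression becomes a predicate on w.
p = P(lambda w: X1(w) + X2(w) == 5)
assert abs(p - 4 / 36) < 1e-12  # (1,4), (2,3), (3,2), (4,1)
```

<p>Note that the bold-variable shorthand $\r{X}_1 + \r{X}_2 = 5$ corresponds exactly to the lambda passed to <code>P</code>: each random variable gets applied to $\o$ inside the set-builder.</p>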
<h2 id="conditional-probability"><a class="header-anchor" href="#conditional-probability">Conditional probability</a></h2>
<p>For sets $A,B\subseteq\O$ (called events) we define <br />
<span class="kdmath">$P(A \mid B) := P(A\cap B)/P(B)\,.$</span></p>
<p>Random variables inside this conditional notation expand out into sets just as before. Thus, for boolean-valued random variables $\r{Y}, \r{Z}$, we have</p>
<div class="kdmath">$$
\begin{aligned}
&P(\r{Z} \mid \r{Y}) \\
&\quad= P(\set{\o\in\O\mid\r{Z}(\o)}\mid\set{\o\in\O\mid\r{Y}(\o)}) \\
&\quad= P(\set{\o\in\O\mid\r{Z}(\o)}\cap\set{\o\in\O\mid\r{Y}(\o)})/P\set{\o\in\O\mid\r{Y}(\o)} \\
&\quad= P\set{\o\in\O\mid\r{Z}(\o) \and \r{Y}(\o)}/P\set{\o\in\O\mid\r{Y}(\o)} \\
&\quad= P((\r{Z}\and\r{Y})^{-1}(\1))/P(\r{Y}^{-1}(\1))\,.
\end{aligned}
$$</div>
<p>where $\r{Y}^{-1}(\1)$ is the preimage of $\r{Y}$ on $\1$, i.e. the set of all inputs $\o\in\O$ s.t. $\r{Y}(\o)$ is true. Likewise, $(\r{Z}\and\r{Y})^{-1}(\1)$ is the preimage of the function $\r{Z}\and\r{Y}$ on $\1$.</p>
<p>The intuitive way to think about the conditional probability operator is that it is a domain-restriction followed by a rescaling. $P(\cdot)$ is adding up probability over the domain $\O$, while $P(\cdot \mid \r{Y})$ is adding up rescaled-probability over the domain $\r{Y}^{-1}(\1)$. The probabilities are rescaled s.t. they sum to 1 over the restricted domain $\r{Y}^{-1}(\1)$.</p>
<p>In general, conditioning on knowns/observations is just a matter of restricting your possibility space to the subspace where the knowns/observations hold true. Pretty straightforward. Later on, we will encounter some gnarly conditional probabilities, so it is very helpful to keep in mind that this operation really denotes a simple idea.</p>
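<p>The restrict-then-rescale picture can be sketched directly (the tiny four-world space and variable names here are illustrative):</p>

```python
# Four possible worlds, each a pair of booleans, uniform prior.
Omega = [(a, b) for a in (True, False) for b in (True, False)]
F = {w: 0.25 for w in Omega}

def P(event):
    return sum(F[w] for w in Omega if event(w))

def P_given(event, given):
    # Restrict to the worlds where `given` holds, then rescale so the
    # restricted masses sum to 1:  P(A | B) = P(A and B) / P(B).
    return P(lambda w: event(w) and given(w)) / P(given)

Z = lambda w: w[0]            # true in 2 of 4 worlds
Y = lambda w: w[0] or w[1]    # true in 3 of 4 worlds

assert abs(P_given(Z, Y) - 2 / 3) < 1e-12  # 0.5 / 0.75
```
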
<h2 id="quantifiers-inside-probability"><a class="header-anchor" href="#quantifiers-inside-probability">Quantifiers inside probability</a></h2>
<p>There is no reason why we cannot build a boolean-valued function involving logical quantifiers. Just by using the standard formulation of probability and random variables, we get probabilities of quantifiers for free.</p>
<p>Before getting into the interpretation of such constructions, i.e. “what does it mean to take the probability of a quantifier and why would you want to?”, let’s just go through the notational machinery to see what happens if we blindly obey our definitions.</p>
<p><a href="https://en.wikipedia.org/wiki/First-order_logic">1st order logic</a> (aka predicate logic) is logic with quantifiers, type sets, and predicates. Here, a boolean-valued random variable (function on samples $\o$) acts like a predicate.</p>
<p>The standard <a href="https://en.wikipedia.org/wiki/Quantifier_(logic)">quantifiers</a> are “for-all” and “there-exists”. For example, $\fa a \in A : f(a)$ is a logical proposition (it is a statement that is not a function of any free variable) which says that for all elements $a$ in $A$, $f(a)$ is true, where $f : A \to \bool$ is some predicate, i.e. boolean-valued function on set $A$. We could also make a predicate containing a quantifier, e.g. $g : B \to\bool : b \mapsto \fa a \in A: f(a,b)$ for some multi-argument predicate $f : A\times B \to \bool$ (I’m using <a href="https://math.stackexchange.com/a/1224970">“mapsto” function notation</a>). In this way we can nest predicates or quantifiers.</p>
<p>If we wanted to measure the probability that $\fa x \in X : f(x)$ is true, we need to convert this proposition (i.e. not a function) into a function on sample space $\O$. There are two avenues for doing so: you can make $X$ a function on $\O$, or make $f$ a function on $\O$ (or both).</p>
<p>So we have</p>
<div class="kdmath">$$
P(\fa x \in \r{X} : \r{f}(x)) = P\{\omega \in \Omega \mid (\fa x \in \r{X}(\omega) : \r{f}(\omega)(x))\}\,,
$$</div>
<div class="kdmath">$$
P(\ex x \in \r{X} : \r{f}(x)) = P\{\omega \in \Omega \mid (\ex x \in \r{X}(\omega) : \r{f}(\omega)(x))\}\,.
$$</div>
<p>Here, $\r{X}$ is a set-valued random variable, and $\r{f}$ is a function-valued random variable. Meaning, $\r{X}:\O \to U$ where $U$ is a set of sets (or perhaps the powerset of some set), and $\r{f} : \Omega \to \mathrm{Predicates}$ where $\mathrm{Predicates}$ is a set of functions with signatures $X' \to \bool$ for all $X' \in U$. Assume that probability measure $P$ is chosen such that for all $\omega \in \Omega$, $\r{f}(\omega)$ is a valid predicate for domain $\r{X}(\omega)$. This allows us to side-step type issues.</p>
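<p>Blindly obeying the definitions above, here is a sketch of $P(\fa x \in \r{X} : \r{f}(x))$ and $P(\ex x \in \r{X} : \r{f}(x))$ over a three-world possibility space (the specific worlds, masses, and predicate are made up for illustration; each sample bundles a domain and a predicate):</p>

```python
# Each sample w carries a domain X(w), a predicate f(w), and a mass.
samples = [
    {"X": {1, 2},     "f": lambda x: x > 0, "mass": 0.5},
    {"X": {1, 2, -3}, "f": lambda x: x > 0, "mass": 0.3},
    {"X": set(),      "f": lambda x: x > 0, "mass": 0.2},
]

def P(event):
    return sum(w["mass"] for w in samples if event(w))

# P(forall x in X(w): f(w)(x)) -- vacuously true on the empty domain.
p_forall = P(lambda w: all(w["f"](x) for x in w["X"]))
# P(exists x in X(w): f(w)(x)) -- false on the empty domain.
p_exists = P(lambda w: any(w["f"](x) for x in w["X"]))

assert abs(p_forall - 0.7) < 1e-12  # worlds 1 and 3
assert abs(p_exists - 0.8) < 1e-12  # worlds 1 and 2
```

<p>Note how the empty-domain world contributes to the for-all probability but not the there-exists probability, exactly as the usual semantics of quantifiers dictate.</p>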
<p>A brief prelude on the interpretation of what we’ve just created: $\r{X}$ is the <strong>domain</strong> of quantification. If the domain is a random variable, we can take that to mean we have uncertainty about what domain to quantify over. If a predicate is a random variable, we can take that to mean we have uncertainty about what predicate to use. I call both of these kinds of uncertainties <strong>definitional uncertainty</strong>. More on that later.</p>
<h2 id="on-structured-possibility-spaces"><a class="header-anchor" href="#on-structured-possibility-spaces">On structured possibility spaces</a></h2>
<p>In general, I want to be able to suppose that $\o\in\O$ is a <strong>structured type</strong> without having to go into what that structure might be every time. By “structured type”, I mean it is composed of other types just like a <a href="https://en.wikipedia.org/wiki/Composite_data_type">struct</a> in programming. Running with the programming analogy, suppose $\O$ is the set of all instances of:</p>
<div class="kdmath">$$
\begin{aligned}
&\texttt{struct Primitives \{ } \\
&\qquad X : 2^\Z, \\
&\qquad f : \mathrm{Predicates} \\
&\qquad v : \bool^n, \\
&\qquad \mathrm{Noise} : \set{0,1}^m,\\
&\texttt{\}}
\end{aligned}
$$</div>
<p>where $2^\Z$ is the powerset (set of all subsets) of $\Z$ (if we include infinite subsets we are straying away from the computing analogy, but hey, this is math!). This means that any given $\o\in\O$ specifies a domain $X$, a predicate $f$, a vector of primitive propositions $v$, and a noise vector $\mathrm{Noise}$.</p>
<p>Then it is straightforward to define the following random variables:</p>
<ul>
<li>$\r{X} : \O \to 2^\Z$ is the domain of quantification, e.g. $\fa x \in\r{X} : \r{f}(x)$.</li>
<li>$\r{f} : \O \to \mathrm{Predicates}$ is a predicate.</li>
<li>$\r{v} : \O \to \bool^n$ is a vector of Booleans which we can take to be primitive propositions.</li>
<li>$\r{\mathrm{Noise}} : \O \to \set{0,1}^m$ is a vector of noise.</li>
</ul>
<p>So when I suppose there is a possibility space $\O$ without saying anything about its structure, and I proceed to define a bunch of random variables with very different types, e.g. $\r{X} : \O \to 2^\Z, \r{f} : \O \to \mathrm{Predicates}, \r{v} : \O \to \bool^n, \r{\mathrm{Noise}} : \O \to \set{0,1}^m$, you can see how $\o\in\O$ might be defined such that the random variables are mutually compatible with one-another, i.e. the same $\o$ can simultaneously encode wildly different data-types.</p>
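<p>Running further with the programming analogy, the struct above might be sketched like this (the field contents are illustrative; the point is just that one sample $\o$ can bundle wildly different data-types, and each random variable is a projection onto one field):</p>

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

@dataclass(frozen=True)
class Primitives:
    X: FrozenSet[int]            # domain of quantification
    f: Callable[[int], bool]     # a predicate
    v: Tuple[bool, ...]          # primitive propositions
    noise: Tuple[int, ...]       # noise bits

# Random variables are projections Omega -> field type.
X = lambda w: w.X
f = lambda w: w.f
v = lambda w: w.v

# One particular world w, encoding all four fields at once.
w = Primitives(X=frozenset({1, 2}), f=lambda x: x > 0,
               v=(True, False), noise=(0, 1))

# The quantified proposition "forall x in X(w): f(w)(x)" as a predicate on w.
assert all(f(w)(x) for x in X(w))
```
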
<h1 id="review-of-bayesian-propositional-0-th-order-logic"><a class="header-anchor" href="#review-of-bayesian-propositional-0-th-order-logic">Review of Bayesian Propositional (0-th order) Logic</a></h1>
<div class="kdmath">$$
\newcommand{\P}{\mc{P}}
\newcommand{\Q}{\mc{Q}}
$$</div>
<p>It is instructive to see what so-called Bayesian propositional logic typically refers to. I’ll go through it using the same notation and perspective I’ve introduced above.</p>
<p>In <a href="https://en.wikipedia.org/wiki/Propositional_calculus">propositional logic</a> you start with a set of primitive propositions, which I’ll denote here as $\P_1, \P_2, \P_3, \ldots$ (not to be confused with probability measure $P$). Derived propositions are formed by logical operations on primitive propositions, e.g. $\Q := (\P_1 \and \P_2) \or \neg \P_3$. We wish to discover what <strong>universe</strong> we actually inhabit. That is to say, what are the <em>actual</em> Boolean values of all the primitive propositions $\P_1, \P_2, \P_3, \ldots$. For example, if you observe $\P_2 = \1,\ \P_3 = \1$ and $\Q = \0$, then we can uniquely determine that $\P_1 = \0$ (you can verify this yourself by writing out the <a href="https://en.wikipedia.org/wiki/Truth_table">truth table</a>).</p>
<p>The probabilistic version of propositional logic puts a probability distribution over the primitives $\P_1, \P_2, \P_3, \ldots$, which induces a distribution over the derived propositions such as $\Q$.</p>
<p>Let $P$ be a probability measure on $\O = \bool^n$, the set of all $n$-vectors of truth values for $\P_1, \ldots, \P_n$. Then, for example, the probability that $\P_1 \and \neg\P_2 \and \neg\P_3 \and \ldots \and \P_n$ is true is the probability measure of the boolean vector</p>
<div class="kdmath">$$
P\set{\atup{\1,\0,\0,\ldots,\1}}\,.
$$</div>
<p>We can make this notationally easier by defining our propositions as random variables, i.e. $\r{\P}_i : \bool^n \to \bool : \atup{b_1, \ldots, b_n} \mapsto b_i$ for $i \in \set{1,\ldots,n}$ returns the $i$-th boolean value. Then we can write</p>
<div class="kdmath">$$
\begin{aligned}
&P(\r{\P}_1 \and \neg\r{\P}_2 \and \neg\r{\P}_3 \and \ldots \and \r{\P}_n) \\
&\quad= P\set{\o\in\bool^n \mid \r{\P}_1(\o) \and \neg\r{\P}_2(\o) \and \neg\r{\P}_3(\o) \and \ldots \and \r{\P}_n(\o)} \\
&\quad= P\set{\atup{\1,\0,\0,\ldots,\1}}\,.
\end{aligned}
$$</div>
<p>We can also compute marginal probabilities, e.g. $P(\r{\P}_1) = P\set{\o\in\bool^n \mid \r{\P}_1(\o)} = P\set{\atup{\1, \_, \_, \ldots} \in \bool^n}$ which is the total probability across all Boolean vectors where the first entry is true.</p>
<p>We can straightforwardly compute the probability of more complex expressions, e.g.</p>
<div class="kdmath">$$
P((\r{\P}_1 \and \r{\P}_2) \or \neg \r{\P}_3) = P\set{\o\in\bool^n \mid (\r{\P}_1(\o) \and \r{\P}_2(\o)) \or \neg \r{\P}_3(\o)}\,.
$$</div>
<p>which is the sum of probability of all Boolean $n$-vectors s.t. the proposition $\Q$ from above is true. If we define a derived random variable $\r{\Q} = (\r{\P}_1 \and \r{\P}_2) \or \neg \r{\P}_3$, then we can simply write,</p>
<div class="kdmath">$$
P(\r{\Q}) = P((\r{\P}_1 \and \r{\P}_2) \or \neg \r{\P}_3) = \ldots
$$</div>
<p>Using random variables, we get probabilistic propositional logic “for free”, in the sense that the standard axioms of probability and mathematics already allow us to combine probability with logical statements in the way we’d like.</p>
<p>We can also infer unknowns from knowns. Above I gave the example of uniquely determining $\P_1=\0$ from $\P_2=\1,\P_3=\1,\Q=\0$. Suppose instead we want to infer $\Q$ only given $\P_1=\1,\P_2=\0$. Clearly $\Q$ is underdetermined without knowing $\P_3$, but we can calculate the probability of $\Q=\1$ given $\P_1=\1,\P_2=\0$,</p>
<div class="kdmath">$$
\begin{aligned}
&P(\r{\Q} \mid \r{\P}_1 \and \neg\r{\P}_2) \\
&\quad= P\set{\o\in\bool^n \mid \r{\Q}(\o) \and \r{\P}_1(\o) \and \neg\r{\P}_2(\o)} / \mc{Z} \\
&\quad= P\set{\o\in\bool^n \mid ((\r{\P}_1(\o) \and \r{\P}_2(\o)) \or \neg \r{\P}_3(\o)) \and \r{\P}_1(\o) \and \neg\r{\P}_2(\o)} / \mc{Z} \\
&\quad= P\set{\o\in\bool^n \mid \r{\P}_1(\o) \and \neg\r{\P}_2(\o) \and \neg\r{\P}_3(\o)} / \mc{Z}
\end{aligned}
$$</div>
<p>where $\mc{Z} = P\set{\o\in\bool^n \mid \r{\P}_1(\o) \and \neg\r{\P}_2(\o)}$ is the <a href="https://en.wikipedia.org/wiki/Normalizing_constant">normalizing constant</a>. (Going forward I will denote $\mc{Z}$ as the normalizing constant without expanding it out, to keep things from getting cluttered, and to show that it represents a simple idea: that the probabilities need to be rescaled to the new domain being conditioned on.)</p>
<p>Depending on your choice of $P$, the probability $P(\r{\Q} \mid \r{\P}_1 \and \neg\r{\P}_2)$ may be anywhere between 0 and 1, indicating that $\Q$ is underdetermined for the given knowns.</p>
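<p>The inference above can be sketched concretely for $n=3$ (the uniform prior here is an assumption just to get a definite number; any other choice of $P$ would give a different posterior):</p>

```python
from itertools import product

# Omega = Bool^3: the 8 possible universes, with a uniform prior.
Omega = list(product([True, False], repeat=3))
F = {w: 1 / 8 for w in Omega}

def P(event):
    return sum(F[w] for w in Omega if event(w))

P1 = lambda w: w[0]
P2 = lambda w: w[1]
P3 = lambda w: w[2]
Q  = lambda w: (P1(w) and P2(w)) or not P3(w)  # derived proposition

# P(Q | P1 and not P2): restrict to the given, rescale by Z = P(given).
given = lambda w: P1(w) and not P2(w)
posterior = P(lambda w: Q(w) and given(w)) / P(given)
assert abs(posterior - 0.5) < 1e-12  # Q hinges entirely on the unknown P3
```

<p>Under the uniform prior the answer is exactly 1/2, reflecting that, given $\P_1=\1,\P_2=\0$, the proposition $\Q$ reduces to $\neg\P_3$ and $\P_3$ is unobserved.</p>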
<p>If you wanted to model stochastic propositions, e.g. $\Q$’s dependency on $\P_1,\P_2,\P_3$ is stochastic, you can just designate part of the sample $\o$ as noise (typically unobserved) and have $\Q$ depend on it. For example, let $\r{N} : \O \to \bool$ be a noise random variable. Define $\r{\Q}' = \r{\Q} \xor \r{N}$ where $\xor$ denotes <a href="https://en.wikipedia.org/wiki/Exclusive_or">exclusive “or”</a>. Then $\r{N}$ randomly flips $\r{\Q}$. The probability that $\r{\Q}'$ is true given the truth values of $\P_1,\P_2,\P_3$ is given by,</p>
<div class="kdmath">$$
P(\r{\Q}' \mid \r{\P}_1 = b_1 \and \r{\P}_2 = b_2 \and \r{\P}_3 = b_3)\,.
$$</div>
<p>Your choice of $P$ determines the distribution on $\r{N}$, i.e. how noisy $\r{\Q}'$ is (a low chance of bit flipping means $\r{\Q}'$ is less noisy).</p>
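<p>A sketch of this noisy construction: the sample gains one noise bit, and the flip probability <code>eps</code> (a made-up value here) is baked into the prior $P$ rather than into the logic:</p>

```python
from itertools import product

eps = 0.1  # chance the noise bit flips Q (an illustrative choice)

# Omega = Bool^4: (P1, P2, P3, N). Prior: uniform on propositions,
# with the noise bit true with probability eps, independently.
Omega = list(product([True, False], repeat=4))
F = {w: (1 / 8) * (eps if w[3] else 1 - eps) for w in Omega}

def P(event):
    return sum(F[w] for w in Omega if event(w))

Q  = lambda w: (w[0] and w[1]) or not w[2]
N  = lambda w: w[3]
Qp = lambda w: Q(w) != N(w)  # Q' = Q xor N

# Given P1, P2, P3, the value of Q is determined, so Q' is true
# iff the noise bit did not flip it.  Here Q is true:
given = lambda w: w[0] and w[1] and w[2]
p = P(lambda w: Qp(w) and given(w)) / P(given)
assert abs(p - (1 - eps)) < 1e-12
```
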
<p>Note that I assumed we have finitely many primitive propositions. We generally wouldn’t have infinitely many, because if we think of primitive propositions as axioms, that would be like having an infinite number of axioms, which is generally more powerful than finitely many axioms. 1st-order logic can be thought of as a kind of logic that allows for infinitely many propositions, e.g. a predicate $f$ applied to $\N$ generates the propositions $f(0), f(1), f(2), \ldots$. Later on, I will cover tricky issues that come about when $\O$ is an infinite Cartesian product.</p>
<h1 id="philosophy-of-bayesian-probability"><a class="header-anchor" href="#philosophy-of-bayesian-probability">Philosophy of Bayesian Probability</a></h1>
<p>The core perspective driving everything that will follow, is that the sample set $\O$ represents our <strong>possible universes</strong> (or possible worlds, depending on your preferred aesthetic), and we are trying to figure out which $\o\in\O$ is actually the case by conditioning on observations to narrow down $\O$ (conditioning on givens is a domain restriction). We will do so by starting with a probability measure $P$ on $\O$ and then progressively narrow down the possibility space by conditioning on observations (i.e. givens). Note that $P$ and $\O$ are never modified during the course of “learning”, i.e. observing data and updating beliefs. (I sympathize if this is counterintuitive. It reminds me of <a href="https://en.wikipedia.org/wiki/Anamnesis_(philosophy)">anamnesis</a>).</p>
<p>The term “Bayesian” doesn’t have a precise meaning, so I will give it one for the purposes of this article. When I say I am being <strong>Bayesian</strong>, or I am using <strong>Bayesian probability</strong>, that indicates that I have a possibility space $\O$ and a probability measure $P$ on $\O$ called the <strong>prior</strong>. “Bayesian” is in contrast to a framework where there is no probability measure on $\O$, which requires a different way of updating on observations. I’m <strong>NOT</strong> saying that one framework is better than the other. The problem posed is how Bayesian probability can incorporate quantifiers in 1st order logic, and I am meeting that challenge.</p>
<p>It is true that this Bayesian framework implies a particular interpretation on what probability <em>does</em>. The conventional interpretation of “the probability of x” is that it quantifies the fraction of outcomes where “x happens” out of a total number of independent trials. In contrast, I am not supposing that we have repeated trials of anything. I am instead supposing that one particular reality $\o$ is the case out of a space of possible realities $\O$, <strong>for all time</strong>. Nothing needs to be repeating (though it can if you choose the right $\O$ and $P$). In that sense, the Bayesian framework is different from the <strong>frequentist</strong> framework of independent repeating trials (again “frequentist” has no precise definition so I’m giving it one here, and I am not saying that one is better than the other).</p>
<p>At the same time, the Bayesian framework is agnostic to whether probability is objective or subjective. The math is the same either way. There are actually many different notions of Bayesian probability. Conventionally, $\O$ and $P$ are regarded as an agent’s model rather than literally representing the environment. $P$ might reflect what an agent ought to rationally believe (objective Bayesian) or what an agent chooses to believe (subjective Bayesian). Either way, $P\set{\o}$ (or $P(E)$ for some event $E\subset\O$) can be interpreted as a representation of how plausible the agent believes $\o$ is.</p>
<h1 id="motivating-examples"><a class="header-anchor" href="#motivating-examples">Motivating examples</a></h1>
<p>Before addressing Chapman’s Bayesian 1st order logic examples, I want to demonstrate that there are sensible use cases for combining Bayesian probability and 1st order logic. I’ll go through two motivating examples, which also serve as intuition pumps for when we tackle Chapman’s examples later.</p>
<h2 id="motivating-example-1-unicorns"><a class="header-anchor" href="#motivating-example-1-unicorns">Motivating Example 1: Unicorns</a></h2>
<p>What is the probability that there exists a unicorn?<br />
What is the probability that all unicorns have one horn?</p>
<p>These may seem like absurd questions, but they can be made rigorous.<br />
Suppose I model reality in the way we did above. I have a space of primitive states $\Omega$, and derived objects which are functions on $\Omega$.</p>
<p>An extreme example would be that I have some kind of model of physics, and the state of the universe $\omega \in \Omega$ is a space-time block of matter-energy. I also have a unicorn recognition function $\mathrm{IsUnicorn} : \zeta \to \bool$ which takes as input a subset of matter-energy in some space-time region of $\omega$, denoted as $\zeta$, and returns whether $\zeta$ is a unicorn. We can think of $\zeta$ as a <strong>substate</strong> of $\omega$, since $\zeta$ is some piece of it. Specifically, $\zeta = \mathrm{Slice}_{R}(\omega)$ where $R$ is a spacetime region of $\omega$. Since $\mathrm{Slice}_{R}$ is a function of state, it is also a random variable (assuming it is measurable). If we construct the set of all slices obtained at all spacetime regions $Z(\omega) = \set{\mathrm{Slice}_{R}(\omega)}_R$, then this too is a random variable. Then we can query</p>
<div class="kdmath">$$
\ex \zeta \in Z(\omega) : \mathrm{IsUnicorn}(\zeta)\,,
$$</div>
<p>which asks if there exists a slice of $\omega$ that is a unicorn, in effect, does there exist a unicorn (given that the universe has state $\omega$).</p>
<p>If we put a prior (probability measure) $P$ over state space $\Omega$ (reflecting our lack of knowledge about what state the universe is actually in), then we can write</p>
<div class="kdmath">$$
\begin{aligned}
& P(\ex \zeta \in \bs{Z} : \mathrm{IsUnicorn}(\zeta)) \\
&\quad = P\set{\omega \in \Omega \mid \ex \zeta \in \bs{Z}(\omega) : \mathrm{IsUnicorn}(\zeta)}\,.
\end{aligned}
$$</div>
<p>I’m using bold to denote what is a random variable, though this is not notationally necessary.</p>
<p>Note that this probability is not a property of the universe, but a property of our model of the universe (the sample space $\Omega$ and probability measure $P$).</p>
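<p>To make the construction concrete, here is a minimal sketch in Python of the unicorn-existence query over a toy finite sample space. The creature encoding, the three states, and the prior masses are all invented for illustration:</p>

```python
from fractions import Fraction

# Hypothetical toy model: each primitive state omega is a tuple of "creatures",
# and each creature is a (species, num_horns) pair. All names and numbers
# here are illustrative, not from the article's physics model.
def is_unicorn(creature):
    species, num_horns = creature
    return species == "horse" and num_horns == 1

# A tiny sample space Omega with a prior P (masses sum to 1).
omegas = {
    (("horse", 0), ("goat", 2)):  Fraction(1, 2),  # no unicorns
    (("horse", 1),):              Fraction(1, 4),  # one unicorn
    (("horse", 1), ("horse", 0)): Fraction(1, 4),  # one unicorn
}

# P(exists zeta in Z(omega) : IsUnicorn(zeta)) = total prior mass on the
# event {omega | some slice of omega is a unicorn}.
p_exists = sum(p for omega, p in omegas.items()
               if any(is_unicorn(c) for c in omega))
print(p_exists)  # 1/2
```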
<h3 id="definitional-uncertainty"><a class="header-anchor" href="#definitional-uncertainty">Definitional uncertainty</a></h3>
<p>Another source of uncertainty about whether unicorns exist, is from <strong>definitional uncertainty</strong>. I might be unsure about what constitutes a unicorn. For example, is any horse with a horn considered a unicorn? If my model includes horses, I should be able to compute the probability that there exist horses with horns. Can a unicorn have two horns? If I entered a VR simulation and encountered a unicorn, does that count as unicorns existing? Rather than commit arbitrarily to a particular definition, I can extend Bayesian evenhandedness to choices of definition, and more broadly, choices of ontology (how I conceptually divide up experience), by creating a mixture over the possibilities.</p>
<p>For example, suppose I am unsure about whether to allow unicorns to have two horns (as well as one). Let’s suppose I have two versions of my function from above: $\mathrm{IsUnicorn}_1(\zeta)$ is strict and only allows unicorns to have exactly one horn, while $\mathrm{IsUnicorn}_2(\zeta)$ is lax and allows unicorns to have one or two horns. Then, if I choose $\mathrm{IsUnicorn}_1$ as my definition for unicorn, all unicorns have one horn. Otherwise, if I choose $\mathrm{IsUnicorn}_2$ then not all unicorns necessarily have one horn. Suppose we also have an auxiliary function $\mathrm{NumHorns} : \zeta \to \N$, then we can formalize the query:</p>
<div class="kdmath">$$
\fa \zeta \in Z(\omega) : (\mathrm{IsUnicorn}_n(\zeta) \implies \mathrm{NumHorns}(\zeta)=1)\,,
$$</div>
<p>for $\omega \in \Omega$ and $n \in \set{1,2}$.</p>
<p>Now if we consider $n$ to be part of our primitive state and put a prior $P$ over the combined state $(\omega,n) \in \Omega\times\set{1,2}$, we can write</p>
<div class="kdmath">$$
\begin{aligned}
& P(\fa \zeta \in \bs{Z} : (\bs{\mathrm{IsUnicorn}}(\zeta) \implies \mathrm{NumHorns}(\zeta)=1)) \\
&\quad = P\set{(\omega,n) \in \Omega\times\set{1,2} \mid \fa \zeta \in \bs{Z}(\omega) : (\bs{\mathrm{IsUnicorn}}_n(\zeta) \implies \mathrm{NumHorns}(\zeta)=1)}\,.
\end{aligned}
$$</div>
<p>This probability is strictly between 0 and 1 if I put non-zero probability on $\mathrm{IsUnicorn}_1$ and $\mathrm{IsUnicorn}_2$, since the proposition is true for the former and false for the latter.</p>
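<p>As a sketch of how such a mixture over definitions behaves, the following toy Python model contains a single two-horned horse (all creatures, definitions, and prior masses are made up for illustration): the universal statement holds vacuously under the strict definition and fails under the lax one, so the probability lands strictly between 0 and 1:</p>

```python
from fractions import Fraction

# Creatures are (species, num_horns); two hypothetical unicorn definitions.
def is_unicorn_1(c):                      # strict: exactly one horn
    return c[0] == "horse" and c[1] == 1
def is_unicorn_2(c):                      # lax: one or two horns
    return c[0] == "horse" and c[1] in (1, 2)

def num_horns(c):
    return c[1]

definitions = {1: is_unicorn_1, 2: is_unicorn_2}

# One physical state containing a two-horned horse, plus a 50/50 prior
# over the combined state (omega, n).
omega = (("horse", 2), ("goat", 2))
prior = {(omega, 1): Fraction(1, 2), (omega, 2): Fraction(1, 2)}

def all_unicorns_one_horn(om, n):
    """The query: forall zeta, IsUnicorn_n(zeta) implies NumHorns(zeta) = 1."""
    return all(num_horns(c) == 1 for c in om if definitions[n](c))

p = sum(mass for (om, n), mass in prior.items() if all_unicorns_one_horn(om, n))
print(p)  # 1/2: vacuously true under the strict definition, false under the lax one
```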
<p>In a sense, you can think of an agent with such a prior as supposing two different kinds of universes, each with state-uncertainty. In one universe unicorns have exactly one horn. In the other universe, they can have one or two horns. At this point, it is more sensible to consider $P$ and $\O$ to be the agent’s model, rather than some physically meaningful possibility space (or perhaps a combination of both).</p>
<p>Note that definitional uncertainty is different from <a href="https://www.lesswrong.com/tag/logical-uncertainty">logical uncertainty</a>, which is uncertainty about necessary logical implications from given axioms due to computational limitations. That is to say, logical uncertainty is uncertainty about what is logically true within a formal system (given infinite compute) vs uncertainty about what formalisms are to be used (informally, uncertainty about how to slice up your perceived reality into definitions and concepts).</p>
<h2 id="motivating-example-2-rigged-poker"><a class="header-anchor" href="#motivating-example-2-rigged-poker">Motivating Example 2: Rigged Poker</a></h2>
<p>Suppose I’m playing poker, and I have reason to suspect the dealer has rigged the game by removing cards from the deck. If the set of all cards in a full deck is $\mathrm{FullDeck}$, then a sub-deck is some subset. Let $\mc{D} = 2^\mathrm{FullDeck}$ be the set of all sub-decks. Then for some deck $D \in\mc{D}$, I can pose the logical query:</p>
<div class="kdmath">$$
\ex c \in D : \mathrm{Ace}(c)\,,
$$</div>
<p>meaning “does there exist an ace in the deck?”</p>
<p>Let’s call each possible deck $D \in \mc{D}$ a <strong>hypothesis</strong>. The dealer secretly selects deck $D^* \in \mc{D}$ before play begins, i.e. $D^*$ is the true hypothesis. On each round the dealer shuffles the same deck $D^*$ and deals our hands $h_p, h_d \subset D^*$, where $h_p$ is the player’s hand (us) and $h_d$ is the dealer’s hand. $h_p, h_d$ are disjoint sets, each containing 5 cards. To simplify things, let’s forget player actions and suppose we immediately show our hands after being dealt.</p>
<p>Each round, hands are uniformly sampled from $D^*$, and the rounds are independent.</p>
<p>Let $\mc{X}_D$ be the set of all pairs of hands, i.e. $(h_p, h_d) \in \mc{X}_D$. Let $\mu_D$ be the probability measure on $\mc{X}_D$ corresponding to uniformly drawing 10 cards from deck $D$ without replacement (resulting in the two hands). (I use $\mu$ instead of $P$ for notational clarity, because we will soon have multiple different probability measures.) We call $\mc{X}_D$ our <strong>observation space</strong> and $(h_p,h_d) \in \mc{X}_D$ is our <strong>observation</strong> for the round. $\mu_D$ is the <strong>data distribution</strong> under hypothesis $D$. We can compute derived functions like $\mathrm{Wins} : \mc{X}_D \to \bool$, where $\mathrm{Wins}(h_1, h_2)$ returns true if $h_1$ wins over $h_2$. Now we can calculate the probability of winning given we know $D$:</p>
<div class="kdmath">$$
\mu_D(\mathrm{Wins}(\r{H_p}, \r{H_d})) = \mu_D\set{x \in \mc{X}_D \mid \mathrm{Wins}(\r{H_p}(x), \r{H_d}(x))}\,.
$$</div>
<p>where $\r{H_p}(h_p, h_d) = h_p$ and $\r{H_d}(h_p, h_d) = h_d$ are random variables corresponding to the player and dealer hands. Assume that $\mu_D\set{(h_p, h_d)} = 0$ if those hands contain cards not in $D$.</p>
<p>Note that because the hands are dealt i.i.d. from $\mu_{D^*}$ and we play repeatedly, $\mu_{D^*}$ encodes a frequentist probability distribution. However, because $D^*$ is selected once, there may not be any objective probability associated with its selection. So, to reflect our uncertainty, let’s put a prior probability measure $\pi$ over the set of all decks $\mc{D}$.</p>
<p>Now, let’s compute the probability of winning, while taking our uncertainty about which deck was chosen into account. Let $\xi$ be a probability measure over the joint space $\Omega = \set{(D, x) \mid D \in\mc{D},\ x\in\mc{X}_D}$. We call $\xi$ a <strong>mixture</strong>, and it is derived from $\mu$ and $\pi$, since the marginal probability of getting hands $(h_p, h_d)$ is a $\pi$-weighted sum of $\mu$-probabilities:</p>
<div class="kdmath">$$
\begin{aligned}
& \xi(\r{H_p}=h_p, \r{H_d}=h_d) \\
&\quad= \xi\set{(D, (h_p, h_d)) \mid D \in \mc{D}} \\
&\quad= \sum_{D\in\mc{D}}\pi\set{D} \mu_D\set{(h_p, h_d)}\,.
\end{aligned}
$$</div>
<p>The probability of winning is:<br />
<span class="kdmath">$\begin{aligned}
& \xi(\mathrm{Wins}(\r{H_p}, \r{H_d})) \\
& \quad = \xi\set{D \in \mc{D},\ x \in \mc{X}_D \mid \mathrm{Wins}(\r{H_p}(x), \r{H_d}(x))} \\
&\quad = \sum_{D\in\mc{D}}\pi\set{D} \mu_D\set{x \in \mc{X}_D \mid \mathrm{Wins}(\r{H_p}(x), \r{H_d}(x))}\,,
\end{aligned}$</span></p>
<p>which is the sum over $D\in\mc{D}$ of each probability of winning given $D$ weighted by $\pi\set{D}$.</p>
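<p>To see the $\pi$-weighted sum in action, here is a stripped-down Python sketch: cards are integers, a hand is a single card, the higher card wins, and the prior over sub-decks is uniform. This is a toy stand-in for the poker setup, not an implementation of it:</p>

```python
from fractions import Fraction
from itertools import combinations, permutations

# Toy stand-in: integer cards, one-card hands, higher card wins.
FULL_DECK = (1, 2, 3, 4)

def mu_win(deck):
    """mu_D(Wins): uniform ordered draw of two distinct cards from deck."""
    draws = list(permutations(deck, 2))
    return Fraction(sum(1 for hp, hd in draws if hp > hd), len(draws))

# Hypotheses: every sub-deck with at least two cards, under a uniform prior pi.
decks = [d for r in range(2, len(FULL_DECK) + 1)
         for d in combinations(FULL_DECK, r)]
pi = {d: Fraction(1, len(decks)) for d in decks}

# xi(Wins) = sum over D of pi{D} * mu_D(Wins).
xi_win = sum(pi[d] * mu_win(d) for d in decks)
print(xi_win)  # 1/2 -- player and dealer are exchangeable, whatever the deck

# The existence query xi(exists c in D : Ace(c)), treating card 1 as the "ace":
xi_ace = sum(pi[d] for d in decks if 1 in d)
print(xi_ace)  # 7/11: 7 of the 11 sub-decks contain the ace
```

That the win probability is exactly 1/2 for every deck (and hence for the mixture) is an artifact of the symmetric toy win rule; with real poker hands the $\mu_D$ would differ across decks.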
<p>We can also compute the probability that our query from above is true:</p>
<div class="kdmath">$$
\xi(\ex c \in \r{D} : \mathrm{Ace}(c)) = \xi\set{D \in \mc{D} \mid \ex c \in \r{D}(D) : \mathrm{Ace}(c)}\,.
$$</div>
<p>where $\r{D} : D \mapsto D$ is the deck random variable, which is just the identity function that returns the input deck $D$.</p>
<h3 id="probabilities-of-probabilities"><a class="header-anchor" href="#probabilities-of-probabilities">Probabilities of probabilities</a></h3>
<p>An interesting question we can ask is the probability that the probability of winning is greater than some constant $p \in [0,1]$. This is a sensible question, since different choices of deck vary my probability of winning. Formally, we write this question as:</p>
<div class="kdmath">$$
\xi(\r{\mu}(\mathrm{Wins}(\r{H_p}, \r{H_d})) > p)\,.
$$</div>
<p>We risk running into notational confusion due to the nested layers of probability measures and their random variables. Conveniently, I gave our outer probability (the model probability) $\xi$ a different symbol from our hypothesis probability $\mu_D$. Now I am making the hypothesis choice itself a random variable, denoted by bold $\r{\mu}$ without the $D$ subscript, with the understanding that $\mu_{(\cdot)}$ is a function of $D$ which returns probability measures over data space $\mc{X}_D$. The random variables $\r{H_p}, \r{H_d}$ are for $\mu_D$, while $\r{\mu}$ is a random variable for $\xi$. This is something that ideally could be distinguished with notation.</p>
<p>The above expression expands out to</p>
<div class="kdmath">$$
\begin{aligned}
& = \xi\set{D \in \mc{D} \mid \r{\mu}_D(\mathrm{Wins}(\r{H_p}, \r{H_d})) > p} \\
& = \xi\set{D \in \mc{D} \mid \r{\mu}_D\set{x \in \mc{X}_D \mid \mathrm{Wins}(\r{H_p}(x), \r{H_d}(x))} > p}\,.
\end{aligned}
$$</div>
<p>Let’s do one more, this time adding in a quantifier. The probability that for all dealer hands, our probability of winning given the dealer hand is greater than $p$. Formally we can write this as:</p>
<div class="kdmath">$$
\begin{aligned}
& \xi(\fa (\_, h_d) \in \r{\mc{X}} : \r{\mu}(\mathrm{Wins}(\r{H_p}, h_d) \mid \r{H_d}=h_d) > p) \\
&\quad = \xi\set{D \in \mc{D} \mid \fa (\_, h_d) \in \r{\mc{X}}_D : \r{\mu}_D(\mathrm{Wins}(\r{H_p}, h_d) \mid \r{H_d}=h_d) > p}
\end{aligned}
$$</div>
<p>$\r{\mu}_D(\mathrm{Wins}(\r{H_p}, h_d) \mid \r{H_d}=h_d)$ expands to $\r{\mu}_D(\mathrm{Wins}(\r{H_p}, h_d))/\mc{Z}$ where $\mc{Z} = \r{\mu}_D(\r{H_d}=h_d) = \r{\mu}_D\set{x \in \r{\mc{X}}_D \mid \r{H_d}(x) = h_d}$ is the normalizing constant.</p>
<p>Putting it all together, we get</p>
<div class="kdmath">$$
\xi\set{D \in \mc{D} \Mid \fa (\_, h_d) \in \r{\mc{X}}_D : \left(\r{\mu}_D\set{x \in \r{\mc{X}}_D \mid \mathrm{Wins}(\r{H_p}(x), h_d)}/\mc{Z} > p\right)}\,.
$$</div>
<h3 id="probability-precision-part-i"><a class="header-anchor" href="#probability-precision-part-i">Probability precision (part I)</a></h3>
<p>Note that we could also ask for the probability that the probability of winning equals $p$,</p>
<div class="kdmath">$$
\xi(\r{\mu}(\mathrm{Wins}(\r{H_p}, \r{H_d})) = p)\,.
$$</div>
<p>Because we only have finitely many hypotheses ($\mc{D}$ is finite), there are only finitely many different winning probabilities, which we can denote as $p_{D;\mathrm{Win}} = \mu_D(\mathrm{Wins}(\r{H_p}, \r{H_d}))$. So the probability $\xi(\r{\mu}(\mathrm{Wins}(\r{H_p}, \r{H_d})) = p_{D;\mathrm{Win}})$ is at least $\pi\set{D}$, and larger if there are other decks $D'$ with the same winning probability. If $p \neq p_{D;\mathrm{Win}}$ for all $D\in\mc{D}$, then $\xi(\r{\mu}(\mathrm{Wins}(\r{H_p}, \r{H_d})) = p) = 0$.</p>
<p>In general, though, probabilities are real valued. The space of possible hypothesis probabilities is limited by the cardinality of the hypothesis and data spaces. That is to say, with finitely many hypotheses and possible outcomes, there are only finitely many possible hypothesis probabilities. This in some sense limits the precision of hypothesis probabilities.</p>
<h3 id="time-series"><a class="header-anchor" href="#time-series">Time-series</a></h3>
<p>In the above example we only considered the first round. If we play multiple rounds and $D^*$ remains fixed, we gain information about $D^*$ over time. We need to think of our observations as sequences of hands. Let $\mc{X}_D$ be the space of all infinite sequences of hands of the form $x = \atup{(h^{(1)}_p, h^{(1)}_d), (h^{(2)}_p, h^{(2)}_d), \ldots}$. As before, $\mu_D$ is a probability measure over $\mc{X}_D$ which we call a hypothesis.</p>
<p>Also as before, we shall put a prior $\pi$ over $\mc{D}$, and define the resulting model $\xi$ (joint probability) on $\Omega = \set{(D, x) \mid D \in\mc{D},\ x\in\mc{X}_D}$.<br />
Let $x_{i:j} = \atup{(h^{(i)}_p, h^{(i)}_d), (h^{(i+1)}_p, h^{(i+1)}_d), \ldots, (h^{(j)}_p, h^{(j)}_d)}$ denote the slice of sequence $x$ from time $i$ to time $j$, and let $\r{X}_{i:j}$ be the random variable mapping $x$ to $x_{i:j}$.</p>
<p>Then we have</p>
<div class="kdmath">$$
\begin{aligned}
&\xi(\r{X}_{1:n} = x_{1:n}) \\
&\quad= \xi\set{(D, y) \in \Omega \mid \r{X}_{1:n}(y) = x_{1:n}} \\
&\quad= \sum_{D\in\mc{D}}\pi\set{D} \mu_D\set{y \in \mc{X}_D \mid \r{X}_{1:n}(y) = x_{1:n}}\,.
\end{aligned}
$$</div>
<p>The <strong>posterior</strong> $\xi(\r{D} = D \mid \r{X}_{1:n} = x_{1:n})$ is our Bayesian distribution over $\mc{D}$ given that we observed $x_{1:n}$.</p>
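<p>A sketch of this posterior computation in a toy version of the setup (integer cards, one-card hands; all values illustrative). The likelihood of one round under deck $D$ is $1/(|D|(|D|-1))$ if both cards are in $D$, else 0, and Bayes’ rule reweights the prior by the product of round likelihoods:</p>

```python
from fractions import Fraction
from itertools import combinations

FULL_DECK = (1, 2, 3, 4)
decks = [d for r in range(2, len(FULL_DECK) + 1)
         for d in combinations(FULL_DECK, r)]
prior = {d: Fraction(1, len(decks)) for d in decks}

def mu_round(deck, hp, hd):
    """mu_D of one round: uniform ordered draw of two distinct cards."""
    if hp in deck and hd in deck and hp != hd:
        return Fraction(1, len(deck) * (len(deck) - 1))
    return Fraction(0)

def posterior(prior, rounds):
    """xi(D | X_{1:n} = rounds), by Bayes' rule: reweight and renormalize."""
    weights = {}
    for d, p in prior.items():
        w = p
        for hp, hd in rounds:
            w *= mu_round(d, hp, hd)
        weights[d] = w
    z = sum(weights.values())
    return {d: w / z for d, w in weights.items()}

# Seeing only the two lowest cards round after round shifts mass toward the
# hypothesis that the dealer stripped the deck down to {1, 2}.
post = posterior(prior, [(1, 2), (2, 1), (1, 2)])
print(post[(1, 2)])  # 216/233, already about 0.93 after three rounds
```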
<h3 id="the-reality-of-probability"><a class="header-anchor" href="#the-reality-of-probability">The reality of probability</a></h3>
<p>Note that if the dealer chose a different deck $D$ every round by sampling i.i.d. from $\mc{D}$, then we learn nothing about $D$ from observations, because $\xi(\r{D} = D \mid \r{X}_{1:n} = x_{1:n}) = \pi\set{D}$ is independent of $\r{X}_{1:n}$. At the same time, if the decks are re-chosen from $\mc{D}$ every round in an i.i.d. fashion, then there reasonably exists a frequentist distribution over $\mc{D}$, i.e. the distribution the dealer uses to sample $D$. If we choose $\pi$ to be something other than that objective distribution, we will ultimately be wrong about our frequency calculations.</p>
<p>On the other hand, if $D^*$ is chosen once and held fixed for all time (as we originally supposed), then it is not clear whether there is an objective distribution on $\mc{D}$, in the sense that different choices of prior $\pi$ result in “correct” or “incorrect” probability calculations.</p>
<p>An interesting question (that I hope to address in some future article) is why the dealer’s choice to sample $D$ i.i.d. from $\mc{D}$ (let’s say using some chance device), vs choosing to not re-draw $D$, should alter whether there is an objective probability distribution on $\mc{D}$. I believe the answer to this question lies in the physical properties of the chance device, i.e. whether we view it as deterministic or inherently random. If the chance device is classical (and thus deterministic), e.g. a coin toss, we might suppose the coin was causally affected by a reservoir of <a href="http://www.scholarpedia.org/article/Algorithmic_randomness">algorithmically random</a> (i.e. incompressible) bits stored somewhere in the universe, in which case the number of coin flips made does not alter the so-called algorithmic probability of each toss. At any rate, this is just a fun aside.</p>
<h1 id="david-chapmans-challenge-problems"><a class="header-anchor" href="#david-chapmans-challenge-problems">David Chapman’s Challenge Problems</a></h1>
<div class="kdmath">$$
\newcommand{\bo}{\mathrm{boojum}}
\newcommand{\sn}{\mathrm{snark}}
\newcommand{\ed}{\mathrm{Edward}}
\newcommand{\thg}{\mathrm{Things}}
\newcommand{\ve}{\mathrm{vertebrate}}
\newcommand{\fr}{\mathrm{father}}
\newcommand{\mon}{\mathrm{Monsters}}
\newcommand{\obs}{\mathrm{observations}}
$$</div>
<p>In his blog post, <a href="https://meaningness.com/probability-and-logic">Probability theory does not extend logic</a>, David Chapman goes through a number of examples which he claims don’t make sense, concluding that Bayesian 1st order logic is dead in the water. I’m now going to take up Chapman’s challenge to make his examples well posed, i.e. syntactically valid (only one way to blindly parse the math) and with a reasonable interpretation of what is going on.</p>
<p>Through his “Challenge” problems, Chapman wants to demonstrate that arbitrary nesting of probabilities mixed with quantifiers does not parse, and that generalities cannot be inferred from instances (at least not with Bayesian inference). By making his challenge problems well posed, I hope to demonstrate that both are possible in the Bayesian framework.</p>
<h2 id="boojums-and-snarks"><a class="header-anchor" href="#boojums-and-snarks">Boojums and snarks</a></h2>
<p>You’ll have to read Chapman’s post for the backstory on these examples (something about CL Dodgson, aka Lewis Carroll). I’m just going to state each example and “solve” them one by one.</p>
<p>Chapman writes,<br />
<!-- ![Excerpt](https://i.imgur.com/az4qvKn.jpg) --></p>
<figure><img src="/assets/posts/bayesian-first-order-logic/boojums-and-snarks.jpg" alt="" width="100%" /><figcaption></figcaption></figure>
<p>This is a warmup. We have</p>
<div class="kdmath">$$
P(\bo\mid\sn)
$$</div>
<p>which is just straightforward probability, except that Chapman does not define any random variables. Let’s go ahead and do that for him. Chapman notes that it would not be agreed upon exactly how to formalize $P(\bo \mid \sn)$, but I think that there is a fairly canonical interpretation.</p>
<p>Clearly we have a 2x2 probability table, where the columns are $\bo$ is true vs false, and the rows are $\sn$ is true vs false. For example (and I’m just filling in arbitrary probabilities):</p>
<table>
<thead>
<tr>
<th> </th>
<th>$\bo$</th>
<th>$\neg\bo$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\sn$</td>
<td>0.4</td>
<td>0.05</td>
</tr>
<tr>
<td>$\neg\sn$</td>
<td>0.1</td>
<td>0.45</td>
</tr>
</tbody>
</table>
<p>Perhaps this is a sub-table, i.e. part of a larger table containing information we are marginalizing over. But no matter what, we have a sample set $\O$, and some process has selected the latent outcome $\o\in\O$. Then $\o$ encodes within it all the information we might want to know, i.e. whether $\bo$ and $\sn$ are respectively true.</p>
<p>Perhaps we are selecting one instance of a <em>thing</em> which may or may not be a snark and may or may not be a boojum. If we assert $P(\bo \mid \sn) = 0.4$, what we are actually saying is that if a snark is drawn, then its probability of being a boojum is always 0.4.</p>
<p>It will help to actually formalize Chapman’s examples. I will use random variables. Here, $\r{\bo},\r{\sn} : \O \to \bool$ become random variables that take in $\o$, which contains information about the <em>thing</em> that the random process drew, and tell us the properties we care about.</p>
<div class="kdmath">$$
\begin{aligned}
P(\r{\bo} \mid \r{\sn}) &= \frac{P(\set{\o\in\O \mid \r{\bo}(\o)} \cap \set{\o\in\O \mid \r{\sn}(\o)})}{P\set{\o\in\O \mid \r{\sn}(\o)}} \\
& = \frac{P\set{\o\in\O \mid \r{\bo}(\o) \and \r{\sn}(\o)}}{P(\r{\sn}^{-1}(\1))} \\
& = P((\r{\bo} \and \r{\sn})^{-1}(\1))/\mc{Z}\,.
\end{aligned}
$$</div>
<p>We are merely adding up all the probability that $\o$ is a snark and boojum (the probability of drawing a <em>thing</em> that is a snark and boojum), and then rescaling by the probability that $\o$ is a snark so that the probability across snark <em>things</em> sums to 1. The important point is that <strong>we are marginalizing away all other information contained in $\o$</strong>. It is certainly possible that some kinds of snarks are likely to be boojums, while other kinds are not. Then we are averaging over those two cases. If you conditioned on extra information that distinguishes these two types of snarks, the probability of being a boojum would go up or down.</p>
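<p>For concreteness, the marginalize-and-rescale computation on the arbitrary table above looks like this in Python (the numbers come from that table, so the resulting conditional happens to be $8/9$ rather than any value of Chapman’s):</p>

```python
from fractions import Fraction

# The arbitrary joint table from above, P(snark, boojum), keyed by truth values.
P = {
    (True,  True):  Fraction(40, 100),
    (True,  False): Fraction(5, 100),
    (False, True):  Fraction(10, 100),
    (False, False): Fraction(45, 100),
}

# Marginalize boojum away to get P(snark), then rescale the joint cell.
p_sn = P[(True, True)] + P[(True, False)]
p_bo_given_sn = P[(True, True)] / p_sn
print(p_bo_given_sn)  # 8/9
```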
<p>So under this interpretation, all properties/predicates are referring to a single instance of a <em>thing</em> drawn randomly by some process. I feel that is the simplest interpretation, though it is certainly not the only one. But let’s run with it.</p>
<p>Next, Chapman wants us to make sense of the statement</p>
<div class="kdmath">$$
\fa x: P(\bo(x)|\sn(x)) = 0.4
$$</div>
<p>We are re-introduced to $\bo$ and $\sn$ as predicates (i.e. Boolean-valued functions) on the domain $\mon$ (apparently boojums and snarks are monsters). Unlike before, we can have two different instances of things that exist simultaneously, $x,y\in\mon$, and then ask about $\bo(x),\bo(y),\sn(x),\sn(y)$. This changes our interpretation. Before, I assumed we selected one thing out of a space of possibilities. Now, we have an entire set of things, $\mon$ (possibly an infinity of them), that we can arbitrarily query simultaneously, e.g. $\bo(x) \and \neg \bo(y) \or (\neg \sn(x) \and \sn(y)) \or \bo(z)$ for some $x,y,z\in\mon$. We can talk about all things all at once using quantifiers:</p>
<div class="kdmath">$$
\fa x \in \mon : \sn(x) \implies \bo(x)\,,
$$</div>
<p>or perhaps</p>
<div class="kdmath">$$
\ex x \in \mon : \sn(x) \and \neg \bo(x)\,.
$$</div>
<p>So what is being randomly drawn here? Clearly not the things (monsters), because they all exist at once. Let’s consider the truth table,</p>
<table>
<thead>
<tr>
<th>Monster</th>
<th>snark</th>
<th>boojum</th>
</tr>
</thead>
<tbody>
<tr>
<td>Edward</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<td>Zachary</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
<tr>
<td>$x_i$</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</tbody>
</table>
<p>This lists out the truth values of the same predicates on every thing in $\mon$. Now imagine that we randomly select one of these (possibly infinitely long) tables, meaning that there is a joint probability distribution over every truth value in the table.</p>
<p>Now we can make sense of Chapman’s example. Let’s rewrite it with random variables:</p>
<div class="kdmath">$$
\fa x \in \mon: \tup{P(\r{\bo}(x)|\r{\sn}(x)) = 0.4}
$$</div>
<p>I’m making the predicates random variables, $\r{\bo},\r{\sn} : \O \to (\mon \to \bool)$, i.e. function-valued functions (see my discussion towards the top of this article if that confuses you). $\r{\bo}$ and $\r{\sn}$ take in a randomly chosen state $\o$ which contains all the truth values in our truth table above ($\O$ contains all possible such tables), and these random variables output a function which takes in a thing and returns a Boolean. Another way to think about it is that $\r{\bo}(\o)$ is a function that lets you retrieve a value from a specific row in the truth table, where the row is identified by a $\mon$ instance which you pass as input. In a sense, $\r{\bo}(\o)$ is a database query function, and $\o$ is the database (think MySQL table).</p>
<p>The statement above unpacks into</p>
<div class="kdmath">$$
\begin{aligned}
& \fa x \in \mon: \tup{P(\r{\bo}(x)|\r{\sn}(x))=0.4} \\
&\quad= \fa x \in \mon: \tup{P(\set{\o\in\O \mid \r{\bo}(\o)(x)} \cap \set{\o\in\O \mid \r{\sn}(\o)(x)})/\mc{Z} = 0.4} \\
&\quad= \fa x \in \mon: \tup{P\set{\o\in\O \mid \r{\bo}(\o)(x) \and \r{\sn}(\o)(x)}/\mc{Z} = 0.4}\\
&\quad= \fa x \in \mon: \tup{P((\r{\bo}(x) \and \r{\sn}(x))^{-1}(\1))/\mc{Z} = 0.4}\,,
\end{aligned}
$$</div>
<p>where $\mc{Z} = P\set{\o\in\O \mid \r{\sn}(\o)(x)}$.</p>
<p>This might be a headache to wrap your head around, but it’s mathematically well formed. This is simply saying that, for every monster, if we randomly pick a truth table on which it’s a snark, the probability that it’s also a boojum on that table is 0.4 [<em>edit: thank you <a href="https://www.lesswrong.com/users/bunthut">bunthut</a> for correcting my description here</em>]. Again we are marginalizing over additional information that might be in the table (e.g. other columns), so this is true on average.</p>
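<p>Here is a small Python sketch of this table-drawing interpretation. I invent a prior over truth tables in which each monster’s (snark, boojum) row is drawn independently with a per-row joint chosen so that the conditional works out to 0.4, and then check the quantified statement by brute force (the monster names and row probabilities are made up):</p>

```python
from fractions import Fraction
from itertools import product
from math import prod

# Per-row joint over (snark, boojum), chosen so P(boojum | snark) = 0.2/0.5 = 0.4.
ROW = {
    (True,  True):  Fraction(20, 100),
    (True,  False): Fraction(30, 100),
    (False, True):  Fraction(10, 100),
    (False, False): Fraction(40, 100),
}
MONSTERS = ("Edward", "Zachary", "Percival")  # illustrative names

# Omega = all truth tables (one row per monster); rows drawn independently,
# so P(table) is the product of its row probabilities.
tables = list(product(ROW, repeat=len(MONSTERS)))
P = {t: prod(ROW[row] for row in t) for t in tables}

def p_bo_given_sn(x):
    """P(boojum(x) | snark(x)), computed by summing over whole tables."""
    i = MONSTERS.index(x)
    p_sn = sum(p for t, p in P.items() if t[i][0])
    p_both = sum(p for t, p in P.items() if t[i][0] and t[i][1])
    return p_both / p_sn

# The quantified statement: forall x, P(boojum(x) | snark(x)) = 0.4.
assert all(p_bo_given_sn(x) == Fraction(2, 5) for x in MONSTERS)
print(p_bo_given_sn("Edward"))  # 2/5
```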
<h2 id="inferring-a-generality-from-instances"><a class="header-anchor" href="#inferring-a-generality-from-instances">Inferring a generality from instances</a></h2>
<p>Chapman wants to know how Bayesian probability lets you infer generalities from specific instances. Suppose we observe</p>
<div class="kdmath">$$
\r{\obs} = \r{\sn}(\ed) \and \r{\bo}(\ed)\,.
$$</div>
<p>Here $\r{\obs}$ is a random variable because it is composed of random variables (we could say it’s a proposition-valued random variable), i.e. it is the function $\r{\obs}(\o) = \r{\sn}(\o)(\ed) \and \r{\bo}(\o)(\ed)$. To condition on the set $\r{\obs}^{-1}(\1) = \set{\o\in\O \mid \r{\obs}(\o)}$ means to perform a domain restriction to all $\o$ encoding truth tables that have the row:</p>
<table>
<thead>
<tr>
<th>Monster</th>
<th>snark</th>
<th>boojum</th>
</tr>
</thead>
<tbody>
<tr>
<td>Edward</td>
<td>True</td>
<td>True</td>
</tr>
</tbody>
</table>
<p>Now we may ask, does the probability that all snarks are boojums go up if we condition on our observation that Edward is a snark and a boojum? Formally,</p>
<div class="kdmath">$$
\begin{aligned}
&P(\fa x \in \mon : \r{\sn}(x) \implies \r{\bo}(x) \mid \r{\obs}) \\
&\quad \overset{?}> P(\fa x \in \mon : \r{\sn}(x) \implies \r{\bo}(x))\,.
\end{aligned}
$$</div>
<p>The short answer is that this depends on $P$. To see exactly how $P$ determines our ability to update our beliefs about generalities, it is instructive to take a detour to talk about coin tossing.</p>
<h2 id="interlude-a-tale-of-infinite-coin-tosses"><a class="header-anchor" href="#interlude-a-tale-of-infinite-coin-tosses">Interlude: A tale of infinite coin tosses</a></h2>
<p>A probability distribution on truth tables can be equivalently viewed as a random process that draws a sequence of binary outcomes. It is customary to call such a process a sequence of coin tosses, though these abstract “coins” are not necessarily independent, and can have arbitrary dependencies between their outcomes.</p>
<p>I want to point out that for finite truth tables, which I will call a <strong>finite domain</strong>, 0th order logic and 1st order logic have equivalent power, because quantifiers can be rewritten as finite conjunctions or disjunctions in the 0th order language, i.e. $\fa x : f(x) = f(x_1) \and \ldots \and f(x_n)$. Chapman agrees that Bayesian inference could be said to “extend” 0th order logic, so this works as expected. Our interest is really in <strong>infinite domains</strong>, i.e. infinitely long truth tables, i.e. infinitely many monsters. I’m going to assume a countably infinite domain, which demonstrates a non-trivial difference between 0th order and 1st order Bayesian inference (beyond countable infinity things get even weirder and harder to work with). Note that in a countably infinite domain we are already working with an uncountable $\O$ (e.g. the set of all infinitely long truth tables or infinite coin tosses).</p>
<p>In finite domains, observing an instance of $f(x_1)$ being true, for some $x_1$, makes you more confident of $\fa x : f(x)$, unless you put 0 marginal prior probability on $f(x_1)$ or 0 prior probability on $\fa x : f(x)$. This is Bayesian 0th order logic.</p>
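<p>A quick finite-domain sanity check in Python (three things, one predicate, and a uniform prior over the 8 truth assignments, all chosen for illustration):</p>

```python
from fractions import Fraction
from itertools import product

# Finite domain of 3 things; Omega = all 8 truth assignments of f, uniform prior.
omegas = list(product([False, True], repeat=3))
P = {w: Fraction(1, len(omegas)) for w in omegas}

def prob(event):
    return sum(p for w, p in P.items() if event(w))

# P(forall x: f(x)), and the same conditioned on observing f(x_1).
# Since (forall x: f(x)) implies f(x_1), the numerator is just P(forall).
p_all = prob(lambda w: all(w))
p_all_given_f1 = p_all / prob(lambda w: w[0])
print(p_all, p_all_given_f1)  # 1/8 1/4 -- one observed instance doubles it
```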
<p>In a countably infinite domain, observing an instance of $f(x_1)$ (or finitely many of them) may or may not change your confidence of $\fa x : f(x)$. It really depends on your prior.</p>
<p>This is analogous to predicting whether an infinite sequence of coin tosses (again I really just mean some random process with binary outputs) will all come up heads based on the observation that the $n$ tosses you observed all came up heads. Remember that the tosses may be causally dependent. Let our sample set be $\O = \B^\infty = \set{0,1}^\infty$, the set of all countably infinite binary sequences (where 1 is a head and 0 is a tail). For $\o \in \O$, let $\o_{1:n}$ denote the finite length-$n$ prefix of $\o$. Given we observe $\o_{1:n}$, what is the probability that $\o = 111111\ldots$ (an infinite sequence of 1s)? Let’s write this formally using random variables:</p>
<div class="kdmath">$$
P(\r{X}_{n+1:\infty} = 111111\ldots \mid \r{X}_{1:n} = 111111\ldots)\,,
$$</div>
<p>where $\r{X}_i : \O \to \B : \o \mapsto \o_i$ is the random variable for the $i$-th toss, and $\r{X}_{i:j} = (\r{X}_i, \r{X}_{i+1}, \ldots, \r{X}_{j-1}, \r{X}_j)$ is the random variable for the sequence of tosses from $i$ to $j$ (inclusive).</p>
<p>Suppose the tosses are independent, with $P(\r{X}_i = 1) = \th$ for all $i \in \N$. Let’s call this a <strong>Bernoulli prior</strong>. (By the way, a perfectly natural use of a quantifier with probability.) Then $P(\r{X}_{1:n} = 111111\ldots) = \th^n$. Furthermore,</p>
<div class="kdmath">$$
P(\r{X}_{1:\infty}=111111\ldots) = \lim_{n\to\infty} P(\r{X}_{1:n} = 111111\ldots) = \lim_{n\to\infty}\th^n\,,
$$</div>
<p>which equals $0$ iff $0 \leq \th < 1$ and equals $1$ iff $\th = 1$. If we model the tosses as being <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">i.i.d.</a>, unless we are certain that all tosses will be heads, we place 0 prior probability on that possibility, even if the probability that any toss will be heads is very close to 1. Of course, the probability of any infinite sequence occurring will also be 0. This would imply that we really cannot say anything about an infinity of outcomes using this prior.</p>
<p>However, we are free to choose some other kind of prior. In fact, an i.i.d. prior, such as the Bernoulli prior, does not update beliefs even on finite outcomes, because for any $i \neq j$,</p>
<div class="kdmath">$$
P(\r{X}_j = 1 \mid \r{X}_i = b_i) = P(\r{X}_j = 1) = \theta\,.
$$</div>
<p>This means, a prior where learning-from-data can take place is necessarily non-i.i.d. w.r.t. the “coin tosses”. That is to say, if we want to be able to learn, we want to choose a $P$ where for most $i$ and $j$,</p>
<div class="kdmath">$$
P(\r{X}_j = 1 \mid \r{X}_i = b_i) \neq P(\r{X}_j = 1)\,.
$$</div>
<p>As a side note, this does not mean the random process in question cannot be i.i.d. Bernoulli. If we consider the <a href="https://tinyheero.github.io/2017/03/08/how-to-bayesian-infer-101.html">uniform mixture of Bernoulli hypotheses</a>,</p>
<div class="kdmath">$$
P(\r{X}_{1:n} = b_{1:n}, \r{\Theta}=\theta) = \theta^{\sum_i b_i}(1-\theta)^{n-\sum_i b_i}\,,
$$</div>
<p>then $P(\r{X}_{1:n} = b_{1:n})$ is the marginal Bayesian data distribution, which is certainly not i.i.d. w.r.t. $\r{X}_1, \ldots, \r{X}_n$ (checking this is left as an exercise to the reader). That is what allows you to grow more confident about the behavior of future outcomes (and thus grow more confident about $\theta$) as you observe more outcomes.</p>
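<p>To make that exercise concrete, here is a small sketch (Python, exact arithmetic; the uniform prior on $\theta$ and the helper name <code class="language-plaintext highlighter-rouge">marginal</code> are my choices, not anything from Chapman) showing the marginal data distribution is not independent:</p>

```python
from fractions import Fraction
from math import comb

def marginal(bits):
    """P(X_{1:n} = bits) under a uniform mixture of Bernoulli hypotheses:
    the integral of theta^k (1-theta)^(n-k) over theta in [0, 1],
    which equals 1 / ((n + 1) * C(n, k)) by a Beta-function identity."""
    n, k = len(bits), sum(bits)
    return Fraction(1, (n + 1) * comb(n, k))

p_second = marginal([1])  # = P(X_2 = 1), by exchangeability of the mixture
p_second_given_first = marginal([1, 1]) / marginal([1])
print(p_second, p_second_given_first)  # 1/2 vs 2/3: not independent
```

Observing one head raises the probability of the next head from 1/2 to 2/3, which is exactly the learning behavior an i.i.d. prior cannot produce.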
<p>Question: What kind of prior would make us more confident that all tosses will be heads from a finite observation? The answer is simple: a prior that puts non-zero probability $\phi$ on infinite heads, i.e.</p>
<div class="kdmath">$$
P(\r{X}_{1:n}=111111\ldots) \xrightarrow{n\to\infty} \phi
$$</div>
<p>for $0 < \phi < 1$. That also means the total probability of all infinite sequences other than all-heads is exactly $1-\phi$. It also means the first few observations of heads are the most informative, and then we get diminishing returns, i.e.</p>
<div class="kdmath">$$
P(\r{X}_{n+1:\infty} = 1_{1:\infty} \mid \r{X}_{1:n} = 1_{1:n}) = \frac{P(\r{X}_{1:\infty} = 1_{1:\infty})}{P(\r{X}_{1:n} = 1_{1:n})} = \frac{\phi}{P(\r{X}_{1:n} = 1_{1:n})}\,.
$$</div>
<p>(Switching to the shorthand $1_{1:\infty} = 111111\ldots$.) Since $P(\r{X}_{1:n} = 1_{1:n})$ converges to $\phi$, $\phi/P(\r{X}_{1:n} = 1_{1:n})$ must converge to 1, and so the probability that the remaining infinite sequence of tosses will all be heads given the first $n$ tosses were heads converges (upwards) to 1 as $n\to\infty$. This makes sense, since if you can infer an infinite sequence from finite data, then the infinite sequence contains finite information. Stated another way, the surprisal (i.e. information gain) $-\lg P(\r{X}_{1:\infty} = 1_{1:\infty}) = \lg\frac{1}{\phi}$ for observing the infinite sequence $1_{1:\infty}$ is finite, and that finite information gain is spread asymptotically across the infinite sequence, i.e. your future information gain diminishes to 0 as you observe more 1s.</p>
<p>Such a prior essentially says that knowledge of infinite heads is finite information to me (or whatever agent uses this prior). That can only be true if I have the privileged knowledge that infinite heads is a likely possibility. If I expect infinite heads (with any probability), then observing finite heads confirms that expectation.</p>
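<p>Such a prior is easy to write down concretely. Here is a sketch (my own construction, with made-up values of $\phi$ and $\theta$) that puts an atom $\phi$ on the all-heads sequence and spreads the remaining mass as i.i.d. fair tosses:</p>

```python
phi, theta = 0.2, 0.5  # assumed values: atom on all-heads, Bernoulli rate

def p_first_n_heads(n):
    # P(X_{1:n} = 1^n) = phi + (1 - phi) * theta^n  (atom + i.i.d. remainder)
    return phi + (1 - phi) * theta ** n

def p_rest_heads_given_n_heads(n):
    # P(X_{n+1:inf} all heads | X_{1:n} all heads) = phi / P(X_{1:n} = 1^n)
    return phi / p_first_n_heads(n)

for n in (0, 5, 20, 60):
    print(n, p_rest_heads_given_n_heads(n))
# the conditional probability climbs toward 1: the first few heads carry
# almost all of the finite surprisal -lg(phi)
```
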
<p>Another way to think about it is, I can only put non-zero probability on countably many infinite sequences out of an uncountable infinity of them (there is one infinite binary sequence for every real between 0 and 1). Any infinite sequence (e.g. all heads) with non-zero probability is being given special treatment.</p>
<p>I think what Chapman is trying to get at is that Bayesian inference is doing something trivial. You can’t get something from nothing. Your prior is doing all the epistemological heavy lifting. Within the confines of a model, you can know that something is true for all $x$, but in the real world you cannot reasonably know something is true in all cases (unless you can prove it logically, but then it would be a consequence of your chosen axioms and definitions). In that case, we might say that a prior that asserts that there is some non-zero probability of getting infinite heads is an unreasonable one, or at least very biased. Removing such priors then means we cannot infer an infinite generality from specific instances.</p>
<p>To reiterate, Bayesian inference just tells you what you already claim to know: “hey your prior says you can infer a generality from specific instances, so I’ll update your beliefs accordingly” vs “your prior models infinitely many instances of $f(x)$ as being independent, so I cannot update your belief about the generality from finitely many instances”. People who view Bayesian inference as the end-all be-all to epistemology are hiding the big open problems in their choice of prior. Chapman appears to be suggesting (and I agree) that the hard problems of epistemology have never really been solved, and we should acknowledge that.</p>
<!--
TODO: actually such a prior can make sense. suppose you know the process producing your coins is either deterministic or random. The Bernoulli mixture example is actually a version of this. Your posterior will put more probability on $\theta=1$ if you observe only heads.
-->
<p>Getting back to boojums and snarks, it should be straightforward to see how $P(\fa x \in \mon : \r{\sn}(x) \implies \r{\bo}(x) \mid \r{\obs})$ would work. Supposing $\mon$ is an infinite set, then if our prior puts some non-zero probability on the infinite tables where $\fa x \in \mon : \sn(x) \implies \bo(x)$ is true, then observing that finitely many observations conform to this hypothesis should make our conditional probability go up. On the other hand, if we don’t use such an epistemically strong prior, then the probability of observing an infinite table filled with “True” is 0, the same as the probability of observing any individual infinite table. Zero-probability outcomes can never become non-zero just by conditioning on something.</p>
<h2 id="probability-inside-a-quantifier-inside-a-probability"><a class="header-anchor" href="#probability-inside-a-quantifier-inside-a-probability">Probability inside a quantifier inside a probability</a></h2>
<p>Chapman writes,</p>
<!-- ![](https://i.imgur.com/RCaN9Kz.jpg) -->
<figure><img src="/assets/posts/bayesian-first-order-logic/probability-inside-quantifier-inside-probability.jpg" alt="" width="100%" /><figcaption></figcaption></figure>
<p>Why are we not sure about whether “∀x: P(boojum(x)|snark(x)) = 0.4” is true? If we’ve fully defined $\O$ and $P$ and our random variables, $\r{\bo}$ and $\r{\sn}$, then we should be able to determine whether this is true, barring undecidability or computational intractability, which I’ll classify as logical uncertainty.</p>
<p>I think Chapman is asking: how can observing that Edward is a snark and a boojum increase our probability that any snark is also a boojum? I agree that the example Chapman gives wouldn’t accomplish this, but let me construct a similar expression that would:</p>
<div class="kdmath">$$
\begin{aligned}
&\fa x \in\mon\setminus\set{\ed} : \\
&\qquad P(\r{\bo}(x) \mid \r{\sn}(x) \and \r{\sn}(\ed) \and \r{\bo}(\ed)) \\
&\qquad \overset{?}> P(\r{\bo}(x) \mid \r{\sn}(x))\,.
\end{aligned}
$$</div>
<p>Whether this is true will depend on what kind of prior $P$ we choose (see the coin tossing discussion above).</p>
<p>Chapman also claims that nesting a probability inside a quantifier inside a probability, as he does in his example, cannot work. I agree with Chapman that nesting $P$ inside $P$ generally doesn’t make sense, though technically his example will parse to <em>something</em> if we make the usual random variable replacements I’ve been making above.</p>
<p>However, nesting <strong>different</strong> measures can make sense. Suppose we didn’t want to settle for one $P$, and instead we had a set $\mc{M}$ of measures $\mu$ on $\O$. In the Bernoulli coin toss example above, $\mc{M}$ is the set of all i.i.d. Bernoulli measures on infinite coin tosses. To make a prediction, we average the predictions across $\mu\in\mc{M}$ according to our <strong>hypothesis prior</strong>, which is a measure $\pi$ on $\mc{M}$. Let’s define $P$ as the derived measure on $\mc{M}\times\O$, i.e.</p>
<div class="kdmath">$$
P(\r{\mu} = \mu, \r{\o} = \o) = \pi\set{\mu}\cdot \mu\set{\o}\,.
$$</div>
<p>(Technically this expression would often be 0 since $\O$ is uncountable, but hopefully you get the idea.)</p>
<p>The expression of interest becomes</p>
<div class="kdmath">$$
P\tup{\fa x \in\mon : \r{\mu}(\r{\bo}(x) \mid \r{\sn}(x)) = 0.4 \Mid \r{\sn}(\ed) \and \r{\bo}(\ed)}
$$</div>
<p>where $\r{\mu}: \mc{M}\times\O \to \mc{M} : (\mu, \o) \mapsto \mu$ is a measure-valued random variable, and $\r{\bo},\r{\sn} : \O \to (\mon \to \bool)$ are the usual predicate-valued random variables on $\O$; that is, they are random variables for the inner measure $\r{\mu}$.</p>
<p>In words, this expression is asking for the $P$-probability of all $(\mu,\o) \in \mc{M}\times\O$ s.t. $\fa x \in\mon : \mu(\r{\bo}(x) \mid \r{\sn}(x)) = 0.4$ holds, restricted to truth tables $\o$ where $\r{\sn}(\o)(\ed) \and \r{\bo}(\o)(\ed)$ is true.</p>
<p>This is actually a reasonable query to make. Going back to the Bernoulli coin tossing example, if we observe the first coin toss is heads ($b_1 = 1$), that will put more posterior probability on hypotheses $\theta$ where 1 is more likely than 0, i.e.</p>
<div class="kdmath">$$
P(\r{\Theta} > 0.5 \mid \r{X}_1 = 1) > P(\r{\Theta} > 0.5)\,.
$$</div>
<p>In general we can ask some arbitrary question about the posterior probability (after making an observation) of some hypotheses satisfying a query, i.e.</p>
<div class="kdmath">$$
P(\mathrm{predicate}(\r{\Theta}) \mid \r{X}_1 = 1)\,,
$$</div>
<p>where $\mathrm{predicate} : \Theta \to \bool$ is some Boolean-valued function of Bernoulli parameter $\theta$. That predicate can involve a quantifier and the data probability under hypothesis $\theta$. For example, we can reenact the same example above:</p>
<div class="kdmath">$$
P(\fa i \in \set{2,3,\ldots} : P(\r{X}_i = 1 \mid \r{\Theta}) > 0.5 \mid \r{X}_1 = 1)\,.
$$</div>
<p>This is the $P$-probability of all $(\theta, b_{1:\infty}) \in \Theta\times\B^\infty$ s.t. the probability of every future outcome under the $\theta$-hypothesis is greater than 0.5, restricted to coin toss sequences $b_{1:\infty}$ where $b_1 = 1$. I do expect this probability to be larger than $P(\fa i \in \set{2,3,\ldots} : P(\r{X}_i = 1 \mid \r{\Theta}) > 0.5)$ (not conditioning on data) if I expect that hypotheses where the probability of heads is greater than a half (i.e. $P(\r{X}_i = 1 \mid \r{\Theta}=\theta) > 0.5$) will get upweighted by observing a head.</p>
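<p>Under the uniform Bernoulli mixture, both versions of this query can be checked in closed form, because the event $\fa i : P(\r{X}_i = 1 \mid \r{\Theta}) > 0.5$ is just the event $\r{\Theta} > 0.5$. A sketch (assuming the uniform prior on $\theta$, which makes the posterior after one head a Beta(2, 1)):</p>

```python
from fractions import Fraction

# Uniform prior on Theta; observing X_1 = 1 gives posterior density
# 2*theta on [0, 1] (a Beta(2, 1)), whose CDF is theta^2.
def prior_above(t):
    return 1 - Fraction(t)        # P(Theta > t)

def posterior_above(t):
    return 1 - Fraction(t) ** 2   # P(Theta > t | X_1 = 1)

half = Fraction(1, 2)
# The quantified event {forall i >= 2 : P(X_i = 1 | Theta) > 0.5} is
# exactly {Theta > 1/2}, so both queries reduce to these two numbers:
print(prior_above(half), posterior_above(half))  # 1/2 -> 3/4
```

So conditioning on one observed head does raise the probability of the quantified statement, from 1/2 to 3/4, exactly as anticipated above.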
<h3 id="probability-precision-part-ii"><a class="header-anchor" href="#probability-precision-part-ii">Probability precision (part II)</a></h3>
<p>Quick side note regarding the question</p>
<blockquote>
<p>Suppose the probability is actually 0.400001, not 0.4? Does that make the statement false?</p>
</blockquote>
<p>Yes. This “paradox” of probability on continuous sets is straight out of probability 101. The probability of drawing any particular real number uniformly from the unit interval is 0. What you actually care about is the probability of drawing a real number in some subinterval. In this case,</p>
<div class="kdmath">$$
P(\r{\bo}(x) \mid \r{\sn}(x)) \in (0.4-\vep, 0.4+\vep)
$$</div>
<p>is acceptable, and we can choose whatever precision $\vep$ we want.</p>
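<p>Numerically this is the familiar point-versus-interval distinction. A sketch for a uniform draw on the unit interval (my stand-in for whatever continuous quantity carries the probability value):</p>

```python
from fractions import Fraction

def p_in_interval(lo, hi):
    # For a Uniform(0, 1) draw, the probability of landing in an
    # interval is just its length (clipped to [0, 1]).
    lo, hi = max(Fraction(lo), Fraction(0)), min(Fraction(hi), Fraction(1))
    return max(hi - lo, Fraction(0))

point = Fraction(2, 5)                          # "exactly 0.4"
eps = Fraction(1, 1000000)                      # chosen precision epsilon
print(p_in_interval(point, point))              # 0: exact values have probability 0
print(p_in_interval(point - eps, point + eps))  # 2*eps: small but positive
```
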
<h2 id="final-boss"><a class="header-anchor" href="#final-boss">Final boss</a></h2>
<div class="kdmath">$$
\newcommand{\eql}{\mathrm{equal}}
\newcommand{\lts}{\mathrm{looksTheSame}}
\newcommand{\dna}{\mathrm{DNA\_match}}
\newcommand{\obs}{\mathrm{observations}}
\newcommand{\ar}{\mathrm{Arthur}}
\newcommand{\ha}{\mathrm{Harold}}
$$</div>
<p>I believe at this point I’ve covered everything Chapman points out as being ill-posed, namely, arbitrary nesting of probability expressions and quantifiers, and “learning” about generalities from specific instances.</p>
<p>Chapman gives a more involved example that I think is more of the same idea, but I will go through it just for fun. Chapman writes,</p>
<!-- ![](https://i.imgur.com/AVh3JBe.jpg) -->
<figure><img src="/assets/posts/bayesian-first-order-logic/father-relation.jpg" alt="" width="100%" /><figcaption></figcaption></figure>
<!-- ![](https://i.imgur.com/hPWGOqQ.jpg) -->
<figure><img src="/assets/posts/bayesian-first-order-logic/father-evidence.jpg" alt="" width="100%" /><figcaption></figcaption></figure>
<p>Let’s run with the same interpretation as before: we are randomly drawing a truth table where the rows are instances of $\mon$ and the columns are predicates.</p>
<p>It is not clear whether Chapman is asserting, or querying, that the following is true:</p>
<div class="kdmath">$$
\fa x : \ve(x)\implies (\ex y : \fr(y,x) \and (\fa z : \fr(z,x) \implies z=y))\,,
$$</div>
<p>i.e. is Chapman telling us that this is true, or is he querying whether it is true? If we take the predicates here to be columns in a truth table, then this proposition’s truth value depends on the entries of the truth table. Technically we need two tables, one for single-argument properties and one for two-argument properties:</p>
<table>
<thead>
<tr>
<th>Monsters</th>
<th>vertebrate</th>
</tr>
</thead>
<tbody>
<tr>
<td>$x_1$</td>
<td>True</td>
</tr>
<tr>
<td>$x_2$</td>
<td>False</td>
</tr>
<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</tbody>
</table>
<p>and</p>
<table>
<thead>
<tr>
<th>Argument-1</th>
<th>Argument-2</th>
<th>father</th>
</tr>
</thead>
<tbody>
<tr>
<td>$x_1$</td>
<td>$y_1$</td>
<td>True</td>
</tr>
<tr>
<td>$x_1$</td>
<td>$y_2$</td>
<td>False</td>
</tr>
<tr>
<td>$x_2$</td>
<td>$y_1$</td>
<td>False</td>
</tr>
<tr>
<td>$x_2$</td>
<td>$y_2$</td>
<td>True</td>
</tr>
<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</tbody>
</table>
<p>Rewriting with random variables, we have:</p>
<div class="kdmath">$$
\begin{aligned}
& \fa x \in \mon : \\
&\qquad \r{\ve}(x)\implies \\
&\qquad\qquad \ex y \in \mon: \r{\fr}(y,x) \and (\fa z \in \mon: \r{\fr}(z,x) \implies z=y)\,.
\end{aligned}
$$</div>
<p>Again, our random variables are function-valued, $\r{\ve} : \O \to (\mon \to \bool)$ and $\r{\fr} : \O \to (\mon^2 \to \bool)$. Each $\o\in\O$ contains the two tables above (and maybe other information), and $\O$ contains all such combinations of tables.</p>
<p>Now Chapman asks how observing specific data can update the truth values of specific propositions. He supposes that “we sequence DNA from some monsters and find that it sure looks like Arthur and Harold are both fathers of Edward.”</p>
<p>Chapman wants to evaluate the following:</p>
<!-- ![](https://i.imgur.com/ScUsQlW.png) -->
<figure><img src="/assets/posts/bayesian-first-order-logic/father-evidence-2.png" alt="" width="100%" /><figcaption></figcaption></figure>
<p>But Chapman doesn’t provide the formal definition of <code class="language-plaintext highlighter-rouge">experiment</code> or <code class="language-plaintext highlighter-rouge">observation</code>. From context clues, I will infer reasonable definitions.</p>
<p>First, we need a way to talk about DNA matches. I’ll invoke a random variable $\r{\dna} : \O \to (\mon^2 \to \bool)$ which is the outcome of a DNA experiment (match or no match) run on $x$ and $y$.</p>
<p>Next, the last expression $P(\ar = \ha \mid \obs)$ implies that one monster can go by two different names. If we regard the set $\mon$ to be a set of names, not the identities of monsters, then we can have uncertainty about the equality of two different names. I’ll invoke the random variable $\r{\eql} : \O \to (\mon^2 \to \bool)$ that returns true if the given names are actually the same monster.</p>
<p>Now the two-argument truth table contains the corresponding extra columns (remember that $\o$ contains an entire such truth table and the one-argument table above):</p>
<table>
<thead>
<tr>
<th>Argument-1</th>
<th>Argument-2</th>
<th>father</th>
<th>equal</th>
<th>DNA_match</th>
</tr>
</thead>
<tbody>
<tr>
<td>$x_1$</td>
<td>$y_1$</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<td>$x_1$</td>
<td>$y_2$</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<td>$x_2$</td>
<td>$y_1$</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<td>$x_2$</td>
<td>$y_2$</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td> </td>
</tr>
</tbody>
</table>
<p>We can write our data as follows:</p>
<!--
$$
\r{\mathrm{data}} = \r{\dna}(\ar, \ed) \and \r{\dna}(\ha, \ed) \and \neg\r{\lts}(\ar, \ha)
$$
-->
<div class="kdmath">$$
\r{\obs} = \r{\dna}(\ar, \ed) \and \r{\dna}(\ha, \ed)\,.
$$</div>
<p>Note that $\r{\obs} : \O \to \bool$ is itself a random variable, being composed of random variables.</p>
<p>Then we can rewrite Chapman’s expressions:</p>
<div class="kdmath">$$
\begin{aligned}
&P(\r{\fr}(\ar, \ed) \mid \r{\dna}(\ar, \ed)) &= 0.99 \\
&P(\r{\fr}(\ha, \ed) \mid \r{\dna}(\ha, \ed)) &= 0.99 \\
&P(\r{\eql}(\ar, \ha) \mid \r{\obs}) &= 0.01
\end{aligned}
$$</div>
<p>I replaced $\ar = \ha$ with $\r{\eql}(\ar, \ha)$, our monster equality random variable (to distinguish from name equality).</p>
<p>From top to bottom, $P(\r{\fr}(\ar, \ed) \mid \r{\dna}(\ar, \ed)) = 0.99$ says that in 99% of truth tables (as measured by $P$; my language here doesn’t imply $P$ has to be uniform, but the idea of a measure is that it tells you how much of something you have) where $\ar$ and $\ed$ DNA match, $\ar$ is $\ed$’s father. Likewise for the middle expression. The last expression, $P(\r{\eql}(\ar, \ha) \mid \r{\obs}) = 0.01$, says that for only 1% of truth tables (as measured by $P$) where $\ar$ and $\ha$ both DNA match $\ed$ and $\ha$, it is the case that $\ed$ and $\ha$ are actually distinct monsters ($\ed$ has two fathers). Why is that? Would the DNA tests be likely to be wrong in that scenario?</p>
<p>I think Chapman actually wanted to say that the probability of our observations is low,</p>
<div class="kdmath">$$
P(\r{\obs}) = 0.01\,,
$$</div>
<p>but given this observation the probability that Edward has two fathers is high, which allows us to update our belief about vertebrates having two fathers.</p>
<p>Chapman wants to know, is the following meaningful?</p>
<div class="kdmath">$$
\begin{aligned}
& P(\fa x \in \mon : \\
&\qquad \r{\ve}(x)\implies \\
&\qquad\qquad \ex y \in \mon: \r{\fr}(y,x) \and (\fa z \in \mon: \r{\fr}(z,x) \implies \r{\eql}(z, y)) \\
&\quad\mid \r{\obs})
\end{aligned}
$$</div>
<p>Does conditioning on $\r{\obs}$ change the probability that all vertebrates have one father? Are we magically somehow inferring a generality from specific instances, just by virtue of the machinery of probability theory and set theory? While this monstrosity of an equation (pun intended) may be syntactically correct, how do we interpret what it’s doing?</p>
<p>This example is fundamentally the same as the infinite coin tossing example. If we choose a prior that puts non-zero probability on monsters having two fathers, then I expect that updating on observations which are supported by that hypothesis causes the 2-father hypothesis to become more likely. If you choose a prior where no monster can have two fathers, then you cannot update yourself out of that assumption.</p>
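<p>A minimal discrete sketch of that point, with made-up numbers (none of these come from Chapman): a prior that merely allows the two-father hypothesis gets pulled toward it by the DNA observations, while a prior that forbids it stays put:</p>

```python
def posterior(prior, likelihood):
    # Bayes' rule over a finite hypothesis set.
    z = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / z for h in prior}

# Assumed likelihood of "two distinct monsters both DNA-match Edward":
lik = {"two_fathers_possible": 0.5, "two_fathers_impossible": 0.01}

open_prior = {"two_fathers_possible": 0.1, "two_fathers_impossible": 0.9}
closed_prior = {"two_fathers_possible": 0.0, "two_fathers_impossible": 1.0}

print(posterior(open_prior, lik))    # the possible-hypothesis jumps to ~0.85
print(posterior(closed_prior, lik))  # stays at 0: cannot update out of it
```
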
<p>Something I think Chapman is trying to get at, is that you cannot update your beliefs on data within the Bayesian framework without first specifying how your beliefs interact with data. Given any Bayesian probability distribution, you cannot pop out a level of abstraction and start updating that Bayesian model itself on data, unless you had the foresight to nest your Bayesian model inside an even larger meta-model that you also defined up front. I’m merely demonstrating that you can define the meta-model if you want to. I do acknowledge that we have a problem of infinite regress, and you need to keep creating ever more elaborate meta-models whenever you want to explain how a rational person should update their ontology given their experience (or you can use Solomonoff induction, i.e. Bayesian inference on all (semi)computable hypotheses).</p>
Sun, 21 Feb 2021 00:00:00 -0800
danabo.github.io/zhat/articles/bayesian-first-order-logic
Primer to Probability Theory and Its Philosophy<p>Probability is a measure defined on events, which are sets of primitive outcomes. Probability theory mostly comes down to constructing events and measuring them. A measure is a generalization of size which corresponds to length, area, and volume (rather than the bijective mapping definition of cardinality).</p>
<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#definitions" id="markdown-toc-definitions">Definitions</a> <ul>
<li><a href="#beginner" id="markdown-toc-beginner">Beginner</a></li>
<li><a href="#full-definition" id="markdown-toc-full-definition">Full Definition</a></li>
<li><a href="#kolmogorov-axioms-of-probability" id="markdown-toc-kolmogorov-axioms-of-probability">Kolmogorov axioms of probability</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a></li>
<li><a href="#pmfs-and-pdfs-and-measures-oh-my" id="markdown-toc-pmfs-and-pdfs-and-measures-oh-my">PMFs and PDFs and measures, oh my!</a></li>
<li><a href="#events-vs-samples" id="markdown-toc-events-vs-samples">Events vs samples</a></li>
</ul>
</li>
<li><a href="#constructing-events" id="markdown-toc-constructing-events">Constructing events</a> <ul>
<li><a href="#random-variables" id="markdown-toc-random-variables">Random variables</a> <ul>
<li><a href="#motivation-1-information-hiding" id="markdown-toc-motivation-1-information-hiding">Motivation 1: Information hiding</a> <ul>
<li><a href="#examples-1" id="markdown-toc-examples-1">Examples</a></li>
</ul>
</li>
<li><a href="#motivation-2-syntactic-sugar" id="markdown-toc-motivation-2-syntactic-sugar">Motivation 2: Syntactic sugar</a> <ul>
<li><a href="#probability-distribution-of-a-random-variable" id="markdown-toc-probability-distribution-of-a-random-variable">Probability distribution of a random variable</a></li>
<li><a href="#notational-confusion" id="markdown-toc-notational-confusion">Notational confusion</a></li>
</ul>
</li>
<li><a href="#motivation-3-construct-events-that-are-guaranteed-measurable" id="markdown-toc-motivation-3-construct-events-that-are-guaranteed-measurable">Motivation 3: Construct events that are guaranteed measurable</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#almost-surely" id="markdown-toc-almost-surely">Almost surely</a> <ul>
<li><a href="#throwing-darts" id="markdown-toc-throwing-darts">Throwing darts</a></li>
<li><a href="#borels-law-of-large-numbers" id="markdown-toc-borels-law-of-large-numbers">Borel’s law of large numbers</a></li>
</ul>
</li>
<li><a href="#primer-to-measure-theory" id="markdown-toc-primer-to-measure-theory">Primer to measure theory</a></li>
</ul>
<div class="kdmath">$$
\newcommand{\bin}{\mathbb{B}}
\newcommand{\nat}{\mathbb{N}}
\newcommand{\real}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\d}{\mathrm{d}}
\newcommand{\len}[1]{\ell\left(#1\right)}
\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\bigmid}{\;\middle\vert\;}
$$</div>
<p>Sections:</p>
<ol>
<li><a href="#definitions">Definitions</a> - explain the definition of probability.</li>
<li><a href="#constructing-events">Constructing event</a> - explain random variable notation.</li>
<li><a href="#almost-surely">Almost surely</a> - a philosophical excursion into the interpretation of probability.</li>
<li><a href="#primer-to-measure-theory">Primer to measure theory</a> - a brief introduction to measure theory.</li>
</ol>
<p>Main references:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Probability_axioms#Axioms">https://en.wikipedia.org/wiki/Probability_axioms#Axioms</a></li>
<li><a href="https://en.wikipedia.org/wiki/Measure_space">https://en.wikipedia.org/wiki/Measure_space</a></li>
<li><a href="https://en.wikipedia.org/wiki/Random_variable#Measure-theoretic_definition">https://en.wikipedia.org/wiki/Random_variable#Measure-theoretic_definition</a></li>
<li><a href="http://statweb.stanford.edu/~souravc/stat310a-lecture-notes.pdf">http://statweb.stanford.edu/~souravc/stat310a-lecture-notes.pdf</a></li>
<li><a href="https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf">https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf</a></li>
</ul>
<p>The first half of this article is ostensibly devoted to explaining the definition of probability, but that is not my priority. I’m most interested in providing a useful conceptual map, asking and discussing interesting questions, and developing intuition. I provide many links to technical details and further readings. My opening exposition on definitions is brief. If it does not all make sense, please look at other resources. Hopefully this article at least makes those other sources easier to use.</p>
<p>This post is also a pedagogical experiment. I structured this article to be read twice. The first pass is without measure theory, and the second pass is with measure theory. Measure-theory content is hidden by default, e.g. <span class="advanced outer hidden"><span class="advanced inner hidden">like this</span></span>. Simply ignore <span class="advanced outer hidden"><span class="advanced inner hidden">purple text</span></span> the <span class="marginnote-outer"><span class="marginnote-ref">first time</span><label for="796b5f2304883e2b8a1737199bea4fcbd95c2af7" class="margin-toggle"> ⊕</label><input type="checkbox" id="796b5f2304883e2b8a1737199bea4fcbd95c2af7" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Unless you are already acquainted with measure theory, but then you can just look at <a href="https://en.wikipedia.org/wiki/Probability_axioms#Axioms">Wikipedia’s definition of probability</a> to understand the gist of probability theory.</span></span></span> you read this post. Then in the <a href="#primer-to-measure-theory">measure theory section</a> at the end of this post you will see a button to show all the hidden text (and you can just click on <span class="advanced outer hidden"><span class="advanced inner hidden">purple text</span></span> anywhere in the post to show it).</p>
<p>Why? Because I see plenty of introductions to probability that leave out measure theory entirely. The problem with them is that a lot of the common probability notation, e.g. random variables, only really makes sense when you understand measures. On the other hand, if you crack open a rigorous text on probability theory (e.g. <a href="https://www.goodreads.com/book/show/383472.Statistical_Inference">Casella & Berger</a> or <a href="https://www.springer.com/gp/book/9780387953823">Shao</a>), it may not be obvious why all this extra complexity with events, sigma-algebras and measure spaces is necessary.</p>
<p>When learning about probability and measure theory myself, I wish I had a resource that both provides precise definitions and intuitions for why these definitions are the way they are, and without wading through a lot of extraneous details. I don’t know if I’ve succeeded in that here, but this is my attempt.</p>
<h1 id="definitions"><a class="header-anchor" href="#definitions">Definitions</a></h1>
<h2 id="beginner"><a class="header-anchor" href="#beginner">Beginner</a></h2>
<p>The full definition of probability is below, but to avoid overwhelm, you may first look at this <em>attempt</em> at defining probability. Many people intuitively think of probability this way. Notably, I’ve left out the event space.</p>
<p><strong>Sample set</strong> $\Omega$ is a set of all possible <span class="marginnote-outer"><span class="marginnote-ref">samples</span><label for="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle"> ⊕</label><input type="checkbox" id="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Sample is synonymous with <a href="https://en.wikipedia.org/wiki/Outcome_(probability)">outcome</a>.</span></span></span> $\omega\in\Omega$. A sample is a possible state of the world, e.g. the outcomes for all coins that will be tossed or all dice that will be thrown, or the ordering of cards in a deck.</p>
<p><strong>Probability function</strong> <span class="marginnote-outer"><span class="marginnote-ref">$P : 2^\Omega \to [0, 1]$</span><label for="0574240032f5e0595cbf5977a6d3da50278d4847" class="margin-toggle"> ⊕</label><input type="checkbox" id="0574240032f5e0595cbf5977a6d3da50278d4847" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">$2^\Omega$ is the <a href="https://en.wikipedia.org/wiki/Power_set">power set</a> of $\Omega$. The notation $2^{(\cdot)}$ is just a shorthand, though set exponentiation could be <a href="https://math.stackexchange.com/a/901742">defined in general</a>, e.g. $A^B$ is the set of all functions $f : B \to A$, and $n^A$, where $n$ is a natural number, generates the set of all $n$-ary indicator functions <span class="kdmath">$f : A \to \{0, 1, 2, \ldots, n-1\}$</span>. Then $2^A$ gives us all indicator functions <span class="kdmath">$A \to \{0,1\}$</span>, which select the elements of every subset of $A$.</span></span></span> gives the probability of a <span class="marginnote-outer"><span class="marginnote-ref">set of samples</span><label for="ad8b3af703fafd99caa4527f4b15b75a1975d039" class="margin-toggle"> ⊕</label><input type="checkbox" id="ad8b3af703fafd99caa4527f4b15b75a1975d039" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">subset of $\Omega$</span></span></span>. A set of samples $\{\omega_1, \omega_2, \ldots\}$ is called an <strong>event</strong>, which is a set of possible states the world could be in, read as “$\omega_1$ is the case or $\omega_2$ is the case, etc. …”</p>
<p>$P$ satisfies:</p>
<ul>
<li><strong>Non-negativity</strong>: <span class="kdmath">$P(e) \geq 0,\ \forall e \in 2^\Omega$</span>.</li>
<li><strong>Null empty set</strong>: <span class="kdmath">$P(\emptyset) = 0$</span>.</li>
<li><strong>Unit sample set</strong>: <span class="kdmath">$P(\Omega) = 1$</span>.</li>
<li><strong>Additivity</strong>: For all disjoint events <span class="kdmath">$e_1, e_2 \in 2^\Omega,\ P(e_1 \cup e_2) = P(e_1) + P(e_2)$</span>.</li>
</ul>
<p>The probability of a single sample (outcome) $\omega\in\Omega$ is <span class="kdmath">$P(\{\omega\})$</span>.</p>
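<p>The beginner definition is easy to exercise on a finite example. A sketch with a fair six-sided die, where every subset of $\Omega$ is an event:</p>

```python
from fractions import Fraction

# Beginner setup: finite sample set, P defined on the full power set.
omega = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    assert event <= omega, "events must be subsets of the sample set"
    return Fraction(len(event), len(omega))

print(P(frozenset()))   # null empty set: 0
print(P(omega))         # unit sample set: 1
print(P({2, 4, 6}))     # the event "roll is even": 1/2
print(P({1}) + P({2, 3}) == P({1, 2, 3}))  # additivity on disjoint events
```

On a finite $\Omega$ the event-space subtlety never arises, which is why this definition feels sufficient until uncountable sample sets show up below.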
<h2 id="full-definition"><a class="header-anchor" href="#full-definition">Full Definition</a></h2>
<p>The beginner definition above does not define an event space. This is actually a problem when working with uncountable sample spaces, because not all subsets of an uncountable space can be measured. If that statement confuses you, don’t worry about it and read through this post. Then read my <a href="#primer-to-measure-theory">primer to measure theory</a> at the end which outlines why not every set can be measured. Though this may seem like a minor technicality, specifying what sets can be measured allows probability theory to be <span class="marginnote-outer"><span class="marginnote-ref">a lot more general than it otherwise could be,</span><label for="a0359426bd6a552c43a72d1b7b5e568b41b95f0d" class="margin-toggle"> ⊕</label><input type="checkbox" id="a0359426bd6a552c43a72d1b7b5e568b41b95f0d" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">This is Kolmogorov’s achievement. Definitions of probability like my beginner definition had been around for hundreds of years prior.</span></span></span> specifically when dealing with real numbers.</p>
<p>Here is a compact but complete definition of probability:</p>
<ul>
<li><strong>Sample set</strong> $\Omega$ is a set of all possible <span class="marginnote-outer"><span class="marginnote-ref">samples</span><label for="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle"> ⊕</label><input type="checkbox" id="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Sample is synonymous with <a href="https://en.wikipedia.org/wiki/Outcome_(probability)">outcome</a>.</span></span></span>.
<ul>
<li><strong>Sample</strong> $\omega \in \Omega$ (i.e. primitive outcome) is a possible state of the world. Samples are disjoint, meaning only one sample can be the case at a time. Samples can be any kind of mathematical object.</li>
</ul>
</li>
<li><strong>Event space</strong> $E \subseteq 2^\Omega$ is the set of subsets of $\Omega$ for which we are <span class="marginnote-outer"><span class="marginnote-ref">allowed to assign probability.</span><label for="f8b43cebd8d878851c96f6b83253241ae0914c18" class="margin-toggle"> ⊕</label><input type="checkbox" id="f8b43cebd8d878851c96f6b83253241ae0914c18" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The sets omitted from $E$ are not measurable, but again if you are not familiar with measure theory don’t worry about why some sets cannot be measured until the end of this post.</span></span></span> We require that $\emptyset, \Omega \in E$ <span class="advanced outer hidden"><span class="advanced inner hidden">and $E$ is required to be a <a href="#sigma-algebra">$\sigma$-algebra</a> that contains the measurable subsets of $\Omega$. The tuple $(\Omega, E)$ is a <a href="https://en.wikipedia.org/wiki/Measurable_space">measurable space</a>.</span></span>
<ul>
<li><strong>Event</strong> $e \in E$ is a <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> set of samples. Samples $\omega \in e$ are <span class="marginnote-outer"><span class="marginnote-ref">considered identical</span><label for="b69039dc8767a7b19268d3134cd1eccd7a91f7de" class="margin-toggle"> ⊕</label><input type="checkbox" id="b69039dc8767a7b19268d3134cd1eccd7a91f7de" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Different samples in $\Omega$ are indeed distinct objects, but their difference does not matter in the context of event $e$.</span></span></span> w.r.t. $e$.</li>
</ul>
</li>
<li><strong>Probability measure</strong> <span class="marginnote-outer"><span class="marginnote-ref">$P : E \to [0, 1]$</span><label for="4ad9ab16708369ac059ab042be9927eff00636f9" class="margin-toggle"> ⊕</label><input type="checkbox" id="4ad9ab16708369ac059ab042be9927eff00636f9" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In general a measure $Q : E \to \real_{\geq 0}$, but I’m including the restriction of the co-domain to the unit interval $[0, 1]$ in the definition of <span class="kdmath">$P$</span>, because we are only talking about probability measures here, and there’s no reason to be more general.</span></span></span> is a function that maps allowed subsets of $\Omega$ to the real unit interval. $P$ is a <strong>measure</strong>, which means it satisfies certain properties that make it behave analogous to length, area, volume, etc. in Euclidean space. Essentially, a measure is a generalization of size that satisfies the following properties:
<ul>
<li><span class="advanced outer hidden"><span class="advanced inner hidden"><strong>Measurable domain</strong>: $E$ is a $\sigma$-algebra of measurable sets.</span></span></li>
<li><strong>Non-negativity</strong>: <span class="kdmath">$P(e) \geq 0,\ \forall e \in E$</span>.</li>
<li><strong>Null empty set</strong>: <span class="kdmath">$P(\emptyset) = 0$</span>.</li>
<li><strong>Unit sample set</strong>: <span class="kdmath">$P(\Omega) = 1$</span>.</li>
<li><strong>Countable additivity</strong>: For any countable set of events <span class="marginnote-outer"><span class="marginnote-ref"><span class="kdmath">$A \subseteq E$</span></span><label for="1c4c2f199fe96ef43d1f0b6cc492af4deb8af6af" class="margin-toggle"> ⊕</label><input type="checkbox" id="1c4c2f199fe96ef43d1f0b6cc492af4deb8af6af" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Remember that $E$ is the event space, and $A$ is a set of events.</span></span></span> whose events are pairwise disjoint (<span class="kdmath">$e_1 \cap e_2 = \emptyset$</span> for all distinct <span class="kdmath">$e_1, e_2 \in A$</span>), <span class="kdmath">$P(\bigcup A) = \sum_{e\in A} P(e)$</span>.</li>
</ul>
</li>
</ul>
<p>The triple $(\Omega, E, P)$ defines a <a href="https://en.wikipedia.org/wiki/Probability_space">probability space</a> <span class="advanced outer hidden"><span class="advanced inner hidden">which is also a <a href="https://en.wikipedia.org/wiki/Measure_space">measure space</a>.</span></span> These three objects are all we need to do probability calculations.</p>
<h2 id="kolmogorov-axioms-of-probability"><a class="header-anchor" href="#kolmogorov-axioms-of-probability">Kolmogorov axioms of probability</a></h2>
<p>You may have heard of the <a href="https://en.wikipedia.org/wiki/Probability_axioms#Axioms">Kolmogorov axioms of probability</a>. Kolmogorov formalized probability as a special case of measure theory. Essentially a probability measure is a normalized measure, i.e. assigns 1 to the entire sample space $\Omega$. Above, I’ve merged the axioms of measure theory with Kolmogorov’s axioms. For reference, here are Kolmogorov’s axioms given separately:</p>
<ol>
<li>$P(e) \in [0, 1], \forall e \in E$, where $[0, 1] \subset \real$.</li>
<li>$P(\Omega) = 1$, i.e. the probability that <em>something</em> happens is 1.</li>
<li><span class="advanced outer hidden"><span class="advanced inner hidden"><a href="https://en.wikipedia.org/wiki/Sigma_additivity">$\sigma$-additivity</a> on $E$.</span></span></li>
</ol>
<p><span class="advanced outer hidden"><span class="advanced inner hidden">Given the axioms of measure theory, we can define probability succinctly by simply stating that $(\Omega, E, P)$ is a measure space where $P(\Omega) = 1$ (see <a href="https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf">Terence Tao’s Introduction to Measure Theory</a>).</span></span></p>
<h2 id="examples"><a class="header-anchor" href="#examples">Examples</a></h2>
<p><strong>Finite</strong>: Dice rolls</p>
<p><span class="kdmath">$\Omega = \{⚀,⚁,⚂,⚃,⚄,⚅\}$</span>,<br />
<span class="kdmath">$E=2^\Omega$</span>,<br />
<span class="kdmath">$P(\{⚀\})= P(\{⚁\}) = \ldots = P(\{⚅\}) =1/6$</span>.</p>
<p>Note that <span class="kdmath">$P(⚀)$</span> is not defined. $P$ measures the “size” of sets. <span class="kdmath">$\{⚀\}$</span> is the set containing one sample. We can also compute the probability of larger sets, e.g.<br />
<span class="kdmath">$P(\{⚀,⚅\}) = 1/3$</span>,<br />
<span class="kdmath">$P(\{⚁,⚃,⚅\}) = 1/2$</span>,<br />
<span class="kdmath">$P(\{⚀,⚁,⚂,⚃,⚄,⚅\}) = 1$</span>.</p>
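<p>To make the measure-on-sets idea concrete, here is a minimal sketch of the dice space in Python (my own illustration, not part of the original text, with the faces written as integers 1–6 instead of pip glyphs):</p>

```python
from fractions import Fraction

# Sample set: the six die faces (integers stand in for the pip glyphs).
OMEGA = frozenset({1, 2, 3, 4, 5, 6})

def P(event: frozenset) -> Fraction:
    """Uniform probability measure: each singleton event has measure 1/6."""
    assert event <= OMEGA, "P is only defined on subsets of the sample set"
    return Fraction(len(event), len(OMEGA))

assert P(frozenset({1})) == Fraction(1, 6)      # P({⚀}) = 1/6
assert P(frozenset({1, 6})) == Fraction(1, 3)   # P({⚀,⚅}) = 1/3
assert P(frozenset({2, 4, 6})) == Fraction(1, 2)
assert P(OMEGA) == 1                            # unit sample set
assert P(frozenset()) == 0                      # null empty set
```

<p>Note that <code>P</code> takes a <em>set</em>, never a bare sample, mirroring the remark above that $P(⚀)$ is undefined.</p>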
<p><strong>Countable</strong> (event set): Variable length binary sequences</p>
<p><span class="kdmath">$\bin = \{0, 1\}$</span> is the binary alphabet.<br />
Let $x \in \bin^n$ be a binary sequence of any length $n$, and <span class="kdmath">$\len{x} := n$</span> returns the length of $x$.</p>
<p>The sample set is all infinite binary sequences, <span class="kdmath">$\Omega = \mathbb{B}^\infty$</span>.<br />
This lets us make an event for each finite-length $x$.<br />
Let <span class="kdmath">$\Gamma_x = \left\{\omega \in \Omega \bigmid x = \omega_{1:\len{x}}\right\}$</span>, where <span class="kdmath">$\omega_{1:\len{x}}$</span> is the length $\len{x}$ prefix of $\omega$.<br />
The event set is generated by the cylinder sets: <span class="kdmath">$E=\left\{\Gamma_x \bigmid x \in \mathbb{B}^n, n \in \mathbb{N}\cup\{0\}\right\}$</span>, together with all sets formed from these by countable union, intersection, and complement.</p>
<p>Then <span class="kdmath">$P(\Gamma_x)$</span> is the probability of $x$, and <span class="kdmath">$P(\Gamma_{x_1} \cup \Gamma_{x_2} \cup \ldots)$</span> is the probability of the set <span class="kdmath">$\{x_1, x_2, \ldots\}$</span>.<br />
Note that the probability of a finite sequence is always a marginal probability, in the sense that <span class="kdmath">$P(\Gamma_x) = P(\Gamma_{x`0}) + P(\Gamma_{x`1})$</span> where <span class="kdmath">$x`0$</span> and <span class="kdmath">$x`1$</span> are the concatenations of <span class="kdmath">$x$</span> with 0 or 1.</p>
<p>An example of such a measure is the uniform measure, <span class="kdmath">$P(\Gamma_x) = 2^{-\len{x}}$</span>.</p>
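<p>As a quick sanity check (a sketch of mine, not from the text), the uniform measure on cylinder sets and the marginalization identity P(Γ_x) = P(Γ_x0) + P(Γ_x1) can be verified directly:</p>

```python
from fractions import Fraction

def P_cylinder(x: str) -> Fraction:
    """Uniform measure of the cylinder set Γ_x of a finite prefix x: 2^(-len(x))."""
    return Fraction(1, 2 ** len(x))

x = "0110"
# Marginalization: the cylinder of x splits into the cylinders of x0 and x1.
assert P_cylinder(x) == P_cylinder(x + "0") + P_cylinder(x + "1")
# The empty prefix corresponds to the whole sample set Ω, so its measure is 1.
assert P_cylinder("") == 1
assert P_cylinder(x) == Fraction(1, 16)
```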
<p><strong>Uncountable</strong>: The reals</p>
<p><span class="kdmath">$\Omega=\real$</span>,<br />
<span class="kdmath">$E \subset 2^\real$</span> contains sets of reals formed by countable union, intersection, and complement of the open intervals. <span class="advanced outer hidden"><span class="advanced inner hidden">This particular choice of $E$ is called the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel algebra</a>, and is a standard $\sigma$-algebra for $\real$. The reason we don’t use $E = 2^\real$ as our event space is that some subsets of $\real$ are not measurable.</span></span></p>
<p>We only need to define $P$ on single intervals, and because of additivity of probability we can derive $P$ on every set in $E$. <span class="advanced outer hidden"><span class="advanced inner hidden">A measure $P$ defined on intervals is called a <a href="https://en.wikipedia.org/wiki/Borel_measure#On_the_real_line">Borel measure</a>.</span></span> Let</p>
<div class="kdmath">$$
P((a,b]) = \int_a^b \frac{1}{\sqrt{2 \pi }} e^{-\frac{x^2}{2}} \d x\,.
$$</div>
<p>Note that it does not matter if we define $P$ on open intervals, closed intervals, or half-open intervals, because the value of the integral is identical between these cases. <span class="advanced outer hidden"><span class="advanced inner hidden">Specifically, we are performing a Lebesgue integral, which is invariant to removing a measure 0 subset from the integral domain. See the <a href="https://en.wikipedia.org/wiki/Lebesgue_integration#Basic_theorems_of_the_Lebesgue_integral">equality almost-everywhere</a> property.</span></span></p>
<p>In this particular example, $\frac{\d}{\d x} P((0, x])$ is the <a href="https://en.wikipedia.org/wiki/Normal_distribution">standard normal</a> (i.e. Gaussian) <a href="https://en.wikipedia.org/wiki/Probability_density_function">probability density function (PDF)</a>. It is common, when working with probability on the reals, to provide a PDF which can be integrated over to derive the <span class="marginnote-outer"><span class="marginnote-ref">probability measure</span><label for="4bf04e5f74a071ce2f80b754d2578481d612a33e" class="margin-toggle"> ⊕</label><input type="checkbox" id="4bf04e5f74a071ce2f80b754d2578481d612a33e" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The output of the probability measure is called <em>probability mass</em>, to distinguish it from the output of the PDF, which is called <em>probability density</em>.</span></span></span>. In other words, a PDF $f(x)$ is a function that when integrated produces a probability measure: $P((a, b]) = \int_a^b f(x) \d x$.</p>
<h2 id="pmfs-and-pdfs-and-measures-oh-my"><a class="header-anchor" href="#pmfs-and-pdfs-and-measures-oh-my">PMFs and PDFs and measures, oh my!</a></h2>
<p>In standard probability textbooks and courses (largely for non-theoreticians), you are told about probability mass functions (PMFs) and probability density functions (PDFs), and their cumulative counterparts: cumulative mass functions (CMFs) and cumulative distribution functions (CDFs). So you may be wondering where these fit into the definition of probability above. I’ve been talking about probability measures, and only mentioned PDFs in the real line example above.</p>
<p>For finite and countable sample sets, PMFs, CMFs and measures are equivalent, meaning you can derive one from the others. We can convert between PMF $m : \Omega \to [0,1]$ and measure $P: E \to [0,1]$ with the following relations:</p>
<div class="kdmath">$$
\begin{aligned}
m(\omega) &= P(\{\omega\}) \\
P(e) &= \sum_{\omega\in e} m(\omega)\,.
\end{aligned}
$$</div>
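<p>These two conversion relations can be sketched directly in code; below is a minimal Python illustration of mine, using a fair die as the PMF:</p>

```python
from fractions import Fraction

# PMF for a fair die: m(ω) = 1/6 for each sample ω.
m = {w: Fraction(1, 6) for w in range(1, 7)}

def P(event) -> Fraction:
    """Measure derived from the PMF: P(e) = Σ_{ω∈e} m(ω)."""
    return sum(m[w] for w in event)

# Recover the PMF from the measure: m(ω) = P({ω}).
assert all(m[w] == P({w}) for w in m)
assert P({2, 4, 6}) == Fraction(1, 2)
assert P(set()) == 0
```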
<p>For continuous sample sets with differentiable CDFs <span class="advanced outer hidden"><span class="advanced inner hidden">where $E$ is the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel algebra</a></span></span> (e.g. the reals), PDFs, CDFs and measures are equivalent, meaning you can derive one from the others. We can convert between PDF $f : \Omega \to \real$ and measure $P : E \to [0,1]$ with the following relations:</p>
<div class="kdmath">$$
\begin{aligned}
f(x) &= \frac{\d}{\d x} P((c, x]) \\
P((a, b]) &= \int_a^b f(x) \d x\,,
\end{aligned}
$$</div>
<p>for some constant $c\in\Omega$.</p>
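<p>As a numerical check (a sketch I am adding, using the standard normal from the earlier example), the PDF really is the derivative of the interval measure, and the measure is recovered from the closed-form CDF:</p>

```python
import math

def Phi(x: float) -> float:
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def P(a: float, b: float) -> float:
    """Probability measure of the half-open interval (a, b]."""
    return Phi(b) - Phi(a)

def f(x: float, h: float = 1e-6) -> float:
    """Recover the PDF as the derivative of x ↦ P((c, x]); the constant c cancels."""
    return (P(0.0, x + h) - P(0.0, x - h)) / (2.0 * h)

pdf_at_zero = 1.0 / math.sqrt(2.0 * math.pi)  # exact value of the normal PDF at 0
assert abs(f(0.0) - pdf_at_zero) < 1e-5
assert abs(P(-1.96, 1.96) - 0.95) < 0.01      # the familiar 95% interval
```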
<p>The measure-theoretic definition of probability unifies the discrete and continuous cases, and can handle exotic cases, e.g. non-differentiable uncountable sample sets.</p>
<h2 id="events-vs-samples"><a class="header-anchor" href="#events-vs-samples">Events vs samples</a></h2>
<p><strong>Question:</strong> Why provide event space $E$? Isn’t this redundant with $\Omega$?</p>
<p>You may be thinking that given just $\Omega$, we can define $P : 2^\Omega \to [0,1]$ which satisfies the properties of a measure listed earlier, and it is sufficient to define <span class="kdmath">$P(\{\omega\})$</span> for each $\omega \in \Omega$. That is true for countable $\Omega$ (e.g. the dice example above). The technical reason for basing probability theory on measure theory is that for uncountable $\Omega$, some subsets are not measurable. $E$ tells us which subsets of $\Omega$ are measurable, and hence safe to compute the probability of. Perhaps the real reason is to simplify the definition of probability down to one constraint, $P(\Omega) = 1$. The apparent redundancy of $\Omega$ and $E$ is then inherited from measure theory. This kind of information redundancy <span class="marginnote-outer"><span class="marginnote-ref">in mathematical constructions is quite common</span><label for="fae4f9d36286bf3dcf9d3f2d43145ea41509157a" class="margin-toggle"> ⊕</label><input type="checkbox" id="fae4f9d36286bf3dcf9d3f2d43145ea41509157a" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For example, a group is defined as $(G, +)$ where $G$ is a set of objects and $+ : G \times G \to G$ is some binary operator defined over $G$. The definition of $+$ already includes $G$, so technically providing $G$ is not necessary. A group is defined as a tuple $(G, +)$ to distinguish it from the set $G$ and the operator $+$. Another example is a topological space defined as the tuple $(X, \tau)$ where $X$ is a set of objects and $\tau$ is a set of subsets of $X$ which contains $X$. Since $X = \bigcup \tau$, we don’t need to provide $X$, but again we want to distinguish the topological space from $X$ and $\tau$ (where $\tau$ is just called the topology).</span></span></span>, and is merely a particular notational style. Redundancy is not a high cost to pay for notational clarity.</p>
<p><strong>Question:</strong> Why do I care about events containing multiple samples? Only one sample ever happens at a time.</p>
<ol>
<li>We want to be able to calculate the probability of “one or the other thing” happening. Let <span class="kdmath">$\omega_1, \omega_2 \in \Omega$</span>. <span class="kdmath">$\{\omega_1\}, \{\omega_2\} \in E$</span> are the events corresponding to exactly one thing happening. <span class="kdmath">$\{\omega_1, \omega_2\} \in E$</span> is the event corresponding to either <span class="kdmath">$\omega_1$</span> or <span class="kdmath">$\omega_2$</span> happening.</li>
<li>We want to be able to calculate the probability of something not happening. Not-<span class="kdmath">$\omega_1$</span> is the event <span class="kdmath">$\{\omega \in \Omega \mid \omega \neq \omega_1\}$</span>.</li>
</ol>
<p><strong>Question:</strong> But what about the probability of “one <strong>AND</strong> the other thing” happening?</p>
<p>Samples in <span class="kdmath">$\Omega$</span> each represent exactly one unique state of the world. To say the world is in state $\omega_i$ AND $\omega_j$ simultaneously is a contradiction, since each state on its own is complete, in the sense that it specifies everything. However, it may be the case that the world-state can be decomposed into two independent parts. Then your sample set is the Cartesian product of sets for each independent sub-state, i.e. <span class="kdmath">$\Omega = \Lambda_1 \times \Lambda_2$</span> and <span class="kdmath">$\omega = (\lambda_1, \lambda_2) \in \Lambda_1 \times \Lambda_2$</span>. Thus each sample <span class="kdmath">$\omega$</span> already represents the “and” of two states if you want it to.</p>
<h1 id="constructing-events"><a class="header-anchor" href="#constructing-events">Constructing events</a></h1>
<p>A primitive event is a <span class="marginnote-outer"><span class="marginnote-ref">singleton set</span><label for="d110d830c28b4b739bdd1217694def459e015af9" class="margin-toggle"> ⊕</label><input type="checkbox" id="d110d830c28b4b739bdd1217694def459e015af9" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The set containing one sample, i.e. <span class="kdmath">$e =\{\omega\}$</span> where <span class="kdmath">$\omega \in \Omega$</span></span></span></span>. Events are <span class="marginnote-outer"><span class="marginnote-ref">what get observed, not samples</span><label for="0b2310df56ec7cd2cb22ca9570daf625df84da24" class="margin-toggle"> ⊕</label><input type="checkbox" id="0b2310df56ec7cd2cb22ca9570daf625df84da24" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">See the <a href="#throwing-darts">dart-throwing discussion</a> below for a good reason why this should be the case.</span></span></span>. If an event contains many samples, you don’t know which of them is the case, but only one can be the case since they are disjoint.</p>
<p>Probability theory has specialized notation that revolves around turning the “define my event and measure its probability” process into one concise notational step. Random variables (RV)s are central to this notation. But before introducing random variables, let’s look at how we would construct events and measure their probability without RVs:</p>
<ul>
<li><strong>Construct event:</strong> <span class="kdmath">$e = \{\omega \in \Omega \mid \mathrm{condition}(\omega)\}$</span>, where <span class="kdmath">$\mathrm{condition}(\omega)$</span> is some boolean valued proposition on $\omega$.</li>
<li><strong>Measure probability:</strong> $P(e)$. So long as <span class="kdmath">$e \in E$</span>, then <span class="kdmath">$P(e)$</span> is defined.</li>
</ul>
<p>Combined we have,</p>
<div class="kdmath">$$
P(\{\omega \in \Omega \mid \mathrm{condition}(\omega)\})\,.
$$</div>
<p>For example, if $\Omega = \nat$ and we wanted to compute the probability of getting an even number, then <span class="kdmath">$e = \{n \in \nat \mid \mathrm{Remainder}(n/2) = 0\}$</span> and <span class="kdmath">$P(\{n \in \nat \mid \mathrm{Remainder}(n/2) = 0\})$</span> is the probability.</p>
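<p>This construct-then-measure pattern is easy to mimic in code. One caveat: no uniform probability measure exists on all of $\nat$, so the sketch below (my choice, not the post’s) uses the geometric PMF m(n) = 2⁻ⁿ instead:</p>

```python
from fractions import Fraction

# Ω = ℕ with the geometric PMF m(n) = 2^(-n), n ≥ 1. An illustrative stand-in,
# since no uniform measure exists on a countably infinite sample set.
def m(n: int) -> Fraction:
    return Fraction(1, 2 ** n)

def P(condition, terms: int = 60) -> Fraction:
    """Measure of the event {n ∈ ℕ | condition(n)}, truncated to `terms` terms."""
    return sum(m(n) for n in range(1, terms + 1) if condition(n))

# P(even) = Σ_{k≥1} 4^(-k) = 1/3; the truncation error is astronomically small.
p_even = P(lambda n: n % 2 == 0)
assert abs(p_even - Fraction(1, 3)) < Fraction(1, 2 ** 50)
```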
<h2 id="random-variables"><a class="header-anchor" href="#random-variables">Random variables</a></h2>
<p>Random variables are devices for constructing events. That is their purpose. Contrary to their name, there is <span class="marginnote-outer"><span class="marginnote-ref">nothing random about them.</span><label for="250337d20f972049cc956351c6be818ab040f059" class="margin-toggle"> ⊕</label><input type="checkbox" id="250337d20f972049cc956351c6be818ab040f059" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">A random variable is a deterministic function. The word <em><strong>random</strong></em> is due to it being a function of samples which are randomly chosen.</span></span></span></p>
<p>A random variable is a <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> function <span class="kdmath">$X : \Omega \to F$</span>, <span class="advanced outer hidden"><span class="advanced inner hidden">where $(F, \mathcal{F})$ is a <a href="https://en.wikipedia.org/wiki/Measurable_space">measurable space</a> with $\sigma$-algebra $\mathcal{F}$ (specifies measurable subsets of $F$),</span></span> and the elements of $F$ can be any type of object.</p>
<p>There are three main motivations for the random variable formalism…</p>
<h3 id="motivation-1-information-hiding"><a class="header-anchor" href="#motivation-1-information-hiding">Motivation 1: Information hiding</a></h3>
<p>I briefly mentioned <a href="#events-vs-samples">above</a> that samples (world state) can be treated as containing sub-samples (sub-state), e.g. $\omega = (\lambda_1, \lambda_2) \in \Lambda_1 \times \Lambda_2 = \Omega$. Random variables are convenient for dealing with just one sub-sample in isolation, and they allow you to avoid committing to a particular way to divide up $\omega$, e.g. $\omega = (\lambda_1, \lambda_2) = (\kappa_1, \kappa_2, \kappa_3)$ might be two different and incompatible but semantically meaningful ways to divide sample $\omega$ into sub-samples.</p>
<p>A random variable $X : \Omega \to F$ <em>hides information</em> contained in $\omega \in \Omega$ by appropriate choice of $F$. E.g. let $\Omega = \Lambda_1 \times \Lambda_2$ and let <span class="kdmath">$X_1 : \Omega \to \Lambda_1 : (\lambda_1, \lambda_2) \mapsto \lambda_1$</span> and <span class="kdmath">$X_2 : \Omega \to \Lambda_2 : (\lambda_1, \lambda_2) \mapsto \lambda_2$</span> be two random variables. $X_1(\Omega) = \Lambda_1$ and $X_2(\Omega)=\Lambda_2$ are smaller sample spaces than $\Omega$, each which hide sub-samples.</p>
<p>When multiple random variables are invoked in the same context, they are assumed to be <span class="marginnote-outer"><span class="marginnote-ref">over the same sample space $\Omega$.</span><label for="2d1ff964981aff5411e5b2e1dc946fe1bd3dfccd" class="margin-toggle"> ⊕</label><input type="checkbox" id="2d1ff964981aff5411e5b2e1dc946fe1bd3dfccd" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For RVs $X_1, X_2, \ldots$ it is assumed there is a joint probability distribution <span class="kdmath">$P_{X_1, X_2, \ldots}$</span>. See the definition of joint distribution <a href="#probability-distribution-of-a-random-variable">below</a>.</span></span></span></p>
<h4 id="examples-1"><a class="header-anchor" href="#examples-1">Examples</a></h4>
<p><strong>Toss two coins</strong></p>
<p><span class="kdmath">$\Omega = \Lambda_1 \times \Lambda_2$</span>. <span class="kdmath">$(\lambda_1, \lambda_2) \in \Omega$</span>. <span class="kdmath">$\Lambda_1 = \Lambda_2 = \{H, T\}$</span>. <br />
Define <span class="kdmath">$X_1 : (\lambda_1, \lambda_2) \mapsto \lambda_1$</span> and <span class="kdmath">$X_2 : (\lambda_1, \lambda_2) \mapsto \lambda_2$</span>.<br />
<span class="kdmath">$X_1$</span> isolates the state of the first coin. <span class="kdmath">$X_2$</span> isolates the state of the second coin.<br />
$P(X_1=H) = P(\{\omega \in \Omega \mid X_1(\omega) = H\}) = P(\{(H,H), (H,T)\})$</p>
<p><strong>Toss two dice</strong></p>
<p><span class="kdmath">$\Omega = \Lambda_1 \times \Lambda_2$</span>. <span class="kdmath">$(\lambda_1, \lambda_2) \in \Omega$</span>. <span class="kdmath">$\Lambda_1 = \Lambda_2 = \{1,2,3,4,5,6\}$</span>. <br />
Define <span class="kdmath">$S : (\lambda_1, \lambda_2) \mapsto \lambda_1 + \lambda_2$</span>.<br />
<span class="kdmath">$S$</span> returns the sum of the two die outcomes. <br />
The image of <span class="kdmath">$S$</span> is <span class="kdmath">$\{2, 3, \ldots, 11, 12\}$</span>.<br />
<span class="kdmath">$P(S=4) = P(\{\omega \in \Omega \mid S(\omega) = 4\}) = P(\{(1,3), (2,2), (3, 1)\})$</span></p>
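<p>The two-dice example can be checked by brute-force enumeration; here is a minimal Python sketch of mine:</p>

```python
from fractions import Fraction
from itertools import product

# Ω = Λ1 × Λ2 for two fair dice, with the uniform measure.
OMEGA = list(product(range(1, 7), repeat=2))

def P(event) -> Fraction:
    return Fraction(len(event), len(OMEGA))

# Random variables are just deterministic functions of the sample.
S = lambda w: w[0] + w[1]   # sum of the two dice
X1 = lambda w: w[0]         # outcome of the first die

# P(S = 4) = P({ω | S(ω) = 4}) = P({(1,3), (2,2), (3,1)})
event = {w for w in OMEGA if S(w) == 4}
assert event == {(1, 3), (2, 2), (3, 1)}
assert P(event) == Fraction(3, 36)
assert P({w for w in OMEGA if X1(w) == 6}) == Fraction(1, 6)
```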
<p><strong>In the general case…</strong></p>
<p>we might want to represent any number of interacting observables and components in a system. How about modeling the weather or the stock market? Your primitive sample space might be astronomical, but you can identify all sorts of observables like the prices of AAPL and GOOG at time <span class="kdmath">$t$</span> or the temperatures of Florida and Vermont on Tuesday, which would be convenient to deal with separately. At the same time, you don’t want to lose the rich information about how one particular observable interacts with all the others. We would like to be able to ignore partial information contained in primitive samples (i.e. <a href="#probability-distribution-of-a-random-variable">marginalize</a>).</p>
<h3 id="motivation-2-syntactic-sugar"><a class="header-anchor" href="#motivation-2-syntactic-sugar">Motivation 2: Syntactic sugar</a></h3>
<p>We’ve seen how events can be constructed with set builder notation, i.e. <span class="kdmath">$e = \{\omega \in \Omega \mid \mathrm{condition}(\omega)\}$</span>, and we’ve seen how a random variable $X : \Omega \to F$ can be used to build events, e.g. <span class="kdmath">$e = \{\omega \in \Omega \mid X(\omega) = f\}$</span> where $f \in F$ is some object.</p>
<p>There is a shorthand notation for writing <span class="kdmath">$P(\{\omega \in \Omega \mid X(\omega) = f\})$</span>, which is</p>
<div class="kdmath">$$
P(X=f)\,.
$$</div>
<p>The general case of this notation is</p>
<div class="kdmath">$$
\begin{align}
& P(\mathrm{condition}(X_1, X_2, \ldots)) \\
& \quad = P(\{\omega \in \Omega : \mathrm{condition}(X_1(\omega), X_2(\omega), \ldots)\})\,,
\end{align}
$$</div>
<p>where $X_1 : \Omega \to F_1,\ \ X_2 : \Omega \to F_2, \ \ \ldots$ are random variables, and <span class="kdmath">$\mathrm{condition}(f_1, f_2, \ldots)$</span> is some boolean function of inputs <span class="kdmath">$f_1 \in F_1, f_2 \in F_2, \ldots$</span> <span class="advanced outer hidden"><span class="advanced inner hidden">with measurable spaces $(F_1, \mathcal{F}_1), (F_2, \mathcal{F}_2), \ldots$</span></span></p>
<p><strong>Examples:</strong></p>
<ul>
<li><span class="kdmath">$P(X = Y) = P(\{\omega \in \Omega \mid X(\omega) = Y(\omega)\})$</span>, where <span class="kdmath">$Y : \Omega \to F$</span> is a random variable.</li>
<li><span class="kdmath">$P(X=f, Y=g) = P(\{\omega \in \Omega \mid X(\omega)=f, Y(\omega)=g\})$</span> where <span class="kdmath">$Y:\Omega \to G$</span> and <span class="kdmath">$g \in G$</span>.</li>
<li><span class="kdmath">$P(X \in A) = P(\{\omega \in \Omega \mid X(\omega) \in A\})$</span>, for <span class="kdmath">$A \subseteq F$</span> <span class="advanced outer hidden"><span class="advanced inner hidden">(and $A \in \mathcal{F}$ is measurable).</span></span></li>
<li>$P(X > f) = P(\{\omega \in \Omega \mid X(\omega) > f\})$.</li>
<li>$P(X > Y) = P(\{\omega \in \Omega \mid X(\omega) > Y(\omega)\})$.</li>
<li>Arbitrary algebraic expressions of random variables, e.g. <span class="kdmath">$P(c_0 + c_1 X + c_2 X^2 + c_3 X^3 + \ldots = k) = P(\{\omega \in \Omega \mid c_0 + c_1 X(\omega) + c_2 X(\omega)^2 + c_3 X(\omega)^3 + \ldots = k\})$</span> or <span class="kdmath">$P(\exp(X) = \log(Y)) = P(\{\omega \in \Omega \mid \exp(X(\omega)) = \log(Y(\omega))\})$</span>.</li>
</ul>
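<p>The shorthand above amounts to a function that takes a boolean condition and internally builds the event; a small Python sketch of that reading (using the two-dice space as an assumed example):</p>

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure

def P(condition) -> Fraction:
    """Shorthand P(condition(·)) := measure of the event {ω ∈ Ω | condition(ω)}."""
    event = [w for w in OMEGA if condition(w)]
    return Fraction(len(event), len(OMEGA))

X = lambda w: w[0]
Y = lambda w: w[1]

assert P(lambda w: X(w) == Y(w)) == Fraction(1, 6)              # P(X = Y)
assert P(lambda w: X(w) > Y(w)) == Fraction(15, 36)             # P(X > Y)
assert P(lambda w: X(w) == 2 and Y(w) == 5) == Fraction(1, 36)  # P(X=2, Y=5)
```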
<p>A standard notational convention is that calling a function on a random variable generates a new random variable, i.e. <span class="kdmath">$h(X) = h∘X$</span>, so that <span class="kdmath">$P(h(X) = c)$</span> can be parsed either as <span class="kdmath">$P(Y = c)$</span> where random variable <span class="kdmath">$Y = h∘X$</span>, or as <span class="kdmath">$P(\mathrm{condition}(X))$</span> where <span class="kdmath">$\mathrm{condition}(x)$</span> is the expression <span class="kdmath">$h(x) = c$</span>.</p>
<h4 id="probability-distribution-of-a-random-variable"><a class="header-anchor" href="#probability-distribution-of-a-random-variable">Probability distribution of a random variable</a></h4>
<p>Any random variable $X : \Omega \to F$ <span class="advanced outer hidden"><span class="advanced inner hidden">to measurable space $(F, \mathcal{F})$</span></span> induces a unique probability measure with $F$ as the sample set, rather than $\Omega$. We call it the <strong>marginal distribution</strong> w.r.t. $X$, defined as <span class="kdmath">$P_X: F \to [0, 1]$</span>:</p>
<div class="kdmath">$$
P_X(A) := P(X \in A) = P(\{\omega \in \Omega \mid X(\omega) \in A\})\,,
$$</div>
<p>for <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> $A \subseteq F$. Thus $(F, \mathcal{F}, P_X)$ is the probability space for the marginal distribution of $X$. Note that <span class="kdmath">$P(X=f) = P_X(\{f\})$</span>, <span class="kdmath">$P(X < f) = P_X(\{f' \in F \mid f' < f\})$</span>, etc.</p>
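<p>The pushforward construction can be sketched in code (my illustration, reusing the dice-sum RV): the marginal distribution of $S$ is a genuine probability measure living on its value set rather than on $\Omega$.</p>

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure

S = lambda w: w[0] + w[1]  # the dice-sum random variable

def P_X(X, A) -> Fraction:
    """Pushforward measure: P_X(A) = P({ω ∈ Ω | X(ω) ∈ A})."""
    return Fraction(sum(1 for w in OMEGA if X(w) in A), len(OMEGA))

# The marginal distribution of S lives on F = {2, ..., 12}, not on Ω.
marginal = {f: P_X(S, {f}) for f in range(2, 13)}
assert marginal[7] == Fraction(1, 6)
assert sum(marginal.values()) == 1          # P_X is itself a probability measure
assert P_X(S, {2, 3}) == Fraction(3, 36)    # P(S ∈ {2, 3})
```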
<p>We often have more than one random variable of interest. With $X$ defined above and $Y : \Omega \to G$ <span class="advanced outer hidden"><span class="advanced inner hidden">to measurable space $(G, \mathcal{G})$</span></span>, we have the marginal distributions $P_X$ and $P_Y$, and also the <strong>joint distribution</strong> w.r.t. $X$ and $Y$, defined as $P_{X,Y} : F \times G \to [0, 1]$:</p>
<div class="kdmath">$$
P_{X,Y}(A, B) := P(X \in A \wedge Y \in B) = P(\{\omega \in \Omega \mid X(\omega) \in A \wedge Y(\omega) \in B\})
$$</div>
<p>for <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> $A \subseteq F, B \subseteq G$. Thus <span class="marginnote-outer"><span class="marginnote-ref">$(F \times G, \mathcal{F} \otimes \mathcal{G}, P_{X,Y})$</span><label for="0a155670fb530d756f4f07e993143bde44206b9e" class="margin-toggle"> ⊕</label><input type="checkbox" id="0a155670fb530d756f4f07e993143bde44206b9e" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">$\mathcal{F} \otimes \mathcal{G}$ is the $\sigma$-algebra generated by the rectangles $\{A \times B \mid A \in \mathcal{F}, B \in \mathcal{G}\}$.</span></span></span> is the probability space for the joint distribution of $X$ and $Y$.</p>
<p>In general, for RVs $X_1 : \Omega \to F_1,\ \ X_2 : \Omega \to F_2,\ \ \ldots$, we have the joint distribution $P_{X_1,X_2,\ldots} : F_1 \times F_2 \times \ldots \to [0, 1]$:</p>
<div class="kdmath">$$
P_{X_1,X_2,\ldots}(A_1, A_2, \ldots) := P(X_1 \in A_1 \wedge X_2 \in A_2 \wedge \ldots)\,.
$$</div>
<p>A joint distribution may also be a marginal distribution. For example, given RVs $X_1, \ldots, X_{10}$, the probability measure $P_{X_3,X_5,X_7}$ is simultaneously a joint distribution over three RVs and a marginal of the full joint $P_{X_1,\ldots,X_{10}}$.</p>
<p>RVs in a joint distribution need not be created from Cartesian products of sample sets, i.e. the output of one RV may partially determine the output of another. Taking the two dice example, the sample space is <span class="kdmath">$\Omega = \{1, \ldots, 6\} \times \{1, \ldots, 6\}$</span>. The random variable for the outcome of die 1 is $D_1 : (n, m) \mapsto n$, and the random variable for the sum of dice is $S : (n, m) \mapsto n + m$. Choosing $\omega \in \Omega$ to determine $D_1$ may also determine $S$, and vice versa. If I want $S(\omega) = 2$ then $\omega = (1, 1)$ and $D_1(\omega) = 1$ is fully determined. Likewise, if we choose $\omega$ so that $D_1(\omega) = 6$ then the possible values of $S(\omega)$ are restricted to $7, 8, 9, 10, 11, 12$. Nevertheless, $P_{D_1, S}$ is a perfectly fine joint distribution.</p>
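<p>That dependence between $D_1$ and $S$ can be verified by enumeration (a sketch of mine):</p>

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure

D1 = lambda w: w[0]         # outcome of die 1
S = lambda w: w[0] + w[1]   # sum of both dice

def P_joint(A, B) -> Fraction:
    """Joint pushforward: P_{D1,S}(A, B) = P({ω | D1(ω) ∈ A and S(ω) ∈ B})."""
    return Fraction(sum(1 for w in OMEGA if D1(w) in A and S(w) in B), len(OMEGA))

# S = 2 forces ω = (1, 1), hence D1 = 1:
assert P_joint({1}, {2}) == Fraction(1, 36)
assert P_joint({6}, {2}) == 0   # impossible: D1 = 6 rules out S = 2
# The two RVs are dependent: P(D1=1, S=2) ≠ P(D1=1) · P(S=2).
p_d1 = P_joint({1}, set(range(2, 13)))
p_s = P_joint(set(range(1, 7)), {2})
assert P_joint({1}, {2}) != p_d1 * p_s
```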
<p>Keeping track of all these probability functions can be confusing, e.g. marginals $P_X$ and $P_Y$ and joint $P_{X,Y}$ are in a sense derived from a single probability function $P$, where $P(X=x)$ and $P(Y=y)$ are equivalent to <span class="kdmath">$P_X(\{x\})$</span> and <span class="kdmath">$P_Y(\{y\})$</span>. However, it is possible to have two different underlying probability measures that reuse the same random variables, e.g. $Q : E \to [0, 1]$ with expressions like $Q(X=x)$ and $Q(Y=y)$ being possible, and marginals $Q_X$ and $Q_Y$ and joint $Q_{X,Y}$. Keep in mind that calculations with $P$-related and $Q$-related probability functions do not necessarily have anything to do with each other.</p>
<h4 id="notational-confusion"><a class="header-anchor" href="#notational-confusion">Notational confusion</a></h4>
<p>The language of probability may seem simple enough, but notationally it can be quite cumbersome. When it comes to applications in statistics, machine learning and physics just to name a few, there can be a large quantity of random variables and complicated probability distributions. Authors of academic texts tend to take shortcuts for ease of readability, but they pay the price of ambiguity, which especially hurts readers who are not already familiar with the domain. This is not the fault of authors, but a symptom of clunky notation. I will outline a few common shortcuts and notational difficulties. I hope to write a separate post delving deeper into examples where ambiguity occurs in the wild and how to avoid it.</p>
<p>In texts there is often ambiguity between PMFs, PDFs and measures, and between samples, events, and random variables.</p>
<p>For example, you may see any of $P(X), p(X), P(x)$ or $p(x)$, where it is not made clear whether $P$ or $p$ is a measure or a PMF/PDF, and whether $X$ or $x$ is a sample, event, or random variable. There is no universal convention on uppercase vs lowercase. Uppercase $X$ can mean a vector or matrix in a lot of contexts, as can bold $\boldsymbol{X}$. The same goes for marginals, e.g. $p_X(x)$ is common.</p>
<p>When there are many random variables to juggle, you may see different ways to denote marginal distributions, e.g. $P(X,Y)$ and $P_{X,Y}$. This becomes important when you want to do algebra with probability, e.g.</p>
<ol>
<li>$P(W) = P(X, Y, Z)/Q(Y,Z)$</li>
<li>$P_W = P_{X, Y,Z}/Q_{Y,Z}$</li>
<li>$P(W=w) = P(X=f(w), Y=g(w), Z=h(w))/Q(Y=g(w),Z=h(w))$</li>
</ol>
<p>The problem with the first case is that it depends on position for variable identity, but the reader expects identity by name, i.e. $P(X, Y, Z)$ is intended to be the same as $P(Y, Z, X)$. The second case fixes this problem because it cleanly separates the meaning of each argument from its value, e.g. $P_{X,Y,Z}(Z,X,Y)$ reads “plug in $Z$ for $X$, $X$ for $Y$, and $Y$ for $Z$.” The last case is equivalent to the second, and much like <a href="https://www.w3schools.com/python/gloss_python_function_keyword_arguments.asp">keyword argument syntax in Python</a>, but with the benefit of being notationally primitive rather than relying on the <em>function factory</em> convention $f_{X_1,X_2,X_3,\ldots}(x_1, x_2, x_3, \ldots) = f(X_1=x_1, X_2=x_2, X_3=x_3,\ldots)$.</p>
<p>It is worth noting that probability notation can be used correctly without too much trouble. Theoretical statistics and mathematics texts tend to have good examples of correct usage.</p>
<h3 id="motivation-3-construct-events-that-are-guaranteed-measurable"><a class="header-anchor" href="#motivation-3-construct-events-that-are-guaranteed-measurable">Motivation 3: Construct events that are guaranteed measurable</a></h3>
<p>Using random variable $X : \Omega \to F$ inside set-builder notation will guarantee that the result is an event, i.e. an element of $E$. For example, <span class="kdmath">$\{\omega \in \Omega \mid X(\omega) \in A\} \in E$</span> as long as $X^{-1}(A) \in E$. We specified in the definition of random variable that it be a <em>measurable</em> function, which is a fancy way of saying that we restrict ourselves to such $A \subseteq F$ where $X^{-1}(A) \in E$ holds.</p>
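<p>On a finite sample space with $E = 2^\Omega$, this construction is easy to spell out. A minimal sketch (all names here are illustrative):</p>

```python
# Minimal sketch: with Ω finite and E = 2^Ω (the discrete σ-algebra),
# every preimage X⁻¹(A) = {ω ∈ Ω | X(ω) ∈ A} is automatically an event.
omega = {1, 2, 3, 4, 5, 6}       # sample set: a die roll

def X(w):
    return w % 2                 # random variable: parity of the roll

def preimage(X, A, omega):
    # The event {ω ∈ Ω | X(ω) ∈ A}.
    return {w for w in omega if X(w) in A}

event = preimage(X, {0}, omega)  # the event "the roll is even"
assert event == {2, 4, 6}
```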
<p><span class="advanced outer hidden"><span class="advanced inner hidden">
The definition of random variable specifies that the function $X : \Omega \to F$ is <em>measurable</em>. That means for measurable spaces $(\Omega, E)$ and $(F, \mathcal{F})$, it is the case that <span class="marginnote-outer"><span class="marginnote-ref"><span class="kdmath">$X^{-1}(A) \in E,\ \forall A \in \mathcal{F}$</span>.</span><label for="f5bfb7f5110efa973669d06b6cf8443929676dd1" class="margin-toggle"> ⊕</label><input type="checkbox" id="f5bfb7f5110efa973669d06b6cf8443929676dd1" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Where <span class="kdmath">$X^{-1}(A) = \{\omega \in \Omega \mid X(\omega) \in A\}$</span> is the pre-image of <span class="kdmath">$X$</span> on <span class="kdmath">$A$</span>.</span></span></span> In other words, $X$ <span class="marginnote-outer"><span class="marginnote-ref">never maps a non-measurable subset</span><label for="09f5e06fa264194f74f5ebadcfbdeb1baba486aa" class="margin-toggle"> ⊕</label><input type="checkbox" id="09f5e06fa264194f74f5ebadcfbdeb1baba486aa" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">However, $X$ could map a measurable subset of $\Omega$ to a non-measurable subset of $F$.</span></span></span> of $\Omega$ to a measurable subset of $F$. Thus every set of the form <span class="kdmath">$\{\omega \in \Omega \mid X(\omega) \in A\} = X^{-1}(A)$</span> for measurable <span class="kdmath">$A \in \mathcal{F}$</span> is guaranteed to be measurable.</span></span></p>
<p><span class="advanced outer hidden"><span class="advanced inner hidden"><strong>Question:</strong> Are arbitrary expressions of random variables, i.e. $\mathrm{condition}(X_1, X_2, \ldots)$, guaranteed measurable?</span></span></p>
<h1 id="almost-surely"><a class="header-anchor" href="#almost-surely">Almost surely</a></h1>
<p>We know that $P(\emptyset) = 0$. It is possible (and common) to have non-empty events which have probability zero. Since we are calling $P$ a <em>measure</em> of probability (analogous to the size of a set), we say that a set $e$ where $P(e) = 0$ has measure 0. Such an event is said to occur <strong>almost never</strong>.</p>
<p>We also know that $P(\Omega) = 1$. Wherever a non-empty set has measure 0, its complement is a non-$\Omega$ set with measure 1, by the additivity of probability measure. Such events are said to occur <a href="https://en.wikipedia.org/wiki/Almost_surely"><strong>almost surely</strong></a>.</p>
<p>There is nothing strange about non-empty sets of measure 0. Probability measure is not measuring the number of samples in an event (that would be set cardinality). If $P(e) = 0$, then for any sub-event $e’ \subset e$ we have $P(e’) = 0$ by additivity of probability measure. So if $\omega \in e$, then <span class="kdmath">$P(\{\omega\}) = 0$</span>. We could say informally that sample $\omega$ <span class="marginnote-outer"><span class="marginnote-ref">has</span><label for="8b6a113f6e09785a059e28fd0bd407e1177231b2" class="margin-toggle"> ⊕</label><input type="checkbox" id="8b6a113f6e09785a059e28fd0bd407e1177231b2" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">While recognizing that formally samples don’t have probability, and it is the event <span class="kdmath">$\{\omega\}$</span> which has probability 0.</span></span></span> 0 probability.</p>
<p><strong>Question:</strong> What does <span class="kdmath">$P(\{\omega\}) = 0$</span> imply about $\omega$? Does it mean that $\omega$ can never be the case, i.e. can never be a state of the world?</p>
<p>This is a question about the interpretation of probability, i.e. how probability theory interfaces with reality, and there is no universally agreed upon answer. The mathematical construction of probability theory is agnostic on the matter.</p>
<p>I think there are two follow up questions that naturally fall out of the original:</p>
<ol>
<li>For what reason would we define a probability measure $P$ such that <span class="kdmath">$P(\{\omega\}) = 0$</span> for some $\omega \in \Omega$?</li>
<li>If we are told $P$ describes some physical process and <span class="kdmath">$P(\{\omega\}) = 0$</span>, what will we observe?</li>
</ol>
<p>Naive answers to both are that we may assign measure 0 to events which can never be observed to occur, and if we believe an event has measure 0 then we will never observe it occurring. There are some who will say that nothing is impossible, merely improbable, and all events should be assigned non-zero probability. Clearly “no confirmation ⟹ impossible” is the <span class="marginnote-outer"><span class="marginnote-ref"><a href="https://en.wikipedia.org/wiki/Black_swan_theory">black swan fallacy</a>,</span><label for="b1edfaa9f6795859151d1b1f2a83d8d9aa8f7daa" class="margin-toggle"> ⊕</label><input type="checkbox" id="b1edfaa9f6795859151d1b1f2a83d8d9aa8f7daa" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Black swans were presumed to not exist by Europeans before the 16th century because only white swans had been observed. “However, in 1697, Dutch explorers led by Willem de Vlamingh became the first Europeans to see black swans, in Western Australia.” The fallacy is that lack of confirmation of something being true does not rule out the possibility that it is true. This fallacy amounts to mistaking ‘I have not found $x$ s.t. $\mathrm{proposition}(x)$’ for ‘$\not\exists x$ s.t. $\mathrm{proposition}(x)$’.</span></span></span>. You cannot know something is impossible by lack of observation, so you should not assign 0 probability because of lack of data. However, something may be logically impossible, or you may know something is impossible via other means.</p>
<p>Question #1 is a special case of the <a href="https://en.wikipedia.org/wiki/Inverse_probability">inverse probability problem</a>, which is the problem of determining the probability measure (distribution) that best describes some physical process (e.g. a game, physical experiment, stock market). Is there a 1-to-1 mapping between physical processes and probability distributions? In other words, is the distribution that best describes a physical process objective and unique, i.e. <span class="marginnote-outer"><span class="marginnote-ref">independently verifiable.</span><label for="2bc94f5ae10cf6a370cb757081fe65bb96d517c0" class="margin-toggle"> ⊕</label><input type="checkbox" id="2bc94f5ae10cf6a370cb757081fe65bb96d517c0" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In the same way that scientific experiments can be reproduced and verified by independent parties. If the reason for selecting measure $P_1$ over measure $P_2$ to describe a physical process is not dogmatic, then that choice should be independently arrived at from first principles by multiple parties.</span></span></span></p>
<p>There is at this time no good answer to the inverse probability problem. Kolmogorov developed his definition of probability to match the mathematical intuitions on probability of his predecessors going back to the <span class="marginnote-outer"><span class="marginnote-ref">17th century.</span><label for="47efecfe1101c6112ef47dad13c7a5d56659bc89" class="margin-toggle"> ⊕</label><input type="checkbox" id="47efecfe1101c6112ef47dad13c7a5d56659bc89" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Famously the <a href="https://en.wikipedia.org/wiki/Problem_of_points">problem of points</a> is an example of early probability calculation.</span></span></span> But what gave rise to this persistent intuition that the whole world should be described with probability, and that probability values should represent randomness and unpredictability? That I do not have an answer to, but I found Ian Hacking’s <a href="https://en.wikipedia.org/wiki/The_Emergence_of_Probability">The Emergence of Probability</a> to give a good account of the historical emergence of probability theory.</p>
<p>Not only is probability theory agnostic on the meaning of 0 probability, it doesn’t actually have anything to say about what it means for an outcome to be likely or unlikely, or expected or unexpected in the colloquial sense, at least not in a non-circular way. If we observe 100 coin tosses all come up heads, I might say it was a fair coin and the tosser just got lucky/unlucky, and you might say the coin tosses were rigged and the probability of this outcome was clearly close to 1. Who’s to say which probabilistic description of the physical setup is correct, unless there is some theory to tell us what probability distributions describe what physical systems, and thus what experiment we could do to see <span class="marginnote-outer"><span class="marginnote-ref">who is correct</span><label for="d98b8bb83b2ec1cfb46d3d878644310d083fb613" class="margin-toggle"> ⊕</label><input type="checkbox" id="d98b8bb83b2ec1cfb46d3d878644310d083fb613" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We do hold a lot of intuitions about this correspondence between the physical realm and probability. For example, symmetries should correspond to equiprobable outcomes. Most people will agree that if the coin were asymmetric in some way that could be a cause for it to come up one way more often. But how much more often? This is where things get fuzzy. In general, how do you determine the precise probability of heads from a model of coin tossing?</span></span></span>? This is out of scope of probability theory. Kolmogorov’s axioms merely ensure that probability is self-consistent within the realm of mathematics.</p>
<p>Kolmogorov himself tried to fix this shortcoming which led to the development of <a href="http://www.scholarpedia.org/article/Algorithmic_information_theory">algorithmic information theory</a>. In <a href="https://www.sciencedirect.com/science/article/pii/S0304397598000759?via%3Dihub">On tables of random numbers</a> he writes:</p>
<blockquote>
<p>… for a long time I had the following views:<br />
(1) The frequency concept based on the notion of limiting frequency as the number of trials increases to infinity, does not contribute anything to substantiate the applicability of the results of probability theory to real practical problems where we have always to deal with a finite number of trials.<br />
(2) The frequency concept applied to a large but finite number of trials does not admit a rigorous formal exposition within the framework of pure mathematics.</p>
</blockquote>
<h2 id="throwing-darts"><a class="header-anchor" href="#throwing-darts">Throwing darts</a></h2>
<p><a href="#examples">Above</a> I gave the reals as an example of a sample set. It is not hard to show that <a href="https://proofwiki.org/wiki/Countable_Sets_Have_Measure_Zero">every countable subset of the reals must have measure 0</a>. This gives rise to the classic conundrum that any particular number sampled from the real line (under, say, a Gaussian pdf) will have 0 probability of occurring. Or <span class="marginnote-outer"><span class="marginnote-ref">more poetically</span><label for="2b21c5bb409b3889130ada4524bd9d8231827510" class="margin-toggle"> ⊕</label><input type="checkbox" id="2b21c5bb409b3889130ada4524bd9d8231827510" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">This is just the same thought experiment but in $\real^2$.</span></span></span>, throw a dart at a dart board, and wherever it lands there is 0 probability of it doing so.</p>
<p>My response is two-fold. In the case of the dart board, since we are invoking a physical process, I argue that there are only finitely many distinguishable places the dart can land, limited by the precision of our measurement apparatus (e.g. a camera). I assert that we can only ever have finite precision on measurements (see my <a href="http://zhat.io/articles/primer-shannon-information#proof-that-mi-is-fininte-for-continuous-distributions">discussion on mutual information</a>). For this reason, event sets for physical processes are functionally finite, even if the sample set is infinite.</p>
<p>Probability theory gives us an elegant way to model a physical process with continuous state while simulating measurements of finite precision. This brings me to the real line example. Assuming we have a probability density function with <a href="https://en.wikipedia.org/wiki/Support_(mathematics)">support everywhere</a>, for both the dart board and real line, the measure of intervals that are not just points will be non-zero, because such intervals are uncountable sets. So choosing event intervals which correspond to measurement error bounds will produce events with non-zero probability. In short, you are taking the probability of a physical measurement outcome, not a <span class="marginnote-outer"><span class="marginnote-ref">state of the world!</span><label for="ccb568173abda2dfc430f6520595779078f92bbc" class="margin-toggle"> ⊕</label><input type="checkbox" id="ccb568173abda2dfc430f6520595779078f92bbc" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We could say states of the world are not directly accessible, but are only indirectly observable through finite measurement precision.</span></span></span> <span class="marginnote-outer"><span class="marginnote-ref">Singleton events</span><label for="707da6d71c44e8f065de71fff0ac04b56c5c26e2" class="margin-toggle"> ⊕</label><input type="checkbox" id="707da6d71c44e8f065de71fff0ac04b56c5c26e2" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Really any event containing finite or countably many samples in a sense is an infinite precision measurement, and conveys infinite information.</span></span></span> on $\real$ have essentially infinite precision, and you are in a sense <span class="marginnote-outer"><span class="marginnote-ref">“paying for” more precision</span><label for="502417c6d8b295ed75b255687e16dbd4f8d0cea7" class="margin-toggle"> ⊕</label><input type="checkbox" id="502417c6d8b295ed75b255687e16dbd4f8d0cea7" class="margin-toggle" /><span 
class="marginnote"><span class="marginnote-inner">There is a direct connection between precision and information. More precision means more bits. Infinite precision means infinite information, and 0 probability. This is why the <a href="http://zhat.io/articles/primer-shannon-information#shannon-information-for-continuous-distributions">entropy of most distributions on $\real$ is infinite</a>.</span></span></span> in your events with increasingly small probabilities. At the limit, you pay for infinite precision with 0 probability.</p>
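<p>This trade-off can be illustrated numerically. Under a standard Gaussian density, a singleton has probability 0, while an interval of width $2\epsilon$ around the same point does not; here is a minimal sketch using only the standard library (the helper names are my own):</p>

```python
from math import erf, sqrt

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    # CDF of a normal distribution, written via the error function.
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def interval_prob(lo, hi):
    # P(lo <= X <= hi) for a standard normal X.
    return gaussian_cdf(hi) - gaussian_cdf(lo)

eps = 1e-3                                 # measurement precision (error bound)
p_point = interval_prob(0.5, 0.5)          # singleton "event": probability 0
p_interval = interval_prob(0.5 - eps, 0.5 + eps)  # finite-precision measurement

assert p_point == 0.0
assert 0.0 < p_interval < 1.0
# Paying for more precision: tightening the error bound shrinks the probability.
assert interval_prob(0.5 - eps / 2, 0.5 + eps / 2) < p_interval
```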
<h2 id="borels-law-of-large-numbers"><a class="header-anchor" href="#borels-law-of-large-numbers">Borel’s law of large numbers</a></h2>
<p>A classical interpretation of probability is that it represents the frequency of occurrence of some event in a repeatable process as the number of repetitions goes to infinity. This is sometimes called the <strong>frequentist</strong> interpretation of probability.</p>
<p><em>Repeatable</em>, in the language of probability theory, means <strong>independently and identically distributed</strong> (i.i.d.). That is, for RVs $X_1, X_2, \ldots$ their marginals are equal, $P_{X_1} = P_{X_2} = \ldots$ (i.e. identical), and their joint distribution is the product of marginals, $P_{X_1, X_2, \ldots}(A_1 \times A_2 \times \ldots) = P_{X_1}(A_1)\cdot P_{X_2}(A_2) \cdot \ldots$ (i.e. <a href="https://en.wikipedia.org/wiki/Independence_(probability_theory)">independent</a>).</p>
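<p>For finite distributions both conditions can be checked directly. A minimal sketch for two fair coin flips (all names are illustrative):</p>

```python
from itertools import product

marginal = {"H": 0.5, "T": 0.5}   # PMF of a single fair coin flip

# Joint PMF of two flips, built as the product of (identical) marginals.
joint = {(a, b): marginal[a] * marginal[b]
         for a, b in product(marginal, repeat=2)}

# Identically distributed: both coordinates share the same marginal ...
m1 = {a: sum(p for (x, _), p in joint.items() if x == a) for a in marginal}
m2 = {b: sum(p for (_, y), p in joint.items() if y == b) for b in marginal}
assert m1 == m2 == marginal

# ... and independent: the joint factors into the product of marginals.
assert all(abs(joint[(a, b)] - m1[a] * m2[b]) < 1e-12 for (a, b) in joint)
```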
<p>We have two problems:</p>
<ol>
<li>What does it mean for a physical process to be i.i.d.?</li>
<li>What does it mean to draw from a probability distribution more than once?</li>
</ol>
<p>The first is an open question. E.T. Jaynes in his <a href="https://www.cambridge.org/core/books/probability-theory/9CA08E224FF30123304E6D8935CF1A99">Logic of Science</a> argues that i.i.d. is never a reasonable description of physical systems:</p>
<blockquote>
<p>Such a belief is almost never justified, even for the fairly well-controlled measurements of the physicist or engineer, not only because of unknown systematic error, but because successive measurements lack the logical independence required for these limit theorems to apply.</p>
</blockquote>
<p>Consider two coin tosses. What makes them independent outcomes? We have an intuition that they are not causally connected and therefore they don’t share information, i.e. you cannot predict the outcome of one coin any better given the outcome of the other. There is a sort of paradox at the heart of probability theory, where an event with probability between 0 and 1 necessarily implies lack of understanding of the process behind that event. If you knew completely how a process gives rise to any particular outcome, then you could just <span class="marginnote-outer"><span class="marginnote-ref">model that process without probability</span><label for="292fc0dd509009c0a60ec63bb6c57ba411f69970" class="margin-toggle"> ⊕</label><input type="checkbox" id="292fc0dd509009c0a60ec63bb6c57ba411f69970" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For example, these papers modeling coin tossing:<br />‣ <a href="https://statweb.stanford.edu/~susan/papers/headswithJ.pdf">DYNAMICAL BIAS IN THE COIN TOSS</a><br />‣ <a href="https://arxiv.org/pdf/1008.4559.pdf">Probability, geometry, and dynamics in the toss of a thick coin</a><br />which move the probabilistic component of the model onto the initial conditions.</span></span></span>. So then, any model of the two coins that demonstrates why they do not share information would need to reveal their inner workings, thus going inside the physical black box delineated by probability. To understand why they are independent is to make their outcomes determined from a physicist’s “god-like perspective”, and in a sense non-probabilistic.</p>
<p>Regardless of the physical reality of i.i.d. processes, there is the mathematical question of how to represent i.i.d. repetitions of an experiment. Given $(\Omega, E, P)$ for our experiment and the identity RV $X : \omega \mapsto \omega$, we can derive a larger distribution representing $n$ trials by taking the Cartesian product of the sample space $n$ times, i.e. our probability space is $(\Omega_n, E_n, P_n)$ where</p>
<div class="kdmath">$$
\begin{align}
\Omega_n &:= \underbrace{\Omega \times \Omega \times \ldots \times \Omega}_{n\ \mathrm{times}} \\
E_n &:= \underbrace{E \otimes E \otimes \ldots \otimes E}_{n\ \mathrm{times}} \\
P_n &: (e_1, \ldots, e_n) \mapsto \prod_{i=1}^n P(e_i)\,.
\end{align}
$$</div>
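<p>For a finite sample set, the $n$-fold product construction can be sketched directly (here on singleton samples rather than general events; the helper <code>product_space</code> is my own):</p>

```python
from itertools import product
from math import prod, isclose

def product_space(pmf, n):
    # n-fold product: Ω_n = Ω × ... × Ω with P_n((ω_1, ..., ω_n)) = ∏ P({ω_i}),
    # written here on singleton samples rather than general events.
    return {seq: prod(pmf[w] for w in seq) for seq in product(pmf, repeat=n)}

die = {w: 1 / 6 for w in range(1, 7)}   # one fair die roll
p3 = product_space(die, 3)              # three i.i.d. rolls

assert len(p3) == 6 ** 3                   # |Ω_3| = |Ω|^3
assert isclose(sum(p3.values()), 1.0)      # P_3 is a probability measure
```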
<p>Ignoring the mathematical difficulties involved, let’s invoke the sample set over infinite trials, $\Omega_\infty$. Let’s also create a random variable for the outcome of each trial <span class="kdmath">$t \in \nat\setminus\{0\}$</span> in the infinite series:</p>
<div class="kdmath">$$
X_t : \Omega_\infty \to \Omega : (\omega_1, \omega_2, \ldots, \omega_t, \ldots) \mapsto \omega_t\,.
$$</div>
<p>The idea of probability representing the outcome frequency of infinite i.i.d. trials is formally captured by <span class="marginnote-outer"><span class="marginnote-ref"><a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Strong_law">Borel’s law of large numbers (BLLN)</a></span><label for="f8c0e2c3419ca4b7592f9b45eebbe3fc48accdc3" class="margin-toggle"> ⊕</label><input type="checkbox" id="f8c0e2c3419ca4b7592f9b45eebbe3fc48accdc3" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">This is a special case of the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Strong_law">strong law of large numbers</a>. There are a few variants of the law of large numbers (LLN), e.g. <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Weak_law">weak law</a>, but I feel BLLN most straightforwardly expresses the insight I wish to convey.</span></span></span>. Given any single-trial event $e \in E$, we have:</p>
<p><span class="kdmath">$P_\infty\left(\left\{\omega_\infty \in \Omega_\infty \bigmid \lim_{n \to \infty} \frac{1}{n} \sum\limits_{i=1}^n 𝟙[X_i(\omega_\infty) \in e] = P(e)\right\}\right) = 1\,,$</span><br />
where $𝟙[\mathrm{expr}]$ casts boolean $\mathrm{expr}$ to an integer (1 if true, 0 otherwise). The sum</p>
<div class="kdmath">$$
\sum\limits_{i=1}^n 𝟙[X_i(\omega_\infty) \in e]
$$</div>
<p>computes a count: the number of times event $e$ occurs in the first $n$ trials, where $\omega_\infty$ is the infinite sequence of trial samples. Dividing by $n$ gives the frequency, i.e. fraction of times $e$ appears out of the first $n$ trials.</p>
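<p>This count-and-divide computation is easy to simulate. A sketch with a biased coin, the event $e$ being “heads” with $P(e) = 0.3$ (all names are illustrative):</p>

```python
import random

def frequency(p_heads, n, seed=0):
    # Empirical frequency: (1/n) Σ 𝟙[X_i ∈ e] over n i.i.d. tosses,
    # where e is the event "heads" and P(e) = p_heads.
    rng = random.Random(seed)
    count = sum(1 for _ in range(n) if rng.random() < p_heads)
    return count / n

p = 0.3
freqs = [frequency(p, n) for n in (100, 10_000, 1_000_000)]
# The frequency should approach P(e) = 0.3 as n grows (almost surely).
assert abs(freqs[-1] - p) < 0.01
```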
<p>Borel’s law of large numbers (BLLN) can then be written more concisely using our fun RV notation:</p>
<div class="kdmath">$$
P_\infty\left(\lim_{n \to \infty} \frac{1}{n} \sum\limits_{i=1}^n 𝟙[X_i \in e] = P(e)\right) = 1\,,
$$</div>
<p>or using <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables#Almost_sure_convergence">almost sure convergence notation</a>:</p>
<div class="kdmath">$$
\frac{1}{n} \sum\limits_{i=1}^n 𝟙[X_i \in e] \overset{\mathrm{a.s.}}{\longrightarrow} P(e)\,,
$$</div>
<p>though the latter does not make it clear that $P_\infty$ is our measure.</p>
<p>This equation is very intriguing, as it directly relates samples from $P_\infty$ to measure $P$. In short, BLLN states that there is a measure 1 set of infinite sequences of i.i.d. trials s.t. the limiting number of occurrences of event $e \in E$ as a fraction of the total number of trials is exactly $P(e)$. The implication is that <span class="marginnote-outer"><span class="marginnote-ref">almost surely</span><label for="808c2858de3d6d2f40c660f42492fb70f8369082" class="margin-toggle"> ⊕</label><input type="checkbox" id="808c2858de3d6d2f40c660f42492fb70f8369082" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For a measure 1 subset of samples in $\Omega_\infty$, of which each sample is itself an infinite sequence of single-trial samples.</span></span></span> we can infer $P$ from <span class="marginnote-outer"><span class="marginnote-ref">just one sample</span><label for="298b90b7dd6823ffb37aa5cdbc6eeb109ba14086" class="margin-toggle"> ⊕</label><input type="checkbox" id="298b90b7dd6823ffb37aa5cdbc6eeb109ba14086" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Technically the singleton event containing just that sample.</span></span></span> of an infinite sequence of trials, thus apparently solving the inverse probability problem (almost surely) for the i.i.d. case.</p>
<p>As I mentioned earlier, countable events of real numbers are always measure 0 (<a href="https://proofwiki.org/wiki/Countable_Sets_Have_Measure_Zero">proof</a>) for probability measures defined on the reals. Sample set $\Omega_\infty$ has the cardinality of $\real$, and there is a <span class="marginnote-outer"><span class="marginnote-ref">natural bijection to the unit interval</span><label for="3ab93fbd33cfedf7956143fb287aa3dbb5c5101c" class="margin-toggle"> ⊕</label><input type="checkbox" id="3ab93fbd33cfedf7956143fb287aa3dbb5c5101c" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">If the sample space $\Omega$ of each trial is finite, we can think of a sequence $(\omega_1, \omega_2, \ldots)$ as the expansion of a number between 0 and 1 in base $\abs{\Omega}$.</span></span></span>. Therefore there are potentially infinitely many (countable) events in $\Omega_\infty$ for which BLLN does not hold. We may ask a similar question as before: can these BLLN-violating events happen?</p>
<p>Let’s step back and ask, what is so special about the BLLN anyway? Why should samples satisfy it? In fact, for any particular sample $\omega_\infty$, I can construct a measure 1 set <span class="kdmath">$\Omega_\infty \setminus \{\omega_\infty\}$</span> which does not contain it, simply because the singleton set <span class="kdmath">$\{\omega_\infty\}$</span> has measure 0. Thus it seems that for any sample, there is a “law” which states that it <em>almost surely</em> does not occur. In essence, all samples are special, or none are.</p>
<p>Ming Li and Paul Vitányi in their <a href="https://link.springer.com/book/10.1007%2F978-3-030-11298-1">An Introduction to Kolmogorov Complexity and Its Applications</a> summarize this conundrum quite well:</p>
<blockquote>
<p>We call a sequence ‘random’ if it is ‘typical.’ It is not ‘typical,’ say ‘special,’ if it has a particular distinguishing property. An example of such a property is that an infinite sequence contains only finitely many ones. There are infinitely many such sequences. But the probability that such a sequence occurs as the outcome of fair coin tosses is zero. ‘Typical’ infinite sequences will have the converse property, namely, they contain infinitely many ones.</p>
</blockquote>
<blockquote>
<p>In fact, one would like to say that ‘typical’ infinite sequences will have all converse properties of the properties that can be enjoyed by ‘special’ infinite sequences. This is formalized as follows: If a particular property, such as containing infinitely many occurrences of ones (or zeros), the law of large numbers, or the law of the iterated logarithm, has been shown to have probability one, then one calls this a law of randomness. A sequence is ‘typical,’ or ‘random,’ if it satisfies all laws of randomness.</p>
</blockquote>
<blockquote>
<p>But now we are in trouble. Since all complements of singleton sets in the sample space have probability one, it follows that the intersection of all sets of probability one is empty. Thus, there are no random infinite sequences!</p>
</blockquote>
<p>An elegant solution to this conundrum was discovered by <a href="http://www.nieuwarchief.nl/serie5/pdf/naw5-2018-19-1-044.pdf">Per Martin-Löf</a>, which <span class="marginnote-outer"><span class="marginnote-ref">restricts $P$ to be <a href="https://en.wikipedia.org/wiki/Computable_function">computable</a></span><label for="9d0d67e30e57d6a3fe027c7ed68c9f0b597b6bb7" class="margin-toggle"> ⊕</label><input type="checkbox" id="9d0d67e30e57d6a3fe027c7ed68c9f0b597b6bb7" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">It can be argued that all feasibly usable probability measures are necessarily computable, and so this is not really a restriction at all.</span></span></span>, but that is unfortunately out of scope for this post (I hope to write a future post on Martin-Löf’s solution).</p>
<h1 id="primer-to-measure-theory"><a class="header-anchor" href="#primer-to-measure-theory">Primer to measure theory</a></h1>
<p>Congratulations! You’ve reached the end of this post. <button class="advanced-button">Click here</button> (or on any <span class="advanced outer hidden"><span class="advanced inner hidden">purple block</span></span>) to unlock the <span class="advanced outer hidden"><span class="advanced inner hidden">purple text</span></span> on measure theory above. After reading this section, return to the earlier sections and take in the finer precision and details offered by your newfound understanding of measure theory.</p>
<p>Terence Tao, in <a href="https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf">An Introduction to Measure Theory</a>, motivates measure theory, saying:</p>
<blockquote>
<p>One of the most fundamental concepts in Euclidean geometry is that of the measure $m(E)$ of a solid body $E$ in one or more dimensions. In one, two, and three dimensions, we refer to this measure as the length, area, or volume of $E$ respectively.<br />
… The physical intuition of defining the measure of a body $E$ to be the sum of the measure of its component “atoms” runs into an immediate problem: a typical solid body would <span class="marginnote-outer"><span class="marginnote-ref">consist of an infinite (and uncountable) number of points</span><label for="2eba793a1e3bccec99c46511d5bb89c632d569b3" class="margin-toggle"> ⊕</label><input type="checkbox" id="2eba793a1e3bccec99c46511d5bb89c632d569b3" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">He is referring to the mathematical ideal of a body being composed of a set of 0-dimensional points.</span></span></span>, each of which has a measure of zero; and the product $\infty \cdot 0$ is indeterminate. To make matters worse, two bodies that have exactly the same number of points, need not have the same measure. For instance, in one dimension, the intervals $A := [0, 1]$ and $B := [0, 2]$ are in one-to-one correspondence (using the bijection $x \mapsto 2x$ from $A$ to $B$), but of course $B$ is twice as long as $A$. So one can disassemble $A$ into an uncountable number of points and reassemble them to form a set of twice the length.</p>
</blockquote>
<p>Terence also mentions the <a href="https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox">Banach-Tarski paradox</a>, which shows that a sphere can be partitioned into finitely many pieces (only 5 are needed!) and the pieces rearranged into two spheres. These kinds of non-measure-preserving sets are always going to be pathological, so the solution is to disallow measurement of these pathological sets. We call those sets <em>non-measurable</em>. If you are curious what non-measurable sets are like, Terence talks about them in section 1.2.3. In the case of the Banach-Tarski paradox, these sets look like fuzzy balls with infinitely many holes in them. The <a href="https://www.youtube.com/watch?v=s86-Z-CbaHA">video on Banach–Tarski by Vsauce</a> gives a good visual depiction.</p>
<p>I will not go into how measurable sets can be defined. There are many approaches, the most common of which is due to <a href="https://en.wikipedia.org/wiki/Lebesgue_measure">Lebesgue</a> (Tao section 1.3). It suffices to say that you cannot have all subsets of $\real$ be measurable without giving up <a href="https://en.wikipedia.org/wiki/Non-measurable_set#Consistent_definitions_of_measure_and_probability">desirable properties of <em>measure</em></a>, e.g. that rearranging and rotating disjoint sets does not change their cumulative measure. In what follows, I’m going to assume that for some set $\Omega$ of any cardinality (finite, countable, uncountable, etc.), we just so happen to be in possession of a reasonable set of measurable sets $E \subseteq 2^\Omega$ and the associated measure $P$. Read Terry’s book for details on how to construct such things. I’m merely going to run through the important definitions and terminology pertaining to probability theory, using the naming conventions of probability theory rather than measure theory.</p>
<p>Let $\Omega$ be some set of any cardinality (finite, countable, uncountable, etc.). Assume we are in possession of the set of all measurable subsets $E \subseteq 2^\Omega$, and $P$ is a <strong>measure</strong>. The triple $(\Omega, E, P)$ is called a <strong>measure space</strong>. $(\Omega, E)$ is a <strong>measurable space</strong> (where no measure is specified). Any set $e \in E$ is called <strong>measurable</strong> and $e’ \notin E$ is called <strong>non-measurable</strong>. The signature of $P$ is $E \to \real$, and so it maps only measurable sets to real numbers representing the measures (sizes) of those sets.</p>
<p>There are a few requirements for $P$ that make it behave like a measure. Repeated from <a href="#definitions">above</a>, they are:</p>
<ul>
<li><strong>Non-negativity</strong>: <span class="kdmath">$P(e) \geq 0,\ \forall e \in E$</span>.</li>
<li><strong>Null empty set</strong>: <span class="kdmath">$P(\emptyset) = 0$</span>.</li>
<li><strong>Countable additivity</strong>: For any countable <span class="kdmath">$A \subseteq E$</span> whose elements are pairwise disjoint, <span class="kdmath">$P(\bigcup A) = \sum_{e \in A} P(e)$</span>.</li>
</ul>
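<p>As a sanity check, here is a minimal sketch that verifies these requirements on a finite measure space. The three-element $\Omega$ and the counting measure are my own illustrative choices, not from the text.</p>

```python
# Verify the measure axioms on a tiny finite measure space.
# Omega and the counting measure are illustrative choices.
from itertools import combinations

Omega = frozenset({1, 2, 3})

def powerset(s):
    """All subsets of s, as frozensets."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

E = powerset(Omega)          # here every subset is measurable
P = {e: len(e) for e in E}   # counting measure: P(e) = |e|

# Non-negativity
assert all(P[e] >= 0 for e in E)
# Null empty set
assert P[frozenset()] == 0
# Additivity on disjoint sets (countable reduces to finite here)
for e1 in E:
    for e2 in E:
        if not (e1 & e2):  # pairwise disjoint
            assert P[e1 | e2] == P[e1] + P[e2]
```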
<p><a name="sigma-algebra"></a><span class="jump_to">
Further, $E$ is required to be a <span class="marginnote-outer"><span class="marginnote-ref"><strong>$\sigma$-algebra</strong></span><label for="c317d919018120cca3d580ae386d9ab852907363" class="margin-toggle"> ⊕</label><input type="checkbox" id="c317d919018120cca3d580ae386d9ab852907363" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Following Tao, section 1.4.2. For further information see <a href="https://en.wikipedia.org/wiki/%CE%A3-algebra">Wikipedia</a>.</span></span></span>, which means it satisfies:</span></p>
<ul>
<li><strong>Empty set</strong>: $\emptyset \in E$.</li>
<li><strong>Complement</strong>: If $e \in E$, then the complement $e^c := \Omega \setminus e$ is also in $E$.</li>
<li><strong>Countable unions</strong>: If $e_1, e_2, \ldots \in E$ then $\bigcup_{n=1}^\infty e_n \in E$.</li>
</ul>
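<p>On a finite $\Omega$, countable unions reduce to finite unions, so the $\sigma$-algebra conditions can be checked exhaustively. A small sketch, where the collection generated by a two-block partition is my own illustrative example:</p>

```python
# Check the three sigma-algebra conditions for a finite collection
# of subsets. On a finite Omega, countable unions are finite unions.
from functools import reduce
from itertools import combinations

Omega = frozenset({"a", "b", "c", "d"})
# A genuine sigma-algebra: generated by the partition {{a,b}, {c,d}}.
E = {frozenset(), frozenset({"a", "b"}), frozenset({"c", "d"}), Omega}

def is_sigma_algebra(Omega, E):
    if frozenset() not in E:                  # empty set
        return False
    if any(Omega - e not in E for e in E):    # complements
        return False
    # closure under (finite) unions of every subfamily
    for r in range(1, len(E) + 1):
        for family in combinations(E, r):
            if reduce(frozenset.union, family) not in E:
                return False
    return True

assert is_sigma_algebra(Omega, E)
# Not a sigma-algebra: the complement of {a} is missing.
assert not is_sigma_algebra(Omega, {frozenset(), frozenset({"a"}), Omega})
```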
<p>What this all amounts to is that our measure is always non-negative, the empty set is measurable with a measure of 0, complements and countable unions of measurable sets are measurable, and measure is additive (i.e. the sum of the measures of disjoint sets equals the measure of their union).</p>
<p>There’s one more kind of object that probability theory makes heavy use of: the measurable function. Recounting the definition I gave <a href="#motivation-3-construct-events-that-are-guaranteed-measurable">earlier</a>, given two measurable spaces $(A, \mathcal{A})$ and $(B, \mathcal{B})$, a <strong>measurable function</strong> $X : A \to B$ satisfies</p>
<div class="kdmath">$$
X^{-1}(b) \in \mathcal{A},\ \forall b \in \mathcal{B}\,,
$$</div>
<p>where <span class="kdmath">$X^{-1}(b) = \{\alpha \in A \mid X(\alpha) \in b\}$</span> is the pre-image of $b \subseteq B$ under $X$. In other words, the pre-image of a measurable subset of $B$ is always a measurable subset of $A$, even though $X$ could map a measurable subset of $A$ into a non-measurable subset of $B$. We only care about pre-images, and it becomes apparent why in the <a href="#motivation-3-construct-events-that-are-guaranteed-measurable">section on random variables</a>.</p>
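<p>The pre-image condition can likewise be checked exhaustively for finite spaces. In the sketch below, the spaces, $\sigma$-algebras, and maps are all my own illustrative choices:</p>

```python
# Check measurability of a map X between two finite measurable
# spaces by testing the pre-image condition.
def preimage(X, domain, b):
    """X^{-1}(b) = {alpha in domain | X(alpha) in b}."""
    return frozenset(a for a in domain if X[a] in b)

A = frozenset({1, 2, 3, 4})
B = frozenset({"lo", "hi"})
# sigma-algebras on A and B
calA = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), A}
calB = {frozenset(), frozenset({"lo"}), frozenset({"hi"}), B}

X = {1: "lo", 2: "lo", 3: "hi", 4: "hi"}  # measurable w.r.t. calA
Y = {1: "lo", 2: "hi", 3: "hi", 4: "hi"}  # preimage of {"lo"} is {1}, not in calA

def is_measurable(X, A, calA, calB):
    return all(preimage(X, A, b) in calA for b in calB)

assert is_measurable(X, A, calA, calB)
assert not is_measurable(Y, A, calA, calB)
```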
<p>A <strong>probability measure</strong> is a measure s.t. $P(\Omega) = 1$, i.e. the measure of the entire set $\Omega$ is 1.</p>
Fri, 19 Jun 2020 00:00:00 -0700
danabo.github.io/zhat/articles/primer-probability-theory
danabo.github.io/zhat/articles/primer-probability-theory
Notes: Probability & AI Curriculum<p>This is a snapshot of my curriculum for exploring the following questions:</p>
<ul>
<li>Is probability theory all you need to develop AI?
<ul>
<li>If not, what is missing?</li>
</ul>
</li>
<li>Should a theory of AI be expressed in the framework of probability theory at all?</li>
<li>Do brains use probability?</li>
</ul>
<!--more-->
<p>This reflects my current estimate of the landscape, and summarizes where my interests and aspirations have taken me so far. It is not set in stone. I may follow through on it, or I may diverge as I learn more. I primarily follow the current of my curiosity.</p>
<figure><img src="/assets/posts/probability-ai-curriculum/topic-tree.svg" alt="Visualization of topic tree" width="100%" /><figcaption>Visualization of topic tree. Nodes are organized hierarchically by level of abstraction, with dotted lines representing non-hierarchical associations. Colors designate hierarchy level. Made with <a href="https://www.yworks.com/yed-live/">https://www.yworks.com/yed-live/</a></figcaption></figure>
<h1 id="description-of-topics"><a class="header-anchor" href="#description-of-topics">Description of topics</a></h1>
<p>Here are the topics from the graph above, with descriptions to the extent that I understand them, and links to reference material.</p>
<ul>
<li>
<dl>
<dt><strong>Objective probability</strong></dt>
<dd>Is probability an objective property of physical systems in general (not just i.i.d.)? Objective, meaning independently arrived at by multiple parties, like a scientific experiment (just as mass and energy measurements can be independently verified) - i.e. not dependent on a particular brain with particular beliefs. If p(x) = θ, then this is true even if no humans are around at all to believe it. The main problem in making probability objective is figuring out how to uniquely determine the probability of something given observations. What needs to be measured in order to ascertain the objective probability of a system?</dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Solomonoff induction</strong></dt>
<dd>A Bayesian inference setup general enough to encompass general intelligence. The posterior converges to the true data posterior in the limit of infinite observations (for any prior with support everywhere), possibly providing an objective notion of probability, at least for infinite sequences.<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a><br />
‣ <a href="https://arxiv.org/abs/cs/0305052">On the Existence and Convergence of Computable Universal Priors</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Approximations</strong></dt>
<dd>How can SI be implemented in practice? How would brains implement it?<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm#approx">http://www.hutter1.net/ai/uaibook.htm#approx</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Posterior convergence</strong></dt>
<dd>The sense in which Solomonoff induction is objective. The predicted posterior converges to the true data posterior given infinite observations, for any prior with support over all hypotheses.<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a>, Theorem 3.19</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Posterior consistency</strong></dt>
<dd>Solomonoff induction may not be consistent, meaning it may fail to distinguish between two hypotheses even with infinite data. Implications for objective probability.</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Prior with universally optimal convergence</strong></dt>
<dd>Solomonoff’s universally optimal prior.<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a>, Theorem 3.70</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Convergence on individual sequences</strong></dt>
<dd>Convergence of Solomonoff induction is not guaranteed on a measure-0 set of sequences. Construction of such a sequence.<br />
‣ <a href="https://arxiv.org/abs/cs/0407057">Universal Convergence of Semimeasures on Individual Random Sequences</a>, Theorem 6 and Proposition 12</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>(Non-)Equivalence of Universal Priors</strong></dt>
<dd>A surprising equivalence between mixtures of deterministic programs and computable distributions.<br />
‣ <a href="https://arxiv.org/abs/1111.3854">(Non-)Equivalence of Universal Priors</a>, Theorem 14</dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Martin-Löf randomness</strong></dt>
<dd>What it means for an infinite sequence to be drawn from a probability distribution. Algorithmic definition of randomness (see AIT).<br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a></dd>
</dl>
<ul>
<li><strong>Definition in terms of universal probability</strong><br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a><br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a></li>
<li><strong>Can sequences be Martin-Löf random w.r.t. multiple probability measures?</strong></li>
</ul>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Bayesian epistemology</strong></dt>
<dd>Are priors and posteriors all that is needed for a complete theory of knowledge, and are they a sufficient framework for building an intelligent system? Bayesian epistemology repurposes probability as a property of the intelligent agent doing the observing, rather than of the system being observed (or perhaps it characterizes their interaction), i.e. probability as belief.</dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Bayesian brain hypothesis</strong></dt>
<dd>Hypothesis in neuroscience that the brain is largely an approximate Bayesian inference engine.<br />
‣ <a href="https://pubmed.ncbi.nlm.nih.gov/15541511/">The Bayesian Brain: The Role of Uncertainty in Neural Coding and Computation</a><br />
‣ <a href="https://mitpress.mit.edu/books/bayesian-brain">Bayesian Brain: Probabilistic Approaches to Neural Coding</a><br />
‣ <a href="https://www.annualreviews.org/doi/full/10.1146/annurev.psych.55.090902.142005">Object Perception as Bayesian Inference</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Friston’s free energy principle</strong></dt>
<dd>A unified theory of biological intelligence from which Bayesian epistemology can be derived.<br />
‣ <a href="https://arxiv.org/abs/1901.07945">What does the free energy principle tell us about the brain?</a><br />
‣ <a href="https://www.fil.ion.ucl.ac.uk/~karl/The%20free-energy%20principle%20A%20unified%20brain%20theory.pdf">The free-energy principle: a unified brain theory?</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>How brains approximate Bayesian inference</strong></dt>
<dd>To make the Bayesian brain hypothesis falsifiable, a characterization of what counts as an approximation to Bayesian inference needs to be given. What approximate Bayesian computations in the brain have been found so far by neuroscientists? <em>Reference same sources listed under “Bayesian brain hypothesis”</em></dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Causal inference</strong></dt>
<dd>If Bayesian epistemology is not sufficient, then what is missing? Judea Pearl proposes causal inference.<br />
‣ <a href="http://bayes.cs.ucla.edu/BOOK-2K/">Causality</a>, chapters 3 and 7<br />
‣ <a href="https://arxiv.org/abs/1305.5506">Introduction to Judea Pearl’s Do-Calculus</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Bounded Rationality</strong></dt>
<dd>What would Bayesian epistemology theoretically look like with bounded resources? Is Bayesian epistemology no longer optimal given bounded resources?<br />
‣ <a href="https://stanford.edu/~icard/BBRA.pdf">Bayes, Bounds, and Rational Analysis</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Logical justifications</strong></dt>
<dd>Arguments from first principles that Bayesian epistemology is a necessary condition for rationality, and that a rational agent is necessarily a Bayesian agent (such an agent is likely performing Solomonoff induction, in order for it to be sufficiently general in its prediction ability).</dd>
</dl>
<ul>
<li><strong>Dutch book argument</strong></li>
<li><strong>Complete classes</strong></li>
<li><strong>Cox’s theorem</strong></li>
<li><strong>Von Neumann-Morgenstern utility theorem</strong></li>
</ul>
</li>
<li>
<dl>
<dt><strong>Motivation from decision theory</strong></dt>
<dd>Some say a theory is good because it is useful. Perhaps the question “what theory of uncertainty should I use?” is best answered by looking at what we want to do with it, namely decision making under uncertainty. Bayesian epistemology can be motivated by decision theory.<br />
‣ <a href="https://www.goodreads.com/book/show/1639056.The_Foundations_of_Statistics">The Foundations of Statistics</a>, chapter 3</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Unique priors</strong></dt>
<dd>How to choose a prior is one point of contention in Bayesian epistemology. There are some proposed methods for selecting a unique prior given what you already know, for example, the max-entropy principle.<br />
‣ <a href="https://arxiv.org/abs/1108.2120">Objective Priors: An Introduction for Frequentists</a><br />
‣ <a href="https://arxiv.org/pdf/0808.0012.pdf">Lectures on Probability, Entropy, and Statistical Physics</a></dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Algorithmic information theory (AIT)</strong></dt>
<dd>An alternative to probability theory devised by Kolmogorov himself (and others) to address its shortcomings. Does AIT allow us to formalize the general learning problem of transferring knowledge out-of-distribution?<br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a><br />
‣ <a href="https://bookstore.ams.org/surv-220">Kolmogorov Complexity and Algorithmic Randomness</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Types of Kolmogorov complexity</strong></dt>
<dd>There is a constellation of algorithmic complexity functions that make up the foundation of AIT. <em>Reference same sources listed under “Algorithmic information theory”</em></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Resource bounded complexities</strong></dt>
<dd>Kolmogorov complexity with bounded computation. Possible direction for computable-AIT.<br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a>, chapter 7</dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Algorithmic transfer learning</strong></dt>
<dd>How can the information shared by two datasets be defined? What is the objective of transfer learning?<br />
‣ <a href="http://users.cecs.anu.edu.au/~hassan/univTLTCS.pdf">On Universal Transfer Learning</a><br />
‣ <a href="https://papers.nips.cc/paper/3228-transfer-learning-using-kolmogorov-complexity-basic-theory-and-empirical-evaluations.pdf">Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations</a><br />
‣ <a href="https://arxiv.org/abs/1904.03292">The Information Complexity of Learning Tasks, their Structure and their Distance</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>No free lunch theorem</strong></dt>
<dd>Theorem stating there is no universally best algorithm for all training-test dataset pairs.<br />
‣ <a href="https://www.cse.huji.ac.il/~shais/UnderstandingMachineLearning/">Understanding Machine Learning: From Theory to Algorithms</a>, Theorem 5.1</dd>
</dl>
</li>
</ul>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>AIXI</strong></dt>
<dd>A theory of optimal intelligence put forth by Marcus Hutter, based on Solomonoff induction.<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Data compression</strong></dt>
<dd>Lossless compression from the perspectives of Shannon’s information theory and AIT. Can they be unified? Can compression make probability objective? What is the relationship between compression and intelligence?<br />
‣ <a href="https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959">Elements of Information Theory</a><br />
‣ <a href="http://mattmahoney.net/dc/dce.html">Data Compression Explained</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Decision theory under ignorance</strong></dt>
<dd>Decision theory without probability. Pros and cons.<br />
‣ <a href="https://www.cambridge.org/core/books/an-introduction-to-decision-theory/B9EEB3DCE5D0CAFFB6F3F30B1D0A06A6">An Introduction to Decision Theory</a>, chapter 3</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>The Fundamental Theorem of Statistical Learning (PAC)</strong></dt>
<dd>An introduction to PAC-learning theory. PAC is a probability-theory-based account of machine learning which AIT could replace.<br />
‣ <a href="https://www.cse.huji.ac.il/~shais/UnderstandingMachineLearning/">Understanding Machine Learning: From Theory to Algorithms</a>, Theorem 6.7</dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>PAC account of transfer learning</strong></dt>
<dd>PAC analysis of transfer learning. However, assumptions about relatedness of tasks need to be made.<br />
‣ <a href="https://arxiv.org/abs/1106.0245">A Model of Inductive Bias Learning</a></dd>
</dl>
</li>
</ul>
</li>
</ul>
Wed, 17 Jun 2020 00:00:00 -0700
danabo.github.io/zhat/articles/probability-ai-curriculum
danabo.github.io/zhat/articles/probability-ai-curriculum
Notes: Dutch Book Argument<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#axioms-of-probability" id="markdown-toc-axioms-of-probability">Axioms of probability</a> <ul>
<li><a href="#axioms-of-probability-for-propositional-logic" id="markdown-toc-axioms-of-probability-for-propositional-logic">Axioms of probability for propositional logic</a></li>
</ul>
</li>
<li><a href="#i-if-not-bayesian--sure-loss-is-possible" id="markdown-toc-i-if-not-bayesian--sure-loss-is-possible">I. If not Bayesian ⟹ sure loss is possible</a> <ul>
<li><a href="#but-how-does-this-lead-to-bayes-rule" id="markdown-toc-but-how-does-this-lead-to-bayes-rule">But how does this lead to Bayes rule?</a></li>
<li><a href="#but-is-real-life-a-series-of-bets" id="markdown-toc-but-is-real-life-a-series-of-bets">But is real-life a series of bets?</a></li>
</ul>
</li>
<li><a href="#ii-if-bayesian--sure-loss-is-not-possible" id="markdown-toc-ii-if-bayesian--sure-loss-is-not-possible">II. If Bayesian ⟹ sure loss is not possible</a></li>
</ul>
<p>Main source: <a href="https://plato.stanford.edu/entries/dutch-book/">Dutch Book Arguments (SEP)</a></p>
<p>Dutch Book Theorem:</p>
<blockquote>
<p>Given a set of betting quotients that fails to satisfy the probability axioms, there is a set of bets with those quotients that guarantees a net loss to one side.</p>
</blockquote>
<p>Converse Dutch Book Theorem:</p>
<blockquote>
<p>For a set of betting quotients that obeys the probability axioms, there is no set of bets (with those quotients) that guarantees a sure loss (win) to one side.</p>
</blockquote>
<p><a href="https://www.stat.berkeley.edu/~census/dutchdef.pdf">https://www.stat.berkeley.edu/~census/dutchdef.pdf</a>:</p>
<blockquote>
<p>Dutch book cannot be made against a Bayesian bookie.</p>
</blockquote>
<p>I. If not Bayesian ⟹ sure loss is possible</p>
<ul>
<li><a href="https://link.springer.com/chapter/10.1007%2F978-1-4612-0919-5_10">Foresight: Its Logical Laws, Its Subjective Sources</a></li>
</ul>
<p>II. If Bayesian ⟹ sure loss is not possible</p>
<ul>
<li><a href="https://www.jstor.org/stable/2268221?seq=1">On Confirmation and Rational Betting</a></li>
<li><a href="https://www.jstor.org/stable/2268222">Fair Bets and Inductive Probabilities</a></li>
</ul>
<p>Counter-arguments:</p>
<ul>
<li><a href="https://link.springer.com/article/10.1023/A:1004996226545">Hidden Assumptions in the Dutch Book Argument</a></li>
</ul>
<h1 id="axioms-of-probability"><a class="header-anchor" href="#axioms-of-probability">Axioms of probability</a></h1>
<p><strong>The axioms of probability:</strong><br />
Let $(\Omega, \mathcal{E}, P)$ be a measure space, where $\Omega$ is the sample set (mutually exclusive outcomes), $\mathcal{E}$ is the event set (set of measurable subsets of $\Omega$), and $P$ is the probability measure ($P(E),\ \forall E \in \mathcal{E}$ is well defined).</p>
<ol>
<li>$P(E) \geq 0,\ \forall E \in \mathcal{E}$</li>
<li>$P(\Omega) = 1$</li>
<li>$E_1 \cap E_2 = \emptyset \implies P(E_1 \cup E_2) = P(E_1) + P(E_2),\ \forall E_1,E_2 \in \mathcal{E}$</li>
</ol>
<p>Note that $P(E) \leq 1,\ \forall E \in \mathcal{E}$ follows directly from the axioms: $E$ and its complement are disjoint, so $P(E) + P(\overline{E}) = P(\Omega) = 1$, and $P(\overline{E}) \geq 0$.</p>
<h2 id="axioms-of-probability-for-propositional-logic"><a class="header-anchor" href="#axioms-of-probability-for-propositional-logic">Axioms of probability for propositional logic</a></h2>
<p>We can define probability over propositional statements. The sample set $\Omega$ is the set of all truth assignments to the primitives. If $(A_1, A_2, \ldots)$ is the set of all primitive propositions, then $\Omega = \{(\mathrm{False}, \mathrm{False}, \ldots), (\mathrm{True}, \mathrm{False}, \ldots), (\mathrm{False}, \mathrm{True}, \ldots), (\mathrm{True}, \mathrm{True}, \ldots), \ldots\}$ is every possible truth assignment for $(A_1, A_2, \ldots)$. This assumes that we don’t know the truth value of any primitive. The <a href="https://en.wikipedia.org/wiki/Logical_connective">logical connectives</a>, $\wedge, \vee, \neg,$ etc., are all shorthands for constructing events (sets of truth assignments for $(A_1, A_2, \ldots)$). In other words, $P(H)$ is shorthand for the probability that proposition $H$ is true, where $H$ denotes the event $E$ containing exactly those truth assignments for $(A_1, A_2, \ldots)$ which make $H$ true.</p>
<p>Note that when there are finitely many $A_i$, there will be finitely many possible events. However, there are infinitely many logical propositions over finitely many primitives $A_i$. This is because most logical propositions are equivalent to others. In other words, the sets of primitive assignments that satisfy propositions induce equivalence classes over the set of propositions, and the set of equivalence classes is finite for finitely many primitives.</p>
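<p>A small sketch of this construction, with two primitives and a uniform measure over the four truth assignments (both illustrative choices of mine):</p>

```python
# Probability over propositional logic with two primitives A1, A2.
# An event is the set of truth assignments making a proposition true.
from itertools import product

assignments = list(product([False, True], repeat=2))  # Omega
weights = {w: 0.25 for w in assignments}              # a probability measure

def event(prop):
    """Event of a proposition: all assignments (a1, a2) satisfying it."""
    return frozenset(w for w in assignments if prop(*w))

def P(prop):
    return sum(weights[w] for w in event(prop))

H1 = lambda a1, a2: a1 and a2   # A1 ∧ A2
H2 = lambda a1, a2: a1 or a2    # A1 ∨ A2
assert P(H1) == 0.25
assert P(H2) == 0.75
# Equivalent propositions denote the same event (equivalence classes)
H2b = lambda a1, a2: not (not a1 and not a2)  # ¬(¬A1 ∧ ¬A2)
assert event(H2) == event(H2b)
```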
<p>Now the axioms for probability over propositional logic are just a special case of the general axioms:<br />
Let $\mathcal{H}$ be the set of all logical propositions.<br />
Let $\mathrm{True}$ be the proposition <em>True</em>, which is satisfied by all truth assignments of the primitives (i.e. the event containing all samples).</p>
<ol>
<li>$P(H)\geq 0,\ \forall H \in \mathcal{H}$</li>
<li>$P(\mathrm{True}) = 1$</li>
<li>$\neg (H_1 \wedge H_2) \implies P(H_1 \vee H_2) = P(H_1) + P(H_2),\ \forall H_1, H_2 \in \mathcal{H}$</li>
</ol>
<p>Axiom 3 states that the probability of $H_1$ or $H_2$ is the sum of their probabilities when $H_1$ and $H_2$ cannot both be true at the same time. $H_1 \wedge H_2$ constructs the set of primitive assignments where both propositions are true, which is just the intersection of their respective events, $E_1 \cap E_2$. Writing $H_1 \wedge H_2 = \mathrm{False}$, where $\mathrm{False}$ is the empty event, is unconventional, so instead we write $\neg (H_1 \wedge H_2)$, which is equivalent to $\overline{E_1 \cap E_2} = \Omega$ (complement). One could also weaken the condition to $P(H_1 \wedge H_2) = 0$.</p>
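<p>Here is a sketch of axiom 3 in action for two mutually exclusive propositions. The primitives and the uniform weights are my own illustrative choices:</p>

```python
# Axiom 3 for propositions: additivity holds when H1 ∧ H2 is unsatisfiable.
from itertools import product

assignments = list(product([False, True], repeat=2))
weights = {w: 0.25 for w in assignments}

def P(prop):
    return sum(weights[w] for w in assignments if prop(*w))

H1 = lambda a1, a2: a1 and not a2
H2 = lambda a1, a2: a2 and not a1
# H1 ∧ H2 is unsatisfiable, i.e. ¬(H1 ∧ H2) holds under every assignment
assert all(not (H1(*w) and H2(*w)) for w in assignments)
H1_or_H2 = lambda a1, a2: H1(a1, a2) or H2(a1, a2)
assert P(H1_or_H2) == P(H1) + P(H2)
```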
<p>We could also define probability over 1st order logic. Now $A(x)$ is a proposition on $x$, where $x$ is a non-proposition type (e.g. number). Let’s say $x$ is a natural number, then we have infinitely many primitive propositions $A(x)$ for each $x \in \mathbb{N}$.</p>
<p>How does this inform the claims made in <a href="https://meaningness.com/probability-and-logic">https://meaningness.com/probability-and-logic</a> ?</p>
<h1 id="i-if-not-bayesian--sure-loss-is-possible"><a class="header-anchor" href="#i-if-not-bayesian--sure-loss-is-possible">I. If not Bayesian ⟹ sure loss is possible</a></h1>
<p>The Dutch book argument (DBA) uses traditional terminology around betting which I find to be confusing in this context, like <em>bookie</em>, <em>agent</em>, <em>making book</em>, etc., so I will take care to clarify the meaning of all these things.</p>
<p>Consider an asymmetric two-player betting game. Player 1 transacts with player 2 who sets prices.</p>
<p>Player 1:</p>
<ul>
<li>Called the <em>agent</em></li>
<li>Chooses what bets to buy or sell to/from the bookie at the bookie’s prices.</li>
<li>Chooses the stakes.</li>
<li>Accepts the bookie’s prices.</li>
</ul>
<p>Player 2:</p>
<ul>
<li>Called the <em>bookie</em></li>
<li>Chooses the prices for bets.</li>
<li>Must buy or sell any bets the agent requests, at whatever stakes the agent requests.</li>
</ul>
<p>The DBA shows that the agent can take advantage of the bookie (make book against the bookie) iff the bookie’s bet prices do not conform to the axioms of probability. Here, taking advantage means transacting (buying/selling) a set of bets with the bookie that guarantees the agent wins money off the bookie in every scenario.</p>
<p>We assume the bookie makes all prices on all possible bets known to the agent from the start, and the bookie cannot change these prices. The bookie wants to choose prices such that the agent cannot make book against him/her.</p>
<p>A bet is defined by a stake $S$, betting quotient $q$, and target event $E$. When the agent buys a bet (bookie sells a bet) at stake $S$ with quotient $q$, the agent’s payoff is $S-qS$ if $E$ occurs, and $-qS$ otherwise. That is, the stake $S$ is only paid out to the agent (holder of the stake) when the betting target $E$ occurs, while $qS$ is paid to the seller regardless of outcome, as a fee.</p>
<p>The agent can also sell a bet to the bookie, which just negates the payoffs. The bookie pays the agent the fee, and the agent pays out the stake to the bookie if $E$ occurs.</p>
<p>We saw how probability over logical propositions is a special case. I think it is easier to reason about DBA if we instead consider an arbitrary probability distribution over events $E \in \mathcal{E}$. These events are the possible targets of bets. The bookie must choose $q(E)$ for each event. $q(E)$ will end up being a probability measure over $\mathcal{E}$. We will now show that if $q(E)$ violates any of the axioms, the agent can make book against the bookie.</p>
<p>Define a bet as a function $B : \Omega \to \mathbb{R}$ from samples to payoffs. A bet on event $E$ with stake $S$ and quotient $q$ has payoffs (w.r.t. buyer):</p>
<div class="kdmath">$$
B_E(\omega) = \begin{cases}S-q(E)S & \omega \in E \\ -q(E)S & \omega \notin E\end{cases}
$$</div>
<p>This can be represented as a table:</p>
<table>
<thead>
<tr>
<th>Result</th>
<th>Payoff</th>
</tr>
</thead>
<tbody>
<tr>
<td>$E$</td>
<td>$S-q(E)S$</td>
</tr>
<tr>
<td>$\overline{E}$</td>
<td>$-q(E)S$</td>
</tr>
</tbody>
</table>
<p>Assuming the stake $S$ is always the same (this argument is invariant to the stake, as long as it’s positive), a bet is represented by $B_E$. Since this game is zero-sum, from the seller’s perspective the payoff is $-B_E$; buying $-B_E$ is equivalent to selling $B_E$. We can also add bets like this</p>
<div class="kdmath">$$
\left(B_{E_1} + B_{E_2}\right)(\omega) = B_{E_1}(\omega) + B_{E_2}(\omega)\,,
$$</div>
<p>to construct a more complicated multi-outcome bet, denoted as $B_{E_1} + B_{E_2}$.</p>
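<p>The bet algebra above can be sketched directly: a bet is a function from outcomes to payoffs, selling negates it, and bets add pointwise. The sample space, quotients, and stake below are my own illustrative choices:</p>

```python
# A bet maps outcomes to payoffs; bets can be negated (sold) and added.
Omega = frozenset({"rain", "sun", "snow"})
q = {frozenset({"rain"}): 0.5, frozenset({"sun", "snow"}): 0.5}
S = 10.0  # stake

def bet(E):
    """Buyer's payoff on event E: S - qS if E occurs, -qS otherwise."""
    return lambda w: S - q[E] * S if w in E else -q[E] * S

def neg(B):
    return lambda w: -B(w)

def add(B1, B2):
    return lambda w: B1(w) + B2(w)

B_rain = bet(frozenset({"rain"}))
assert B_rain("rain") == 5.0   # S - qS = 10 - 5
assert B_rain("sun") == -5.0   # -qS
# Buying a bet and selling it back nets zero in every outcome
cancel = add(B_rain, neg(B_rain))
assert all(cancel(w) == 0.0 for w in Omega)
# With these coherent quotients (summing to 1), buying both sides
# of the partition also nets zero in every outcome.
B_combo = add(B_rain, bet(frozenset({"sun", "snow"})))
assert all(B_combo(w) == 0.0 for w in Omega)
```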
<p>Now I am ready to outline why bets should conform to the three axioms:</p>
<p><strong>Axiom 1:</strong> $P(E) \geq 0,\ \forall E \in \mathcal{E}$.<br />
Assume $q(E) < 0$ for some $E$.<br />
Then the agent will buy $B_E$, which has a positive payoff in all cases.</p>
<p><strong>Axiom 2:</strong> $P(\Omega) = 1$.<br />
Note that by definition event $\Omega$ always happens, and this is known to both the agent and bookie.<br />
Assume $q(\Omega) < 1$.<br />
Then the agent will buy $B_\Omega$ since the payoff is always $S-q(\Omega)S$ which is positive.<br />
Assume $q(\Omega) > 1$.<br />
Then the agent will buy $-B_\Omega$ (sell $B_\Omega$) since the payoff is always $-(S-q(\Omega)S)$ which is positive.</p>
<p><strong>Axiom 3:</strong> $E_1 \cap E_2 = \emptyset \implies P(E_1 \cup E_2) = P(E_1) + P(E_2),\ \forall E_1,E_2 \in \mathcal{E}$.<br />
Note that $E_1 \cap E_2 = \emptyset$ means $E_1$ and $E_2$ cannot happen simultaneously (by definition of the empty event), and this is known to both the agent and bookie.<br />
Assume $E_1 \cap E_2 = \emptyset$ for some $E_1, E_2$.<br />
Assume $q(E_1 \cup E_2) > q(E_1) + q(E_2)$.<br />
Then the agent will buy $B_{E_1} + B_{E_2} - B_{E_1 \cup E_2}$ which has payoff table (w.r.t. agent):</p>
<table>
<thead>
<tr>
<th>Result</th>
<th>Payoff</th>
</tr>
</thead>
<tbody>
<tr>
<td>$E_1$</td>
<td>$-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$</td>
</tr>
<tr>
<td>$E_2$</td>
<td>$-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$</td>
</tr>
<tr>
<td>$\overline{E_1 \cup E_2}$</td>
<td>$-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$</td>
</tr>
</tbody>
</table>
<p>The payoff is the same in all cases ($E_1 \cap E_2$ never occurs), and $-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$ is positive by assumption.<br />
Assume $q(E_1 \cup E_2) < q(E_1) + q(E_2)$.<br />
Then the agent buys $-B_{E_1} - B_{E_2} + B_{E_1 \cup E_2}$, which can be shown in the same way to win money for the agent in every scenario.</p>
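<p>To make the axiom 3 exploit concrete, here is a sketch with illustrative numbers of my own choosing, showing that the agent's portfolio pays the same positive amount in every outcome:</p>

```python
# The sure-win portfolio when q(E1 ∪ E2) > q(E1) + q(E2).
Omega = frozenset({1, 2, 3})
E1, E2 = frozenset({1}), frozenset({2})
E12 = E1 | E2
S = 10.0
# Violation of additivity: q(E1 ∪ E2) > q(E1) + q(E2)
q = {E1: 0.2, E2: 0.3, E12: 0.7}

def buy(E):
    return lambda w: S - q[E] * S if w in E else -q[E] * S

def sell(E):
    b = buy(E)
    return lambda w: -b(w)

# Agent buys B_E1 and B_E2, and sells B_{E1 ∪ E2}
portfolio = lambda w: buy(E1)(w) + buy(E2)(w) + sell(E12)(w)
payoffs = {w: portfolio(w) for w in Omega}
# Same positive payoff in every outcome:
# -(q(E1) + q(E2) - q(E1 ∪ E2)) * S = -(0.2 + 0.3 - 0.7) * 10 = 2.0
assert all(abs(p - 2.0) < 1e-9 for p in payoffs.values())
```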
<p>Thus it is wise for the bookie to choose $q : \mathcal{E} \to \mathbb{R}$ s.t. it obeys the three axioms of probability.</p>
<p>The DBA is ingenious because it does not assume any a priori probabilities over outcomes (i.e. objective probability), and it holds for 1-shot events (i.e. does not assume the game is repeatable).</p>
<h2 id="but-how-does-this-lead-to-bayes-rule"><a class="header-anchor" href="#but-how-does-this-lead-to-bayes-rule">But how does this lead to Bayes rule?</a></h2>
<p>Bayesian epistemology centers around using Bayes rule to compute a posterior from a prior. Where is the prior and posterior here?</p>
<p>$q(E)$ is not a prior because $E$ is a datum, not a hypothesis. The DBA concludes that $q$ should be a valid probability measure. But how do we do all the fancy stuff that Bayesian inference requires like marginalizing over variables and computing conditional probabilities? To do that, we need at least two random variables, which we could define over our sample space. What would those two RVs be?</p>
<p>In the framing of DBA, the world starts out completely unknown, and at the conclusion of the betting becomes completely known. There is no reason for a prior or posterior distribution. $q(E)$ is a likelihood distribution conditioned on nothing, i.e. probability of data without regard to hypotheses. There is nothing Bayesian about this because we are <em>literally not using Bayes’ rule because we only have one random variable!</em></p>
<p>However, the DBA is clearly suggesting that $q(E)$ should encode the bookie’s beliefs about the outcomes, and is thus the prior. So then what is the posterior? We can get around this conundrum by supposing the bookie takes bets on the outcomes of some underlying process, i.e. time series, and updates $q(E)$ as time passes and outcomes are observed. Now we are computing a type of posterior: $q(E_t \mid E_{1:t-1})$ where $E_{1:t-1}$ is all previous observations. Hypotheses are technically not needed, but the bookie is free to secretly have a second RV over hypotheses under the hood (maybe the bookie is doing Solomonoff induction).</p>
<p><strong>Question:</strong> Are we confusing frequentist and Bayesian probability here?<br />
In the Bayesian paradigm, hypotheses are themselves usually probability distributions, i.e. $p(X \mid H=h) = p_h(X)$ where $p_h(X)$ is a hypothesis labeled with $h$. What is the meaning of the probabilities in $p_h(X)$? Are these probabilities objective? If not, what does it mean for a hypothesis to be satisfied by data? We could consider likelihood to be a score, rather than an objective quantity, and a better hypothesis has a better score by definition (rather than thinking of the likelihood of data under the hypothesis as a frequentist prediction that can be tested through repeated experiment).</p>
<h2 id="but-is-real-life-a-series-of-bets"><a class="header-anchor" href="#but-is-real-life-a-series-of-bets">But is real-life a series of bets?</a></h2>
<p>The setup of the game described above ends up being isomorphic to probability theory.</p>
<p><strong>Question:</strong> Why does this isomorphism exist? Is there something intrinsic about betting that makes it conform to the rules of probability, or is this an artifact of the particular betting payout definition we are using?<br />
This payout scheme is apparently economically justified rather than arbitrarily chosen, i.e. bets (with predetermined payouts) traded on a market will have purchase prices that converge to the model above (assuming sufficient arbitrage). Note that in real life quotients are discretized and don’t sum exactly to 1, and there is a bid-ask spread which essentially adds a transaction cost to everything. Real world example: <a href="https://www.predictit.org/markets/detail/3698">https://www.predictit.org/markets/detail/3698</a>. In economics, decisions under uncertainty are modeled in the same betting form, e.g. insurance (the premium is the quotient, the payout is the stake).</p>
<p><strong>Question:</strong> During the course of everyday life, is the universe going to make book against us lest we conform to the rules of probability? Do the sorts of real-life bets we actually encounter and place have the same structure as the idealized betting above?</p>
<p>Who are the bookie and the agent? The DBA says that you are the bookie over the course of your life, and you want to prevent the universe (or adversarial actors) from taking advantage of you. The problem is that you are often the one making decisions, i.e. deciding which bets to place. This is a different game, where the bookie chooses the quotients and the bets together. Also, real life is not zero-sum. You will encounter win-win and lose-lose situations where you have to place a bet one way or the other and net win or lose. Nor is the universe an optimally rational agent. I don’t expect the universe to spontaneously Dutch-book me. I don’t even expect people to Dutch-book me, because that would take work. In practice, people are not acting optimally.</p>
<p><strong>Question:</strong> What if outcomes are not binary, i.e. $\omega \in E$ and $\omega \notin E$ are not the only possibilities (we don’t assume the law of excluded middle, so one must construct a proof of $\omega \in E$ or of $\omega \notin E$)?<br />
For example, what if it is not always possible to determine whether an event occurred? This is the case in Solomonoff induction, which uses a semi-measure rather than a measure to get around this problem. In practice, with things like elections and trials there is a large vested interest in ensuring an outcome is determined. But over the course of everyday life, there are many ambiguities.</p>
<p>In real life, you are more like the agent. You choose your bets (take actions) with pre-defined payoffs (the downstream results of your actions are not usually in your control). These payoffs are not logically determined, but are the result of (often arbitrary) circumstance. It is very easy to Dutch book the universe! That’s generally how growth and progress happen.</p>
<p>Presumably in a formal betting scenario the bookie’s probabilities are well-tuned, so that the bookie is indifferent to whether someone buys or sells a given bet. In everyday life, the payoffs of your decisions usually do not match your preferred betting quotients, so that there is one or a few best bets. The whole point of betting is that you believe the outcomes don’t match the “true” betting quotients. The DBA assumes that someone else might give you a series of bets which are locally in agreement with your quotients but globally a guaranteed loss. The problem is, you may not be compelled to take bets that agree with your expectations, but only take bets where the expected return is positive, i.e. disagreement.</p>
<h1 id="ii-if-bayesian--sure-loss-is-not-possible"><a class="header-anchor" href="#ii-if-bayesian--sure-loss-is-not-possible">II. If Bayesian ⟹ sure loss is not possible</a></h1>
<p>TODO</p>
Thu, 11 Jun 2020 00:00:00 -0700
danabo.github.io/zhat/articles/notes-dutch-book-argument
Notes: Complete Class Theorems<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#results-to-understand-in-hoff" id="markdown-toc-results-to-understand-in-hoff">Results to understand in Hoff</a></li>
<li><a href="#notes" id="markdown-toc-notes">Notes</a> <ul>
<li><a href="#complete-class-theorem-i" id="markdown-toc-complete-class-theorem-i">Complete class theorem I</a></li>
<li><a href="#complete-class-theorem-ii" id="markdown-toc-complete-class-theorem-ii">Complete class theorem II</a></li>
<li><a href="#euclidean-parameter-spaces" id="markdown-toc-euclidean-parameter-spaces">Euclidean parameter spaces</a></li>
<li><a href="#complete-class-theorem-iii" id="markdown-toc-complete-class-theorem-iii">Complete class theorem III</a></li>
</ul>
</li>
<li><a href="#interpretation-and-implications" id="markdown-toc-interpretation-and-implications">Interpretation and implications</a> <ul>
<li><a href="#discussion" id="markdown-toc-discussion">Discussion</a></li>
</ul>
</li>
</ul>
<p><strong>Objective:</strong> I want to understand the complete class theorems because they are a common argument for Bayesian epistemology, a theory of knowledge that puts forward Bayesian posterior calculation as all you need. In order to properly evaluate whether “being Bayesian” is enough of a theoretical framework to build and explain intelligence, I need to understand arguments for Bayesian epistemology.</p>
<p>The argument boils down to:</p>
<blockquote>
<p>If you agree with expected utility as your objective, then you have to be Bayesian.</p>
</blockquote>
<p>In a nutshell: A strategy is inadmissible if there exists another strategy that is at least as good in all situations and strictly better in at least one. If you want your strategy to be admissible, it should be equivalent to a Bayes estimator.</p>
<p>Complete class theorems: every admissible strategy is (equivalent to) a Bayes strategy, and, under conditions, Bayes strategies are admissible.</p>
<p>I’m mainly following <a href="https://www.stat.washington.edu/people/pdhoff/courses/581/LectureNotes/admiss.pdf">Admissibility and complete classes - Peter Hoff</a>.</p>
<p>Related study notes: <a href="https://docs.google.com/document/d/1fCseo1fsPwJfjnehauAzOr4bf1GHHRfRW6cHwNQTNu4/edit">Wald’s Complete Class Theorem(s) - study notes</a></p>
<h1 id="results-to-understand-in-hoff"><a class="header-anchor" href="#results-to-understand-in-hoff">Results to understand in <a href="https://www.stat.washington.edu/people/pdhoff/courses/581/LectureNotes/admiss.pdf">Hoff</a></a></h1>
<p><strong>Section 1</strong>:<br />
<img src="https://i.imgur.com/KSZ6PVb.png" alt="" /></p>
<p><strong>Section 2</strong>:<br />
<img src="https://i.imgur.com/F94ljVs.png" alt="" /><br />
<img src="https://i.imgur.com/2QW8pcP.png" alt="" /></p>
<p><strong>Section 3</strong>:<br />
<img src="https://i.imgur.com/U2npCDa.png" alt="" /></p>
<p><strong>Section 4</strong>:<br />
<img src="https://i.imgur.com/XPqhZ4E.png" alt="" /><br />
<img src="https://i.imgur.com/H8uda4H.png" alt="" /><br />
<img src="https://i.imgur.com/fAvOCcu.png" alt="" /></p>
<p><strong>Section 5</strong> covers similar results for infinite parameter spaces (so far results are for finite parameter spaces).</p>
<p><strong>Section 6</strong>:<br />
<img src="https://i.imgur.com/B9EDHSE.png" alt="" /></p>
<p><img src="https://i.imgur.com/TQjLvpT.png" alt="" /><br />
<img src="https://i.imgur.com/RRj8Mwi.png" alt="" /><br />
<img src="https://i.imgur.com/XVFp6DY.png" alt="" /></p>
<h1 id="notes"><a class="header-anchor" href="#notes">Notes</a></h1>
<div class="kdmath">$$
\newcommand{\bb}{\mathbb}
\newcommand{\mc}{\mathcal}
\newcommand{\d}{\delta}
\newcommand{\p}{\pi}
\newcommand{\t}{\theta}
\newcommand{\T}{\Theta}
\newcommand{\fa}{\forall}
\newcommand{\ex}{\exists}
\newcommand{\real}{\bb{R}}
\newcommand{\E}{\bb{E}}
\renewcommand{\D}[1]{\operatorname{d}\!{#1}}
\DeclareMathOperator*{\argmin}{argmin}
$$</div>
<p>Let $(\mc{X}, \mc{A}, P_\t)$ be a probability space for all $\t \in \T$.<br />
$\mc{X}$ is the sample space.<br />
$\T$ is the parameter space.<br />
$\mc{P} = \{P_\t : \t \in \T\}$ is the <em>model</em>, i.e. the set of all probability measures specified by the parameter space.</p>
<p>We wish to estimate some unknown $g(\t)$ which depends in a known way on $\t$. The text does not tell us what type $g(\t)$ is, and it does not matter for the discussion since it will always be hidden behind our loss function. The text uses $g(\T)$ (the image of $g$) to denote the space of all such $g$, but I find it less confusing and more direct to use $G = g(\T)$.</p>
<p>A <strong>loss function</strong> is a function $L : \T \times G \to \real^+$ which is always 0 for equivalent inputs, i.e.<br />
<span class="kdmath">$L(\t, g(\t)) = 0,\ \fa \t \in \T\,.$</span><br />
Note that $L(\t_1, g(\t_2))$ may be 0 when $\t_1 \neq \t_2$.</p>
<p>A <strong>non-randomized estimator</strong> for $g(\t)$ is a function $\d : \mc{X} \to G$ s.t. $x \mapsto L(\t, \d(x))$ is a measurable function (of $x$) for all $\t \in \T$. A <a href="https://en.wikipedia.org/wiki/Measurable_function">function is measurable</a> if the preimage of any measurable set is measurable, i.e. it preserves measurability. Concretely in this case, $\{x : L(\t, \d(x)) \in B\} \in \mc{A}$ for all $B \in \mc{B}(\real)$, where $\mc{A}$ is our event space (set of all subsets of $\mc{X}$ which can be measured by $P_\t$), and $\mc{B}(\real)$ is the <a href="https://mathworld.wolfram.com/BorelSet.html">Borel $\sigma$-algebra</a> over the reals, which is a standard definition of measurable sets of reals (unions and intersections of closed and open intervals are measurable). Presumably $\d$ is non-randomized because it only depends on the ground truth $x$.</p>
<p>The <strong>risk function</strong> of estimator $\d$ is the expected loss:<br />
<span class="kdmath">$R(\t, \d) = \E_{x \sim X}\left[L(\t, \d(x)) \mid \t\right] = \int_\mc{X} L(\t, \d(x))P_\t(x) \D{x}$</span></p>
<p>A <strong>randomized estimator</strong> is a function $\d : \mc{X} \times [0, 1] \to G$ s.t. $(x, u) \mapsto L(\t, \d(x, u))$ is a measurable function (of $x$ and $u$) for all $\t \in \T$. Just like a non-randomized estimator, except it receives noise from $U \sim \mathrm{uniform}([0, 1])$ as input. Non-randomized estimators are a special case (one that ignores the random input). Conversely, a randomized estimator can be viewed as a distribution over non-randomized estimators (which are parametrized by $u \in [0, 1]$).</p>
<p>The risk function then integrates over $u$:<br />
<span class="kdmath">$R(\t, \d) = \E_{x \sim X, u \sim U}\left[L(\t, \d(x, u)) \mid \t\right] = \int_0^1 \int_\mc{X} L(\t, \d(x, u))P_\t(x) \D{x} \D{u}$</span></p>
<p>An estimator $\d_1$ <strong>dominates</strong> another estimator $\d_2$ iff<br />
\begin{align}<br />
\fa \t \in \T,\ R(\t, \d_1) \leq R(\t, \d_2)\,, \\<br />
\ex \t \in \T,\ R(\t, \d_1) < R(\t, \d_2)\,.<br />
\end{align}<br />
$\d_1$ must be at least as good (same risk or less) as $\d_2$ in every situation, and must be strictly better (less risk) in at least one situation, for the descriptor <em>dominance</em> to apply.</p>
<p>An estimator $\d$ is <strong>admissible</strong> if it is not dominated by any estimator.<br />
Admissibility does not mean an estimator is any good; however, any inadmissible estimator can be automatically ruled out.</p>
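<p>Domination is easy to exhibit numerically. A minimal sketch (my own toy example, not from Hoff): for a single observation $x \sim N(\t, 1)$ under squared loss, $\d_1(x) = x$ has risk 1 for every $\t$ (just the variance), while $\d_2(x) = x + 1$ has risk 2 (variance plus squared bias), so $\d_1$ dominates $\d_2$ and $\d_2$ is inadmissible.</p>

```python
import random

# Estimating the mean t of one observation x ~ N(t, 1) under squared loss.
# delta1(x) = x has risk 1 for every t, while delta2(x) = x + 1 has risk 2
# for every t, so delta1 dominates delta2 and delta2 is inadmissible.

def risk(delta, t, n=100_000, seed=0):
    """Monte Carlo estimate of R(t, delta) = E_x[(delta(x) - t)^2]."""
    rng = random.Random(seed)
    return sum((delta(rng.gauss(t, 1)) - t) ** 2 for _ in range(n)) / n

for t in [-2.0, 0.0, 3.0]:
    assert risk(lambda x: x, t) < risk(lambda x: x + 1, t)
```

<p>Sharing the seed makes the comparison a paired one, so the strict inequality holds at every $\t$ on the grid, matching the closed-form risks.</p>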
<p>Let $\mc{D}$ be the set of all randomized estimators.<br />
A <strong>class</strong> (subset) of estimators $\mc{C} \subset \mc{D}$ is <strong>complete</strong> iff $\fa \d’ \in \mc{C}^c,\ \ex \d \in \mc{C}$ that dominates $\d’$.<br />
Here $(\cdot)^c$ is the complement operator, i.e. $\mc{C}^c = \{\d’ \in \mc{D} : \d’ \notin \mc{C}\}$.</p>
<p>Let $\p$ be a probability measure on $\T$ and $\d$ be an estimator (from here on it does not matter whether $\d$ is randomized, because everything below depends on $\d$ only through its risk function $R(\t, \d)$).</p>
<p>The <strong>Bayes risk</strong> of $\d$ w.r.t. $\p$ is</p>
<div class="kdmath">$$
R(\p, \d) = \E_{\p(\t)}[R(\t, \d)] = \int_\T R(\t, \d) \p(\t) \D{\t}\,.
$$</div>
<p>This is the expected risk w.r.t. $\p(\t)$, which is called our <strong>prior</strong>.</p>
<p>Bayes risk allows us to compare estimators by comparing numbers rather than functions, but now we have a new problem, which is that we have to choose a prior.</p>
<p>$\d$ is a <strong>Bayes estimator</strong> w.r.t. $\p$ iff</p>
<div class="kdmath">$$
R(\p, \d) \leq R(\p, \d'),\ \fa \d' \in \mc{D}\,.
$$</div>
<p>Note that a Bayes estimator $\d$ can be dominated if $\pi$ assigns measure 0 to some subsets of $\T$. It is easy to show that if $\d$ is dominated by $\d’$, then $\d’$ is also Bayes and $R(\p, \d) = R(\p, \d’)$.</p>
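<p>A minimal sketch of this caveat (my own toy numbers): with a two-point parameter space and a prior that puts measure zero on $\t = 1$, a constant estimator is Bayes yet dominated.</p>

```python
# Toy demonstration: a Bayes estimator dominated because the prior puts
# measure zero on part of the parameter space.  Theta = {0, 1}, X = {0, 1},
# with P_0(X=1) = 0 and P_1(X=1) = 0.8, under 0-1 loss on estimating theta.
P = {0: {0: 1.0, 1: 0.0},   # P_theta(x) for theta = 0
     1: {0: 0.2, 1: 0.8}}   # P_theta(x) for theta = 1
prior = {0: 1.0, 1: 0.0}    # all prior mass on theta = 0

def risk(delta, theta):
    # R(theta, delta) = E_x[ 1{delta(x) != theta} ]
    return sum(p for x, p in P[theta].items() if delta(x) != theta)

def bayes_risk(delta):
    return sum(w * risk(delta, t) for t, w in prior.items())

always_zero = lambda x: 0   # guess theta = 0 no matter what
copy_x      = lambda x: x   # guess theta = x

# Both have Bayes risk 0, so both are Bayes w.r.t. this prior -- yet
# copy_x dominates always_zero: equal risk at theta = 0, strictly
# smaller risk at theta = 1, which the prior never "sees".
```
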
<p><strong>Theorem 1</strong> (Bayes $\implies$ admissible): If prior $\pi(\theta)$ has exactly one Bayes estimator, then that estimator is admissible.</p>
<blockquote>
<p>Thus the only thing that can dominate a Bayes estimator is another Bayes estimator. If there is only one Bayes estimator for a given prior, then it must be admissible.</p>
</blockquote>
<p><strong>Question:</strong> Under what conditions is there more than one Bayes estimator for a given prior?</p>
<p><strong>Theorem 3</strong> (Bayes $\implies$ admissible):<br />
<img src="https://i.imgur.com/9H363wf.png" alt="" /></p>
<h2 id="complete-class-theorem-i"><a class="header-anchor" href="#complete-class-theorem-i">Complete class theorem I</a></h2>
<p>(admissible $\implies$ Bayes)</p>
<blockquote>
<p>If $\d$ is admissible and $\T$ is finite, then $\d$ is Bayes (w.r.t some prior distribution).</p>
</blockquote>
<h2 id="complete-class-theorem-ii"><a class="header-anchor" href="#complete-class-theorem-ii">Complete class theorem II</a></h2>
<p>Class of Bayes estimators is complete</p>
<blockquote>
<p>If $\T$ is finite and $\mc{S}$ is closed then the class of Bayes rules is complete and the admissible rules form a minimal complete class.</p>
</blockquote>
<h2 id="euclidean-parameter-spaces"><a class="header-anchor" href="#euclidean-parameter-spaces">Euclidean parameter spaces</a></h2>
<p>TODO: generalized Bayes estimator<br />
TODO: limiting Bayes estimator</p>
<p>Bayes $\implies$ Admissible<br />
<img src="https://i.imgur.com/BZJqBWn.png" alt="" /></p>
<p>Admissible $\implies$ Bayes<br />
<img src="https://i.imgur.com/CGYqfbF.png" alt="" /></p>
<h2 id="complete-class-theorem-iii"><a class="header-anchor" href="#complete-class-theorem-iii">Complete class theorem III</a></h2>
<p>Class of Bayes estimators is complete<br />
<img src="https://i.imgur.com/CFCCIMO.png" alt="" /></p>
<h1 id="interpretation-and-implications"><a class="header-anchor" href="#interpretation-and-implications">Interpretation and implications</a></h1>
<p><strong>Question:</strong> What is the connection between <a href="https://en.wikipedia.org/wiki/Bayes_estimator#Definition">Bayesian estimators</a> and Bayesian posteriors?</p>
<p>Answer: the Bayes estimator is the posterior mean for L2 loss and the posterior median for L1 loss. [credit: John Chung]</p>
<p><strong>Theorem</strong>:<br />
If $\p(\t)$ is a given prior, then a corresponding Bayes estimator $\d$ is</p>
<div class="kdmath">$$
\d(x) = \argmin_{\hat{\t}} \E_{\t \sim p_\p(\t \mid x)}\left[L(\t, \hat{\t})\right] = \argmin_{\hat{\t}} \int_{\T} L(\t, \hat{\t}) p_\pi(\t \mid x) \D{\t}\,,
$$</div>
<p>where the posterior is $p_\pi(\t \mid x) = P_\t(x)\pi(\t)/p_\p(x)$ and marginal data distribution is $p_\p(x) = \int P_\t(x)\pi(\t) \D{\t}$.<br />
In words, the Bayes estimator minimizes the posterior expected loss for every $x$.</p>
<p><em>Proof:</em><br />
<br />(This proof is my own)</p>
<div class="kdmath">$$
\begin{align}
\min_{\hat{\d}}R(\p, \hat{\d}) &= \min_{\hat{\d}} \int_\mc{X}\int_\T L(\t, \hat{\d}(x)) P_\t(x)\p(\t) \D{\t}\D{x} \\
&= \min_{\hat{\d}} \int_\mc{X}\left(\int_\T L(\t, \hat{\d}(x)) p_\pi(\t \mid x) \D{\t}\right) p_\p(x) \D{x} \\
&= \int_\mc{X}\left(\min_{\hat{\d}_x} \int_\T L(\t, \hat{\d}_x) p_\pi(\t \mid x) \D{\t}\right) p_\p(x) \D{x} \\
&= \E_{x \sim p_\p(x)}\left[\min_{\hat{\d}_x} \int_\T L(\t, \hat{\d}_x) p_\pi(\t \mid x) \D{\t}\right] \\
&= \E_{x \sim p_\p(x)}\left[\min_{\hat{\d}_x} \E_{\t \sim p_\p(\t \mid x)}\left[L(\t, \hat{\d}_x)\right] \right]\,.
\end{align}
$$</div>
<p>So the min Bayes risk is expected (w.r.t. data) minimum “posterior expected loss”.</p>
<p>Thus if we define $\d(x) := \d^*_x,\ \forall x \in \mc{X}$, where</p>
<div class="kdmath">$$
\d^*_x = \argmin_{\hat{\d}_x} \E_{\t \sim p_\p(\t \mid x)}\left[L(\t, \hat{\d}_x)\right]\,,
$$</div>
<p>then $\d = \argmin_{\hat{\d}} R(\p, \hat{\d})\,.$<br />
<em>QED</em></p>
<p>The general form</p>
<div class="kdmath">$$
b^* = \argmin_b \E_A \left[L(A, b)\right]
$$</div>
<p>is called the <em>systematic part</em> of random variable $A$. When $L$ is squared difference (i.e. $\ell^2$), then $b^*$ is the mean of $A$. When $L$ is absolute difference (i.e. $\ell^1$), then $b^*$ is the median of $A$. When $L$ is the indicator loss (i.e. $\ell^0$), then $b^*$ is the mode of $A$. There are also losses corresponding to other distribution statistics like quantile loss. See the definition of <em>systematic part</em> in my post on the <a href="http://zhat.io/articles/bias-variance#bias-variance-decomposition-for-any-loss">generalized bias-variance decomposition</a>.</p>
<p>$\d$ will be the mean, median, or mode of the posterior for $\ell^2$, $\ell^1$, $\ell^0$ losses respectively. To avoid confusion, here it is stated explicitly:</p>
<p>If $L(\t, \hat{\t}) = (\t - \hat{\t})^2$, then</p>
<div class="kdmath">$$
\d(x) = \mathrm{Mean}_{\t \sim p_\p(\t \mid x)}\left[\t\right] = \E_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,.
$$</div>
<p>If $L(\t, \hat{\t}) = \lvert\t - \hat{\t}\rvert$, then</p>
<div class="kdmath">$$
\d(x) = \mathrm{Median}_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,.
$$</div>
<p>If $L(\t, \hat{\t}) = \mathbb{1}[\t \neq \hat{\t}]$ (the indicator loss, which the $\ell^0$ notation $(\t - \hat{\t})^0$ abbreviates under the convention $x^0 := \mathbb{1}[x \neq 0]$), then</p>
<div class="kdmath">$$
\d(x) = \mathrm{Mode}_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,.
$$</div>
<p>If <span class="kdmath">$L(\t, \hat{\t}) = \begin{cases}\tau\cdot(\t - \hat{\t}) & \t - \hat{\t} \geq 0 \\ (\tau-1)\cdot(\t - \hat{\t}) & \mathrm{otherwise}\end{cases},$</span> then</p>
<div class="kdmath">$$
\d(x) = \mathrm{Quantile}\{\tau\}_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,,
$$</div>
<p>and $\tau=\frac{1}{2}$ gives the median.</p>
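<p>These claims are easy to verify numerically on a toy discrete posterior (numbers of my own choosing), by brute-force minimizing the posterior expected loss over a grid of candidate estimates:</p>

```python
# Brute-force check that posterior expected loss is minimized by the
# posterior mean under squared loss and by the posterior median under
# absolute loss, for a toy discrete posterior p(theta | x).
support = [0.0, 1.0, 2.0, 10.0]   # posterior support (toy numbers)
probs   = [0.1, 0.2, 0.5, 0.2]    # posterior probabilities

def expected_loss(est, loss):
    return sum(p * loss(t, est) for t, p in zip(support, probs))

def best_estimate(loss):
    grid = [i / 100 for i in range(0, 1101)]  # candidate estimates in [0, 11]
    return min(grid, key=lambda b: expected_loss(b, loss))

posterior_mean = sum(t * p for t, p in zip(support, probs))  # 3.2
l2_min = best_estimate(lambda t, b: (t - b) ** 2)  # -> posterior mean
l1_min = best_estimate(lambda t, b: abs(t - b))    # -> posterior median, 2.0
```

<p>Note how the outlier at $\t = 10$ drags the L2 minimizer (the mean) toward it, while the L1 minimizer (the median) stays at 2.</p>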
<h2 id="discussion"><a class="header-anchor" href="#discussion">Discussion</a></h2>
<p>Do the complete class theorems prove the necessity of Bayesian epistemology (assuming you wish to be rational)?</p>
<ol>
<li>Complete class theorems assume the data has a well defined probability distribution. If we use CCTs to justify Bayesian epistemology (i.e. the use of probability for outcomes which do not repeat, do not have a frequency of occurrence, or lack any well defined objective notion of probability) then this argument is circular. It depends on frequentist probability being a thing, and Bayesian probability is enticing over frequentist probability precisely because frequentist probability only makes sense in limited circumstances where events have well defined frequencies of occurrence.</li>
<li>Enforcing admissibility may be inconsequential. This framework is silent on how to define the hypothesis space and choose a prior, which matters quite a lot for one-shot prediction but not in the infinite-data limit. In practice we don’t care about the infinite-data limit, and picking the wrong hypothesis space or a bad prior may impact your utility much more than being admissible.</li>
<li>The result above shows that only the <em>systematic part</em> (e.g. mean) of the posterior matters for minimizing Bayes risk.</li>
</ol>
Thu, 11 Jun 2020 00:00:00 -0700
danabo.github.io/zhat/articles/notes-complete-class-theorems
Primer to Shannon's Information Theory<p>Shannon’s theory of information is usually just called <em>information theory</em>, but is it deserving of that title? Does Shannon’s theory completely capture every possible meaning of the word <em>information</em>? In the grand quests of creating AI and understanding the rules of the universe (i.e. a grand unified theory), information may be key. Intelligent agents search for information and manipulate it. Particle interactions in physics may be viewed as information transfer. The physics of information may also be key to interpreting quantum mechanics and resolving the measurement problem.</p>
<p>If you endeavor to answer these hard questions, it is prudent to understand existing so-called theories of information so you can evaluate whether they are powerful enough and to take inspiration from them.</p>
<p>Shannon’s information theory is a hard nut to crack. Hopefully this primer gets you far enough along to be able to read a textbook like <em>Elements of Information Theory</em>. At the end I start to explore the question of whether Shannon’s theory is a complete theory of information, and where it might be lacking.</p>
<p>This post is long. That is because Shannon’s information theory is a framework of thought. That framework has a vocabulary which is needed to appreciate the whole. I attempt to gradually build up this vocabulary, stopping along the way to build intuition. With this vocabulary in hand, you will be ready to explore the big questions at the end of this post.</p>
<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#self-information" id="markdown-toc-self-information">Self-Information</a> <ul>
<li><a href="#regarding-notation" id="markdown-toc-regarding-notation">Regarding notation</a></li>
<li><a href="#bits-not-bits" id="markdown-toc-bits-not-bits"><em>Bits</em>, not bits</a> <ul>
<li><a href="#recap" id="markdown-toc-recap">Recap</a></li>
</ul>
</li>
<li><a href="#stepping-back" id="markdown-toc-stepping-back">Stepping back</a></li>
</ul>
</li>
<li><a href="#entropy" id="markdown-toc-entropy">Entropy</a> <ul>
<li><a href="#regarding-notation-1" id="markdown-toc-regarding-notation-1">Regarding notation</a></li>
<li><a href="#conditional-entropy" id="markdown-toc-conditional-entropy">Conditional Entropy</a></li>
</ul>
</li>
<li><a href="#mutual-information" id="markdown-toc-mutual-information">Mutual Information</a> <ul>
<li><a href="#pointwise-mutual-information" id="markdown-toc-pointwise-mutual-information">Pointwise Mutual Information</a></li>
<li><a href="#properties-of-pmi" id="markdown-toc-properties-of-pmi">Properties of PMI</a> <ul>
<li><a href="#special-values" id="markdown-toc-special-values">Special Values</a></li>
</ul>
</li>
<li><a href="#expected-mutual-information" id="markdown-toc-expected-mutual-information">Expected Mutual Information</a> <ul>
<li><a href="#channel-capacity" id="markdown-toc-channel-capacity">Channel capacity</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#shannon-information-for-continuous-distributions" id="markdown-toc-shannon-information-for-continuous-distributions">Shannon Information For Continuous Distributions</a> <ul>
<li><a href="#proof-that-mi-is-fininte-for-continuous-distributions" id="markdown-toc-proof-that-mi-is-fininte-for-continuous-distributions">Proof that MI is finite for continuous distributions</a></li>
</ul>
</li>
<li><a href="#problems-with-shannon-information" id="markdown-toc-problems-with-shannon-information">Problems With Shannon Information</a> <ul>
<li><a href="#1-tv-static-problem" id="markdown-toc-1-tv-static-problem">1. TV Static Problem</a></li>
<li><a href="#2-shannon-information-is-blind-to-scrambling" id="markdown-toc-2-shannon-information-is-blind-to-scrambling">2. Shannon Information is Blind to Scrambling</a></li>
<li><a href="#3-deterministic-information" id="markdown-toc-3-deterministic-information">3. Deterministic information</a></li>
<li><a href="#4-if-the-universe-is-continuous-everything-contains-infinite-information" id="markdown-toc-4-if-the-universe-is-continuous-everything-contains-infinite-information">4. If the universe is continuous everything contains infinite information</a></li>
<li><a href="#5-shannon-information-ignores-the-meaning-of-messages" id="markdown-toc-5-shannon-information-ignores-the-meaning-of-messages">5. Shannon information ignores the meaning of messages</a></li>
<li><a href="#6-probability-distributions-are-not-objective" id="markdown-toc-6-probability-distributions-are-not-objective">6. Probability distributions are not objective</a></li>
</ul>
</li>
<li><a href="#appendix" id="markdown-toc-appendix">Appendix</a> <ul>
<li><a href="#properties-of-conditional-entropy" id="markdown-toc-properties-of-conditional-entropy">Properties of Conditional Entropy</a></li>
<li><a href="#bayes-rule" id="markdown-toc-bayes-rule">Bayes’ Rule</a></li>
<li><a href="#cross-entropy-and-kl-divergence" id="markdown-toc-cross-entropy-and-kl-divergence">Cross Entropy and KL-Divergence</a></li>
</ul>
</li>
<li><a href="#acknowledgments" id="markdown-toc-acknowledgments">Acknowledgments</a></li>
</ul>
<h1 id="self-information"><a class="header-anchor" href="#self-information">Self-Information</a></h1>
<div class="kdmath">$$
\newcommand{\and}{\wedge}
\newcommand{\or}{\vee}
\newcommand{\E}{\mathbb{E}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\bm}{\boldsymbol}
\newcommand{\rX}{\bm{X}}
\newcommand{\rY}{\bm{Y}}
\newcommand{\rZ}{\bm{Z}}
\newcommand{\rC}{\bm{C}}
\newcommand{\diff}[1]{\mathop{\mathrm{d}#1}}
\newcommand{\kl}[2]{K\left[#1\;\middle\|\;#2\right]}
$$</div>
<p>I’m going to use non-standard notation which I believe avoids some confusion and ambiguities.</p>
<p>Shannon defines information indirectly by defining the quantity of information contained in a message/event. This is analogous to how physics defines mass and energy in terms of their quantities.</p>
<p>Let’s define $x$ to be any mathematical object from a set of possibilities $X$. We typically call $x$ a <em>message</em>, but it can also be referred to as an <em>outcome</em>, <em>state</em>, or <em>event</em> depending on the context.</p>
<p>Define <span class="marginnote-outer"><span class="marginnote-ref">$h(x)$</span><label for="ad226219b7413f6c19c43a404b5a9d36708aa40e" class="margin-toggle"> ⊕</label><input type="checkbox" id="ad226219b7413f6c19c43a404b5a9d36708aa40e" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The standard notation is $I(x)$, but this is easy to confuse with mutual information <a href="#expected-mutual-information">below</a>.</span></span></span> to be the <strong>self-information</strong> of $x$, which is the amount of information gained by <span class="marginnote-outer"><span class="marginnote-ref">receiving</span><label for="5aecb422be00894a86508ac09f6366924a33ad33" class="margin-toggle"> ⊕</label><input type="checkbox" id="5aecb422be00894a86508ac09f6366924a33ad33" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Receiving here can mean, (1) sampling an outcome from a distribution, (2) storing in memory <em>one</em> of its possible states, or (3) viewing with the mind or knowing to be the case one out of the possible cases.</span></span></span> $x$. We will see how a natural definition of $h(x)$ arises from combining these two principles:</p>
<ol>
<li>Quantity of information is a function only of probability of occurrence.</li>
<li>Quantity of information acts like quantity of bits when applied to computer memory.</li>
</ol>
<p>Principle (1) constrains $h$ to the form $h(x) = f(p_X(x))$, and we do not yet know what $f$ should be.</p>
<p>To see why, let’s unpack (1): it implies that messages/events must always come from a distribution, which is what provides the probabilities. Say you receive a message $x$ sampled from probability distribution (function) $p_X : X \to [0, 1]$ over a <span class="marginnote-outer"><span class="marginnote-ref">discrete</span><label for="36d6168a1655cd352523cce0e3a2ac045fc1621d" class="margin-toggle"> ⊕</label><input type="checkbox" id="36d6168a1655cd352523cce0e3a2ac045fc1621d" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Assume all distributions are discrete until the <a href="#shannon-information-for-continuous-distributions">continuous section</a>.</span></span></span> set $X$. Then (1) is saying that $h$ should only <em>look at</em> the probability $p_X(x)$ and not $x$ itself. This is a reasonable requirement, since we want to define information irrespective of the kind of object that $x$ is.</p>
<p>Principle (2) constrains what $f$ should be: <span class="marginnote-outer"><span class="marginnote-ref">$f(p) = -\log_2 p$</span><label for="ff9302aebed53145e61c362eae63cb172e24e20a" class="margin-toggle"> ⊕</label><input type="checkbox" id="ff9302aebed53145e61c362eae63cb172e24e20a" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Though we assume a uniform discrete probability distribution to derive this, we will use this definition of $f$ to generalize the same logic to all probability distributions, which is how we arrive at the final definition of $h$.</span></span></span>, where $p \in [0, 1]$ is a probability value.</p>
<p>To understand (2), consider computer memory. With $N$ bits of memory there are $2^N$ distinguishable states, and only one is the case at one time. Increasing the number of bits exponentially increases the <span class="marginnote-outer"><span class="marginnote-ref">number of counterfactual states</span><label for="b6ce5dab64f8e17412d6f1eecdbc565ff4506d08" class="margin-toggle"> ⊕</label><input type="checkbox" id="b6ce5dab64f8e17412d6f1eecdbc565ff4506d08" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Number of states you could have stored but didn’t.</span></span></span>. In memory terms, receiving a “message” of $N$ bits of memory simply means finding out the state those bits are in. Attaching equal weight to each possibility (i.e. memory state) gives us a <span class="marginnote-outer"><span class="marginnote-ref">special case of the probability distribution we used above</span><label for="89a786f47bc58cb026e252161192577d84784488" class="margin-toggle"> ⊕</label><input type="checkbox" id="89a786f47bc58cb026e252161192577d84784488" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">To see the equivalence between these two notions of information, i.e. more rare equals more informative vs number of counterfactual states (or memory capacity), it is useful to think of the probability distribution as a weighted possibility space, and of the memory states as possibilities.</span></span></span> to define $h$: the <em>uniform</em> distribution, where there are $2^N$ possible states and the weight of a single state is $\frac{1}{2^N} = 2^{-N}$.</p>
<!--We intuitively think of the quantity of information stored in memory as the number of bits it has. We have $f(p)$ return $N$ when every possible state has an equal weight of $p=2^{-N}$, because we assume a uniform distribution over $2^N$ states, which is equivalent to how we conceive of computer memory with $N$ bits.-->
<p>Composing $f(p) = -\log_2 p$ with $h(x) = f(p_X(x))$ gives us the full definition of self-information:</p>
<div class="kdmath">$$
h(x) = -\log_2 p_X(x)\,.
$$</div>
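<p>As a quick sanity check (toy values of my own), the definition translates directly to code:</p>

```python
import math

def self_information(p):
    """h(x) = -log2 p_X(x): the self-information, in bits, of an outcome
    with probability p."""
    return -math.log2(p)

# A fair coin flip carries exactly 1 bit; two independent flips, 2 bits.
assert self_information(0.5) == 1.0
assert self_information(0.25) == 2.0
```

<p>Rarer outcomes carry more bits; a 1-in-52 card draw, for instance, gives $\log_2 52 \approx 5.7$ bits.</p>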
<!--*Now the magic happens.* Given that we defined self-information as $h(x) = f(p_X(x))$, and given that we've pinned down $f(p) = -\log_2 p$ for a special case, we've done all the work we need to do to define $h(x)$ for all probability distributions, because nothing in our definition of $f(p)$ actually depends on the particular distribution we used.-->
<h2 id="regarding-notation"><a class="header-anchor" href="#regarding-notation">Regarding notation</a></h2>
<p>From here on, I will use $h(x)$ as a function of message $x$, without specifying the type of $x$. It can be anything: a number, a binary sequence, a string, etc. $f(p)$ is a function of probabilities, rather than messages. So:</p>
<p style="text-align: center;">$h : X \to \R^+$ maps from messages to information,<br />
and $f : [0, 1] \to \R^+$ maps from probabilities to information;</p>
<p>and keep in mind that $h(x) = f(p_X(x))$, so <span class="marginnote-outer"><span class="marginnote-ref">$h$ implicitly assumes</span><label for="ac81d19ce662d5360601c37e5741410b398a15da" class="margin-toggle"> ⊕</label><input type="checkbox" id="ac81d19ce662d5360601c37e5741410b398a15da" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">I may sometimes write $h_X$ to make explicit the dependency of $h$ on $p_X$.</span></span></span> we have a probability distribution over $x$ defined somewhere.</p>
<p>In some places below <span class="marginnote-outer"><span class="marginnote-ref">I’ve written equations in terms of $f$ rather than $h$</span><label for="407f53214d4b89bb8df9485bfcc2f05a23b9bf92" class="margin-toggle"> ⊕</label><input type="checkbox" id="407f53214d4b89bb8df9485bfcc2f05a23b9bf92" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Allow me the slight verbosity now, as you’d probably have had to pore over verbose definitions if I hadn’t.</span></span></span> where I felt it would allow you to grasp things just by looking at the shape of the equation.</p>
<h2 id="bits-not-bits"><a class="header-anchor" href="#bits-not-bits"><em>Bits</em>, not bits</a></h2>
<p>You can get through the above exposition by thinking in terms of computer bits. Now we part ways from the computer-bits intuition. The departure occurs when $p_X(x)$ is not a (negative) integer power of two: $h(x)$ will be non-integer, and very likely irrational. What does it mean to have a fraction of a bit? From here on out, it’s better to think of <em>bits</em> as a unit quantifying information, like the <em>Joule</em> for energy or the <em>kilogram</em> for mass, rather than a count of <span class="marginnote-outer"><span class="marginnote-ref">physical objects</span><label for="9e6ffa3d72e725743de4a6c7caafdd2568ea4337" class="margin-toggle"> ⊕</label><input type="checkbox" id="9e6ffa3d72e725743de4a6c7caafdd2568ea4337" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Specifically a physical medium that stores two distinguishable states, usually labeled “0” and “1”.</span></span></span>. We will continue to call the unit of $h(x)$ a <em>bit</em> out of convention. Like the kilogram and the Joule, this unit can be regarded as undefined in absolute terms, but its usage gives it semantic meaning.</p>
<p>So then how is $h$ to be understood? What is the intuition behind this quantity? In short, Shannon bits are an <a href="https://en.wikipedia.org/wiki/Analytic_continuation">analytic continuation</a> of computer bits. Just like how the <a href="https://en.wikipedia.org/wiki/Gamma_function">gamma function</a> extends factorial to continuous values, Shannon bits extend the computer bit to <strong>non-uniform distributions</strong> over a <strong>non-power-of-2</strong> number of counterfactuals. Let me explain these two phrases:</p>
<ul>
<li><strong>non-power-of-2</strong>: We have memory that can store one out of $M$ possibilities, where $M \neq 2^N$. For example, I draw a card from a deck of 52. That card holds $-\log_2 \frac{1}{52} = \log_2 52 \approx 5.70044\ldots$ bits of information. A fractional bit count can represent a non-power-of-2 possibility space, and quantifies the log-base conversion factor into base $M$. In this case $-\log_{52} x = -\frac{\log_2 x}{\log_2 52}$. Note that it is common to use units of information in bases other than 2. For example, a <a href="https://en.wikipedia.org/wiki/Nat_(unit)"><em>nat</em></a> is log-base-$e$, a <a href="https://en.wikipedia.org/wiki/Ternary_numeral_system"><em>trit</em></a> is base-3, and a <a href="https://en.wikipedia.org/wiki/Hartley_(unit)"><em>dit</em> or <em>ban</em></a> is base-10.</li>
<li><strong>non-uniform distributions</strong>: Using the deck-of-cards example, let’s say we draw from a sub-deck containing all cards with the hearts suit. We’ve reduced the possibility space to a subset of a super-space, in this case of size 13, and have reduced the information contained in a given card to $-\log_2 \frac{1}{13} \approx 3.70044\ldots$ bits. You can think of this as assigning a weight to each card: 0 for cards we exclude, and $\frac{1}{13}$ for cards we include. If we make the non-zero weights non-uniform, we now have an interpretational issue: what is the physical meaning of these weights? Thinking of each weight as a probability of occurrence is one way to recover physical meaning, but this is <span class="marginnote-outer"><span class="marginnote-ref">not a requirement</span><label for="02ee609808a0f8d62fa190ce247f9b35f70dd990" class="margin-toggle"> ⊕</label><input type="checkbox" id="02ee609808a0f8d62fa190ce247f9b35f70dd990" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">And probability may not even be an objective property of physical systems in general.</span></span></span>. However, I will <span class="marginnote-outer"><span class="marginnote-ref">call these weights probabilities</span><label for="03b6daf7c965c2773f771111cb006d8764629617" class="margin-toggle"> ⊕</label><input type="checkbox" id="03b6daf7c965c2773f771111cb006d8764629617" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The reason we wish to hold the sum of weights fixed at 1 is so that we can consider the information contained in compound events, which are sets of elementary events. In other words, think of the card drawn from the sub-deck of 13 as a card from <em>any suit</em>, i.e. the set of 4 cards with the same number. The card represents an equivalence class over card number.</span></span></span>, and call the weighted possibility spaces distributions, as that is the convention. But keep in mind that these weights do not necessarily represent frequencies of occurrence or uncertainties. The meaning of probability itself is a subject of debate.</li>
</ul>
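The card arithmetic above can be checked with a few lines of Python. A minimal sketch; the function name <code>self_information</code> is mine, not a standard library routine:

```python
import math

# Self-information h = -log_base(p) of an event with probability p.
# (The name `self_information` is mine; this is not a library function.)
def self_information(p, base=2):
    return -math.log(p) / math.log(base)

# A card drawn uniformly from a 52-card deck:
print(self_information(1 / 52))               # ~5.70044 bits
# The same quantity in other conventional units:
print(self_information(1 / 52, base=math.e))  # ~3.95124 nats
print(self_information(1 / 52, base=10))      # ~1.71600 dits
# Restricting to the 13 hearts reduces the information per card:
print(self_information(1 / 13))               # ~3.70044 bits
```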
<p>Let’s examine some of the properties of $h$ to build further intuition.</p>
<p>First notice that $f(1) = 0$. An event with a probability of 1 contains no information. If $x$ is certain to occur, $x$ is uninformative. Likewise, $f(p) \to \infty$ as $p \to 0$. If $x$ is impossible, it contains infinite information! In general, $h(x)$ goes up as $p_X(x)$ goes down. The less likely an event, the more information it contains. Hopefully this sounds to you like a reasonable property of information.</p>
<p>Next, we can be more specific about how $h$ goes up as $p_X$ goes down. Recall that $f(p) = -\log_2 p$ and $h(x) = f(p_X(x))$, then</p>
<div class="kdmath">$$
f(p/2) = f(p) + 1\,.
$$</div>
<p>If we halve the probability of an event, we add one bit of information to it. That is a nice way to think about our new unit of information: the <em>bit</em> is a halving of probability. Other units can be defined in the same way, e.g. the <em>nat</em> is a division of probability by Euler’s number $e$, the <em>trit</em> is a thirding of probability, etc.</p>
<p>Finally, notice that $f(pq) = f(p) + f(q)$. Or to write it another way: $h(x \and y) = h(x) + h(y)$ iff $x$ and $y$ are independent events, because</p>
<div class="kdmath">$$
\begin{align}
h(x \and y) &= -\log_2 p_{X,Y}(x \and y) \\
&= -\log_2 \left(p_X(x)\cdot p_Y(y)\right) \\
&= -\log_2 p_X(x) - \log_2 p_Y(y)\,,
\end{align}
$$</div>
<p>where $x \and y$ indicates the composite event <span class="marginnote-outer"><span class="marginnote-ref">“$x$ and $y$”</span><label for="03c34b59dc7fea1cd51e8dbb51bdcfc9754145fd" class="margin-toggle"> ⊕</label><input type="checkbox" id="03c34b59dc7fea1cd51e8dbb51bdcfc9754145fd" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We could either think of $x$ and $y$ as composite events themselves from the same distribution, i.e. $x$ and $y$ are sets of <a href="https://en.wikipedia.org/wiki/Elementary_event">elementary events</a>, or as elementary events from two different random variables which have a joint distribution, i.e, $(x, y) \sim (\rX, \rY)$. I will consider the latter case from here on out, because it is conceptually simpler.</span></span></span>. Hopefully this is also intuitive. If two events are dependent, i.e. they causally affect each other, it makes sense that they might contain redundant information, meaning that you can predict part of one from the other, and so their combined information is less than the sum of their individual information. You may be surprised to learn that the opposite can also be true. The combined information of two events can be greater than the sum of their individual information! This is called <a href="https://en.wikipedia.org/wiki/Interaction_information#Example_of_negative_interaction_information"><em>synergy</em></a>. More on that in the <a href="#pointwise-mutual-information">pointwise mutual information</a> section.</p>
<p>In short, we can derive $f(p) = -\log_2 p$ from (1) additivity of information, $f(pq) = f(p) + f(q)$, and (2) a choice of unit, $f(½) = 1$. <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)#Rationale">Proof</a>.</p>
<h3 id="recap"><a class="header-anchor" href="#recap">Recap</a></h3>
<p>To make the full analogy: a weighting over possibilities is like a continuous relaxation of a set. An element either is or is not in a set, while adding weights to elements (in a larger set) allows their membership to have degrees, i.e. the <em>“is element”</em> relation becomes a <span class="marginnote-outer"><span class="marginnote-ref">fuzzy value between 0 and 1</span><label for="9c3ecf3b9785719c56a55d9ff267adf0112b2c75" class="margin-toggle"> ⊕</label><input type="checkbox" id="9c3ecf3b9785719c56a55d9ff267adf0112b2c75" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We recover regular sets by setting all weights to either 0 or uniform non-zero weights.</span></span></span>. With a weighted possibility space we have a lot more freedom to work with extra information beyond merely which possibilities are in the set. Probability distributions are more expressive than mere sets.</p>
<h2 id="stepping-back"><a class="header-anchor" href="#stepping-back">Stepping back</a></h2>
<p>The unit <em>bit</em> that we’ve defined is connected to computer bits only because they both convert multiplication to addition.</p>
<ul>
<li>Computer bits: $(2^N\cdot2^M)$ states $\Longrightarrow$ $(N+M)$ bits.</li>
<li>Shannon bits: $(p\cdot q)$ probability $\Longrightarrow$ $(-\log_2 p - \log_2 q)$ bits.</li>
</ul>
<p>The way I’ve motivated $h$ is a departure from Shannon’s original motivation for defining self-information, which was to describe the theoretically optimal lossless compression for messages sent over a communication channel. Under this viewpoint, $h(x)$ quantifies the theoretical minimum length (in physical bits) needed to encode message $x$ in computer memory without loss of information. More precisely, $h(x)$ should be thought of as the asymptotic average bit-length for the optimal encoding of $x$ in an infinite sequence of messages drawn from $p_X$, which is why it makes sense for $h(x)$ to be a continuous value. For more details, see <a href="https://en.wikipedia.org/wiki/Arithmetic_coding#Connections_with_other_compression_methods">arithmetic coding</a>.</p>
<p>We are now flipping Shannon’s original motivation on its head, and using the theoretically optimal encoding length in bits as the definition of information content. In the following discussion, we don’t care how messages/events are actually represented physically. Our definition of information only cares about probability of occurrence, and is in fact <span class="marginnote-outer"><span class="marginnote-ref">blind to the contents of messages</span><label for="e4d483f3b81edaec6cd9e4f482319dd7556b1855" class="margin-toggle"> ⊕</label><input type="checkbox" id="e4d483f3b81edaec6cd9e4f482319dd7556b1855" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Something that could be seen as either a flaw or a virtue, which I discuss <a href="#5-shannon-information-ignores-the-meaning-of-messages">below</a>.</span></span></span>. The connection of probability to optimal physical encoding is one of the beautiful results that propelled Shannon’s framework into its lofty position as <em>information theory</em>. However, for our purposes, we simply care about defining quantity of information, and do not care at all about how best to compress or store data for practical purposes.</p>
<p>To be clear, when I talk about the self-information of a message, I am not saying anything about how the message is physically encoded or transmitted, and indeed it need not be encoded with an optimal number of computer bits. I am merely referring to a <span class="marginnote-outer"><span class="marginnote-ref">quantified</span><label for="a3caf3737e12ebd497b9d5b99c399b7cd6e84ec9" class="margin-toggle"> ⊕</label><input type="checkbox" id="a3caf3737e12ebd497b9d5b99c399b7cd6e84ec9" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Hopefully this quantity is objective and measurable in principle - something I discuss <a href="#6-probability-distributions-are-not-objective">below</a></span></span></span> property of the message, i.e. its information content. The number of computer bits a message is encoded with need not equal the <span class="marginnote-outer"><span class="marginnote-ref">number of Shannon bits it contains!</span><label for="a6072afeef1e2f90ee72e61877bd97c2b4450eeb" class="margin-toggle"> ⊕</label><input type="checkbox" id="a6072afeef1e2f90ee72e61877bd97c2b4450eeb" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In short, physical encoding length and probability of occurrence need not be linked.</span></span></span></p>
<h1 id="entropy"><a class="header-anchor" href="#entropy">Entropy</a></h1>
<p>In the last section I said that under the view of optimal lossless compression, $h(x)$ is the bit length of the optimal encoding for $x$ averaged over an infinite sample from random variable $\rX$, and <a href="https://en.wikipedia.org/wiki/Arithmetic_coding#Connections_with_other_compression_methods">arithmetic coding</a> can approach this limit. We could also consider the average bit length per message from $\rX$ (averaged across all messages). That is the <strong>entropy</strong> of random variable $\rX$, which is the expected self-information,</p>
<div class="kdmath">$$
\begin{align}
H[\rX] &= \E_{x\sim \rX}[h(x)] \\
&= \E_{x\sim \rX}[-\log_2\,p_X(x)]\,.
\end{align}
$$</div>
<p>In the quantifying information view, think of entropy $H[\rX]$ as the number of bits you expect to gain by observing an event sampled from $p_X(x)$. In that sense it is a measure of uncertainty, i.e. how much information I do not have, i.e. quantifying what is unknown.</p>
<p>Let’s build our intuition of entropy. A good way to view entropy is as a measure of how spread out a distribution is. Entropy is actually a type of <a href="https://en.wikipedia.org/wiki/Statistical_dispersion">statistical dispersion</a> of $p_X$, meaning you could use it as an <a href="http://zhat.io/articles/19/bias-variance#what-is-variance-anyway">alternative to statistical variance</a>.</p>
<figure><img src="/assets/posts/primer-shannon-information/bimodal.png" alt="" width="100%" /><figcaption></figcaption></figure>
<p>For example, a bi-modal distribution can have arbitrarily high variance by moving the modes far apart, but the overall spread-out-ness (entropy) will not necessarily change.</p>
<p>The more spread out a distribution is, the higher its entropy. For bounded <a href="https://en.wikipedia.org/wiki/Support_(mathematics)#Support_of_a_distribution">support</a>, the uniform distribution has highest entropy (<a href="https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Other_examples.">other max-entropy distributions</a>). The <span class="marginnote-outer"><span class="marginnote-ref">minimum possible entropy is 0</span><label for="61d3d043a10d74a21176f1255203a29c26b0f6f4" class="margin-toggle"> ⊕</label><input type="checkbox" id="61d3d043a10d74a21176f1255203a29c26b0f6f4" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Note that in the expectation, 0-probability outcomes have infinite self-information, so we have to use the convention that $p_X(x)\cdot h(x) = 0\cdot\infty = 0$.</span></span></span>, which indicates a deterministic distribution, i.e. $p_X(x) \in \{0, 1\}$ for all $x \in X$.</p>
<figure><img src="https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg" alt="Credit: <a href="https://en.wikipedia.org/wiki/Binary_entropy_function">https://en.wikipedia.org/wiki/Binary_entropy_function</a>" width="100%" /><figcaption>Credit: <a href="https://en.wikipedia.org/wiki/Binary_entropy_function">https://en.wikipedia.org/wiki/Binary_entropy_function</a></figcaption></figure>
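As a quick numeric illustration of these bounds, here is a minimal Python sketch of entropy, using the convention that terms with $p = 0$ contribute nothing:

```python
import math

# Entropy in bits: H = -sum p*log2(p), skipping p = 0 terms
# (the 0 * log2(0) = 0 convention).
def entropy(dist):
    return sum(-p * math.log2(p) for p in dist if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0  (uniform: maximum for 4 outcomes)
print(entropy([0.5, 0.25, 0.25]))         # 1.5  (less spread out)
print(entropy([1.0, 0.0, 0.0]))           # 0.0  (deterministic: minimum)
```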
<p>Though Shannon called his new quantity entropy, the connection to physical entropy is nontrivial, and the shared name is partly a historical accident. Apparently Shannon’s decision to call it entropy came from a suggestion by von Neumann at a party: <a href="http://www.eoht.info/page/Neumann-Shannon+anecdote">http://www.eoht.info/page/Neumann-Shannon+anecdote</a><br />
[credit: Mark Moon]</p>
<p>There are connections between information entropy and thermodynamic entropy (see <a href="https://plato.stanford.edu/entries/information-entropy/">https://plato.stanford.edu/entries/information-entropy/</a>), but I do not yet understand them well enough to give an overview here - perhaps in a future post. Some physicists consider information to have a physical nature, and even a <span class="marginnote-outer"><span class="marginnote-ref">conservation law</span><label for="4cd9a9099f83eeb0f5a534d111b0875861d2c3ec" class="margin-toggle"> ⊕</label><input type="checkbox" id="4cd9a9099f83eeb0f5a534d111b0875861d2c3ec" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In the sense that requiring physics be time-symmetric is equivalent to requiring information to be conserved.</span></span></span>! Further reading: <a href="https://theoreticalminimum.com/courses/statistical-mechanics/2013/spring/lecture-1">The Theoretical Minimum - Entropy and conservation of information</a>, <a href="https://en.wikipedia.org/wiki/No-hiding_theorem">no-hiding theorem</a>.</p>
<p><strong>Question</strong>: Why expected self-information?<br />
We could have used median or something else. Expectation is a <span class="marginnote-outer"><span class="marginnote-ref">default go-to operation over distributions</span><label for="e3be61e72d21ae86483befb994084998b611b8df" class="margin-toggle"> ⊕</label><input type="checkbox" id="e3be61e72d21ae86483befb994084998b611b8df" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">See my previous post: <a href="http://zhat.io/articles/bias-variance#bias-variance-decomposition-for-any-loss">http://zhat.io/articles/bias-variance#bias-variance-decomposition-for-any-loss</a></span></span></span> because of its nice properties, but ultimately it is an arbitrary choice. However, as we will see, one huge benefit in our case is that expectation is linear.</p>
<h3 id="regarding-notation-1"><a class="header-anchor" href="#regarding-notation-1">Regarding notation</a></h3>
<p>From here on out, I will drop the subscript $X$ from $p_X(x)$ when $p(x)$ unambiguously refers to the probability of $x$. This is a common thing to do, but it can also lead to ambiguity if I want to write $p(0)$, the probability that $x$ is 0. A possible resolution is to use random variable notation, $p(\rX = 0)$, which I use in some places. However, there is the same issue for self-information. For example, quantities $h(x), h(y), h(x\and y), h(y \mid x)$. I will add subscripts to $h$ when it would be ambiguous otherwise, for example $h_X(0), h_Y(0), h_{X,Y}(x\and y), h_{Y\mid X}(0 \mid 0)$ .</p>
<h2 id="conditional-entropy"><a class="header-anchor" href="#conditional-entropy">Conditional Entropy</a></h2>
<p>Conditional self-information, defined as</p>
<div class="kdmath">$$
\begin{align}
h(y \mid x) &= -\log_2\,p(y \mid x)\\
&= -\log_2(p(y \and x) / p(x)) \\
&= h(x \and y) - h(x)\,,
\end{align}
$$</div>
<p>is the information you stand to gain by observing $y$ given that you already observed $x$. I let $x \and y$ denote the observation of $x$ and $y$ together (I could write $(x, y)$, but then $p((y, x))$ would look awkward).</p>
<p>If $x$ and $y$ are independent events, $h(y \mid x) = h(y)$. Otherwise, $h(y \mid x)$ can be greater or less than $h(y)$. It may seem counterintuitive that $h(y \mid x) > h(y)$ can happen, because this implies you gain more from $y$ by just simply knowing something else, $x$. However, this reflects the fact that you are unlikely to see $x, y$ together. Likewise, if $h(y \mid x) < h(y)$ you are likely to see $x, y$ together. More on this in the next section.</p>
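Both directions can be seen numerically. A sketch in Python, using a small hypothetical joint distribution over two binary variables:

```python
import math

# Hypothetical joint distribution over (x, y), x, y in {0, 1}:
joint = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}  # marginal of X
p_y = {0: 1/2, 1: 1/2}  # marginal of Y

def h_y(y):
    return -math.log2(p_y[y])  # h(y)

def h_y_given_x(y, x):
    return -math.log2(joint[(x, y)] / p_x[x])  # h(y|x) = -log2 p(y|x)

# x = 0 and y = 0 rarely co-occur, so y = 0 is *more* informative given x = 0:
print(h_y(0), h_y_given_x(0, 0))  # 1.0 vs 3.0
# x = 0 and y = 1 often co-occur, so y = 1 is *less* informative given x = 0:
print(h_y(1), h_y_given_x(1, 0))  # 1.0 vs ~0.19265
```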
<p>Confusingly, conditional entropy can refer to two different things.</p>
<p>First is expected conditional self-information,</p>
<div class="kdmath">$$
\begin{align}
H[\rY \mid \rX = x] &= \E_{y\sim \rY \mid \rX=x}[h(y \mid x)] \\
&= \E_{y\sim \rY \mid \rX=x}[\log_2\left(\frac{p(x)}{p(x, y)}\right)] \\
&= \sum\limits_{y \in Y} p(y \mid x) \log_2\left(\frac{p(x)}{p(x, y)}\right)\,.
\end{align}
$$</div>
<p>The other is what is most often referred to as <strong>conditional entropy</strong>,</p>
<div class="kdmath">$$
\begin{align}
H[\rY \mid \rX] &= \E_{x,y \sim \rX,\rY}[h(y \mid x)] \\
&= \E_{x,y \sim \rX,\rY}[\log_2\left(\frac{p(x)}{p(x, y)}\right)] \\
&= \E_{x\sim \rX} H[\rY \mid \rX = x]\,.
\end{align}
$$</div>
<p>The intuition behind $H[\rY \mid \rX = x]$ will be the same as of entropy, $H[\rY]$, which we covered in the last section. Let’s gain some intuition for $H[\rY \mid \rX]$. If $H[\rY]$ measures uncertainty of $\rY$, then $H[\rY \mid \rX = x]$ measures conditional uncertainty given $x$, and $H[\rY \mid \rX]$ measures average conditional uncertainty w.r.t. $\rX$.</p>
<p>The maximum value of $H[\rY \mid \rX]$ is $H[\rY]$, which is achieved when $\rX$ and $\rY$ are independent random variables. This should make sense, as receiving a message from $\rX$ does not tell you anything about $\rY$, so your state of uncertainty does not decrease.</p>
<p>The minimum value of $H[\rY \mid \rX]$ is 0, which is achieved when the conditional distribution $p(\rY \mid \rX = x)$ is deterministic for all $x$. In other words, you can define a function $g : X \rightarrow Y$ mapping each $x$ to its determined $y$, which you could not do if $\rY \mid \rX$ were stochastic.</p>
<p>$H[\rY \mid \rX]$ is useful because it takes all $x \in X$ into consideration. You might have, for example, $H[\rY \mid \rX = x_1] = 0$ for $x_1$, but $H[\rY \mid \rX] > 0$, which means $y$ cannot always be deterministically decided from $x$. In the section on mutual information we will see how to think of $H[\rY \mid \rX]$ as a property of a stochastic function from $X$ to $Y$.</p>
<p>Because of linearity of expectation, all identities that hold for self-information hold for their entropy counterparts. For example,</p>
<div class="kdmath">$$
\begin{align}
h(y \mid x) &= h(x \and y) - h(x) \\
\Longrightarrow H[\rY \mid \rX] &= H[(\rX, \rY)] - H[\rX]\,.
\end{align}
$$</div>
<p>This is a nice result. This equation says that the <span class="marginnote-outer"><span class="marginnote-ref">average uncertainty about $\rY$ given $\rX$</span><label for="40b324ca64f32166d198598020cd16fd3a369058" class="margin-toggle"> ⊕</label><input type="checkbox" id="40b324ca64f32166d198598020cd16fd3a369058" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Amount of information left to observe in $\rY$ on average.</span></span></span> equals the total expected information in their joint distribution, $(\rX, \rY)$, minus the average information in $\rX$. In other words, conditional entropy is the total information in $x \and y$ minus information in what you have, $x$, all averaged over all the possible $(x, y)$ you can have.</p>
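This identity is easy to verify numerically. A minimal Python sketch, using a small hypothetical joint distribution:

```python
import math

def h(p):
    """Self-information in bits of an event with probability p."""
    return -math.log2(p)

# Hypothetical joint distribution over (x, y), x, y in {0, 1}:
joint = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}  # marginal of X

H_joint = sum(p * h(p) for p in joint.values())  # H[(X, Y)]
H_X = sum(p * h(p) for p in p_x.values())        # H[X]
# H[Y|X] computed directly as E[h(y|x)] = E[h(x and y) - h(x)]:
H_Y_given_X = sum(p * (h(p) - h(p_x[x])) for (x, _), p in joint.items())

# The identity H[Y|X] = H[(X, Y)] - H[X]:
print(abs(H_Y_given_X - (H_joint - H_X)) < 1e-12)  # True
```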
<h1 id="mutual-information"><a class="header-anchor" href="#mutual-information">Mutual Information</a></h1>
<p>In my view, mutual information is what holds promise as a definition of information. This is the most important topic to understand for tackling the <a href="#problems-with-shannon-information">problems with Shannon information</a> section below.</p>
<h2 id="pointwise-mutual-information"><a class="header-anchor" href="#pointwise-mutual-information">Pointwise Mutual Information</a></h2>
<!-- Intuitively, if two events are causally connected, i.e. dependent, they contain redundant information combined. meaning that their combined information would be less than the sum of their information. It may also be the case that their combined information could be greater than the sum of their information! This is called *synergy*. We will see examples of this later. -->
<p>When two events $x$ and $y$ are dependent, how do we compute their total information? Previously we said that $h(x \and y) = h(x) + h(y)$ iff $p_{X,Y}(x \and y) = p_X(x)\,p_Y(y)$. However, the general case is,</p>
<div class="kdmath">$$
h(x \and y) = h(x) + h(y) - i(x, y)\,,
$$</div>
<p>where I am defining $i(x, y)$ such that this equation holds. Rearranging we get</p>
<div class="kdmath">$$
\begin{align}
i(x, y) &= h(x) + h(y) - h(x \and y) \\
&= -\log_2(p_X(x)) - \log_2(p_Y(y)) + \log_2(p_{X,Y}(x \and y)) \\
&= \log_2\left(\frac{p_{X,Y}(x, y)}{p_X(x)\,p_Y(y)}\right)\,.
\end{align}
$$</div>
<p>$i(x, y)$ is called <em>pointwise mutual information</em> (PMI). Informally, PMI measures the amount of bits shared by two events. To say that another way, it measures how much information I have about one event given I only observe the other. Notice that PMI is symmetric, $i(x, y) = i(y, x)$, so any two events contain the same information about each other.</p>
<p>$i(x, y)$ is a difference in information. Positive $i(x, y)$ indicates <em>redundancy</em>, i.e. total information is <span class="marginnote-outer"><span class="marginnote-ref">less than the sum of the parts</span><label for="03c34736803c8e2efd1383baf17104684eeab01a" class="margin-toggle"> ⊕</label><input type="checkbox" id="03c34736803c8e2efd1383baf17104684eeab01a" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">If you object that it doesn’t make sense to lose information by observing $x$ and $y$ together over observing them separately, it is important to note that $h(x) + h(y)$ is not a physically meaningful quantity, unless they are independent. Technically, you would have $h(x) + h(y \mid x)$ in total. $h(x)$ and $h(y)$ are both the amounts of information to gain by observing either $x$ or $y$ <strong>first</strong>.</span></span></span>: $h(x \and y) < h(x) + h(y)$. However, it may also be the case that $i(x, y)$ is negative so that $h(x \and y) > h(x) + h(y)$. <span class="marginnote-outer"><span class="marginnote-ref">This is called <a href="https://en.wikipedia.org/wiki/Synergy#Information_theory"><em>synergy</em></a>.</span><label for="4ce5fcaeebe9d13d3fd8d49c2de4ba62a623e205" class="margin-toggle"> ⊕</label><input type="checkbox" id="4ce5fcaeebe9d13d3fd8d49c2de4ba62a623e205" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The word <em>synergy</em> is conventionally used in the context of expected mutual information, and I am running the risk of conflating two distinct phenomena under the same word. There is no synergy among two random variables under expected mutual information, and this type of synergy only appears among 3 or more random variables. See <a href="https://en.wikipedia.org/wiki/Multivariate_mutual_information#Synergy_and_redundancy">https://en.wikipedia.org/wiki/Multivariate_mutual_information#Synergy_and_redundancy</a>.</span></span></span></p>
<p>This is highly speculative, but synergy (either the pointwise-MI or expected-MI kind) may be a fundamental insight that could explain <span class="marginnote-outer"><span class="marginnote-ref">emergence</span><label for="c55207622e4799bd3610c39427659c03f63135f5" class="margin-toggle"> ⊕</label><input type="checkbox" id="c55207622e4799bd3610c39427659c03f63135f5" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Emergence is a concept in philosophy. See <a href="https://en.wikipedia.org/wiki/Emergence">https://en.wikipedia.org/wiki/Emergence</a> and <a href="https://plato.stanford.edu/entries/properties-emergent/">https://plato.stanford.edu/entries/properties-emergent/</a></span></span></span> and possible limitations of reductionism in illuminating reality. See <a href="https://www.scottaaronson.com/blog/?p=3294">Higher-level causation exists (but I wish it didn’t)</a>.</p>
<!--
Let's look at (an admittedly contrived) example of synergy. Suppose our [sample space](https://en.wikipedia.org/wiki/Sample_space) is $\\{a, b, c\\}$ (composed of [coutcomes](https://en.wikipedia.org/wiki/Sample_space#Conditions_of_a_sample_space) or [elementary events](https://en.wikipedia.org/wiki/Elementary_event)), and we have two events $x = \\{a, b\\}$ and $y = \\{b, c\\}$. $x, y$ co-occur if we draw outcome $b$. If $p(a) = 7/16, p(b) = 1/8, p(c) = 7/16$, then $p(x) = 9/16$, $p(y) = 9/16$, $p(x \and y) = 1/8$. $h(x) = h(y) \approx 0.83$ and $h(x \and y) = 3$, so $i(x, y) = 2\cdot0.83 - 3 \approx -1.34$ bits.
That may seem like a contrived example, because I was working with composite events instead of elementary events. The same phenomenon can happen for joint distributions of sample spaces. <font color="red">TODO: explain the difference between the example above and a joint distribution.</font>
-->
<p><strong>Example:</strong><br />
Let $X = \{0, 1\}$ and $Y = \{0, 1\}$, then the joint sample space is the cartesian product $X \times Y$. $p_X(x), p_Y(y)$ denote marginal probabilities, and $p_{X,Y}(x, y)$ is their joint probability. The joint probability table:</p>
<table>
<thead>
<tr>
<th style="text-align: center">$x$</th>
<th style="text-align: center">$y$</th>
<th style="text-align: center">$p_{X,Y}(x, y)$</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1/16</td>
</tr>
<tr>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">7/16</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">7/16</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1/16</td>
</tr>
</tbody>
</table>
<p>We have</p>
<ul>
<li>$h_X(0) = -\log_2 p_X(0) = -\log_2 1/2 = 1$</li>
<li>$h_Y(0) = -\log_2 p_Y(0) = -\log_2 1/2 = 1$</li>
<li>$h_{X,Y}(0 \and 0) = -\log_2 p_{X,Y}(0, 0) = -\log_2 1/16 = 4$</li>
</ul>
<p>$i(0,0) = h_X(0) + h_Y(0) - h_{X,Y}(0 \and 0) = -2$, and so $(0,0)$ is synergistic. On the other hand, $i(0,1) \approx 0.80735$, indicating $(0,1)$ is redundant.</p>
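The numbers above can be reproduced in a few lines of Python. A sketch; the helper <code>pmi</code> is my own name:

```python
import math

# Joint probability table from the example above:
joint = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}  # marginal of X
p_y = {0: 1/2, 1: 1/2}  # marginal of Y

# Pointwise mutual information i(x, y) = log2( p(x,y) / (p(x) p(y)) ).
def pmi(x, y):
    return math.log2(joint[(x, y)] / (p_x[x] * p_y[y]))

print(pmi(0, 0))  # -2.0      (negative: synergistic)
print(pmi(0, 1))  # ~0.80735  (positive: redundant)
```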
<h2 id="properties-of-pmi"><a class="header-anchor" href="#properties-of-pmi">Properties of PMI</a></h2>
<p>Let’s explore some of the properties of PMI. From here on out, I will consider sampling elementary events from a joint distribution, $(x, y) \sim (\rX, \rY)$, where $\rX, \rY$ are unspecified discrete (possibly infinite) random variables. For notational simplicity I’ll drop the subscripts from distributions, so $p(x), p(y)$ denote the marginals, $\rX$ and $\rY$ respectively, and $p(x, y)$ denotes the joint $(\rX,\rY)$.</p>
<p>To recap, PMI measures the difference in bits between the product of marginals $p(x)p(y)$ and the joint $p(x, y)$, as evidenced by</p>
<div class="kdmath">$$
\begin{align}
i(x, y) &= \log_2\left(\frac{p(x, y)}{p(x)p(y)}\right) \\
&= h(x) + h(y) - h(x \and y)\,.
\end{align}
$$</div>
<p>Negative PMI implies synergy, while positive PMI implies redundancy.</p>
<p>Another way to think about PMI is as a measure of how much $p(y \mid x)$ differs from $p(y)$ (and vice versa). Suppose an oracle sampled $(x, y) \sim (\rX,\rY)$, but the outcome $(x, y)$ remains hidden from you. $h(y)$ is the information you stand to gain by having $y$ revealed to you. However, $h(y \mid x)$ is what you stand to gain from seeing $y$ if $x$ is already revealed. You do not know how much information $x$ contains about $y$ without seeing $y$. Only the oracle knows this. However, if you know $p(y \mid x)$, then you can compute your expected information gain (conditional uncertainty), $H[\rY \mid \rX=x]$.</p>
<p>PMI measures the change in information you will gain about $y$ (from the oracle’s perspective) before and after $x$ is revealed (and vice versa). In this view, it makes sense to rewrite PMI as</p>
<div class="kdmath">$$
\begin{align}
i(x, y) &= \log_2\left(\frac{p(y \mid x)}{p(y)}\right) \\
&= -\log_2\,p(y) + \log_2\,p(y \mid x) \\
&= h(y) - h(y \mid x)\,.
\end{align}
$$</div>
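<p>As a sanity check, PMI is a one-liner to compute. Here is a minimal Python sketch; the joint table below is my own assumption, chosen to be consistent with the numbers in the worked example earlier (the post doesn’t spell out the full table):</p>

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information i(x, y) in bits.
    Returns -inf when the pair never co-occurs but each event can occur alone."""
    if p_xy == 0:
        return -math.inf
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical joint over two binary variables, consistent with the example:
# p(0,0) = 1/16 and uniform marginals.
joint = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}
p_y = {0: 1/2, 1: 1/2}

print(pmi(joint[(0, 0)], p_x[0], p_y[0]))  # -2.0: negative PMI, synergistic pair
print(pmi(joint[(0, 1)], p_x[0], p_y[1]))  # ≈ 0.807: positive PMI, redundant pair
```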
<h3 id="special-values"><a class="header-anchor" href="#special-values">Special Values</a></h3>
<p>If $\rX, \rY$ are independent, then $i(x, y) = 0$ for every pair $(x, y)$, since $p(x, y) = p(x)p(y)$. Verifying, we see that,</p>
<div class="kdmath">$$
\begin{align}
i(x, y) &= \log_2\left(\frac{p(x)p(y)}{p(x)p(y)}\right) \\
&= 0\,.
\end{align}
$$</div>
<p>The maximum possible PMI happens when $x$ and $y$ are perfectly associated, i.e. $p(y \mid x) = 1$ or $p(x \mid y) = 1$. So $h(y \mid x) = 0$ or vice versa, meaning you know everything about $y$ if you have $x$. Then $i(x, y) = h(y) - h(y \mid x) = h(y)$. In general, the maximum possible PMI is $\min\{h(x), h(y)\}$.</p>
<p>PMI has no minimum, and goes to $-\infty$ if $x$ and $y$ can never occur together but can occur separately, i.e. $p(x, y) = 0$ while $p(x), p(y) > 0$. We can see that $p(y \mid x) = p(x, y)/p(x) = 0$ so long as $p(x) > 0$. So $h(y \mid x) \to \infty$, and we have $i(x, y) = h(y) - h(y \mid x) \to -\infty$ if $h(y) > 0$.</p>
<p>While redundancy is bounded, synergy is unbounded. This should make sense, as $h(x), h(y)$ are bounded, so there is a maximum amount of information to redundantly share. On the other hand, synergy measures how rare the co-occurrence of $(x,y)$ is, relative to their marginal probabilities, where lower $p(x, y)$ means their co-occurrence is more special. So if $(x,y)$ can never occur, then their co-occurrence is infinitely special.</p>
<h2 id="expected-mutual-information"><a class="header-anchor" href="#expected-mutual-information">Expected Mutual Information</a></h2>
<p>Expected mutual information, also just called mutual information (MI), is given as</p>
<div class="kdmath">$$
\begin{align}
I[\rX, \rY] &= \E_{(x,y)\sim (\rX, \rY)}[i(x, y)] \\
&= \E_{(x,y)\sim (\rX, \rY)}\left[\log_2\left(\frac{p(x, y)}{p(x)p(y)}\right)\right]\,.
\end{align}
$$</div>
<p>$I$ is to correlation as $H$ is to variance. While correlation measures to what extent $\rX$ and $\rY$ have a <a href="https://en.wikipedia.org/wiki/Correlation_and_dependence">linear relationship</a>, $I$ measures the strength of their statistical dependency. While variance measures average distance from a central point (the mean), $H$ is distance agnostic, i.e. it measures unordered dispersion. Similarly, while statistical correlation measures deviation of the mapping between $\rX$ and $\rY$ from perfectly linear, $I$ is shape agnostic, i.e. it measures unordered statistical dependence.</p>
<p>First off, it is important to point out that $I$ is always non-negative, unlike its pointwise counterpart (proof <a href="https://math.stackexchange.com/a/159544">here</a>). You can see this intuitively by trying to construct an anti-dependent relationship between $\rX$ and $\rY$. On average, $p(x, y)$ would have to be less than the product of the marginals. You can construct individual cases where this is true for a particular $(x, y)$, but to do so, you have to pile up probability mass elsewhere in the joint table to compensate. This intuition is formalized by Jensen’s inequality. A direct consequence is $H[\rY] \geq H[\rY \mid \rX]$.</p>
<p>$I$ being non-negative means you can safely think about it as a measure of information content. In this sense, information is stored in the relationship between $\rX$ and $\rY$.</p>
<p>Note that by remembering that expectation is linear, some useful identities pop out of the definition above,</p>
<div class="kdmath">$$
\begin{align}
I[\rX, \rY] &= H[\rX] + H[\rY] - H[(\rX,\rY)] \\
&= H[\rX] - H[\rX \mid \rY] \\
&= H[\rY] - H[\rY \mid \rX]\,.
\end{align}
$$</div>
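<p>To make this concrete, here is a small Python sketch that computes $I$ as the expectation of PMI under the joint, and numerically checks the first identity above (the example joint is hypothetical):</p>

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I[X, Y] in bits, as the expectation of PMI under the joint distribution."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Hypothetical joint over two binary variables.
joint = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}
p_y = {0: 1/2, 1: 1/2}

mi = mutual_information(joint)
# Identity check: I[X,Y] = H[X] + H[Y] - H[(X,Y)], and I is non-negative.
assert abs(mi - (entropy(p_x) + entropy(p_y) - entropy(joint))) < 1e-12
assert mi >= 0
print(round(mi, 4))
```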
<p>An intuitive way to think about $I$ is as a continuous measure of <em>bijectivity</em> of the stochastic function, $g(x) \sim p(\rY \mid \rX = x)$, where $g : X \rightarrow Y$. This is easier to see if we write</p>
<div class="kdmath">$$
I[\rX, \rY] = H[\rY] - H[\rY \mid \rX]\,.
$$</div>
<p>$H[\rY]$ measures <em>surjectivity</em>, i.e. how much $g$ spreads out over $Y$ (marginalized over $\rX$). <em>surjective</em> (a.k.a. onto) in the set-theory sense means that $g$ maps to every element in $Y$. In the statistical sense, $g$ may map to every element in $Y$ with some probability, but to some elements much more frequently than others. We would say $p(y)$ is <em>peaky</em>, the opposite of spread out. Recall that $H$ measures statistical dispersion. Larger $H[\rY]$ means more even spread of probability mass across all the elements in $Y$. In that sense, it measures how surjective $g$ is.</p>
<p>$H[\rY \mid \rX]$ measures <em>anti-injectivity</em>. <em>injective</em> (a.k.a. one-to-one) in the set-theory sense means that $g$ maps every element in $X$ to a unique element in $Y$. There is no sharing, and you know which $x \in X$ was the input for any $y \in Y$ in the image of $g(X)$. In the statistical sense, $g$ may map a given $x$ to many elements in $Y$, each with some probability, i.e. fan-out. Anti-injective is like a reversal of injective, which is about fan-in. The more $g$ fans-out, the more anti-injective it is. Recall that $H[\rY \mid \rX]$ measures average uncertainty about $\rY$ given an observation from $\rX$. This is, in a sense, the average statistical fan-out of $g$. Lower $H[\rY \mid \rX]$ means $g$’s output is more concentrated (peaky) on average for a given $x$, and higher means its output is more uniformly spread on average.</p>
<p>For a function to be a bijection, it needs to be both injective and surjective. $H[\rY]$ may seem like a good continuous proxy for surjectivity, but $H[\rY \mid \rX]$ seems to measure something different from injectivity. Notice that $H[\rY \mid \rX]$ is affected by the injectivity of $g^{-1}$. If $g^{-1}$ maps many $y$s to the same $x$, then we are uncertain about what $g(x)$ should be.</p>
<p>In general, I claim that $I[\rX, \rY]$ measures how bijective $g$ is. $I[\rX, \rY]$ is maximized when $H[\rY]$ is maximized and $H[\rY \mid \rX]$ is minimized (i.e. 0). That is, when $g$ is maximally surjective and minimally anti-injective, implying it is maximally injective. Higher $I[\rX, \rY]$ actually does indicate that $g$ is more invertible because $I$ is symmetric. It measures how much information can flow through $g$ in either direction.</p>
<figure><img src="https://upload.wikimedia.org/wikipedia/commons/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg" alt="Useful diagram for keeping track of the relationships between these concepts.<br/>Credit: <a href="https://en.wikipedia.org/wiki/Mutual_information">https://en.wikipedia.org/wiki/Mutual_information</a>" width="100%" /><figcaption>Useful diagram for keeping track of the relationships between these concepts.<br />Credit: <a href="https://en.wikipedia.org/wiki/Mutual_information">https://en.wikipedia.org/wiki/Mutual_information</a></figcaption></figure>
<h3 id="channel-capacity"><a class="header-anchor" href="#channel-capacity">Channel capacity</a></h3>
<p>$I$ is determined by $p(x)$ just as much as $p(y \mid x)$, but $g$ has ostensibly nothing to do with $p(x)$. If we want $I$ to measure properties of $g$ in isolation, it should not care about the distribution over its inputs. One solution to this issue is to use the <a href="https://en.wikipedia.org/wiki/Channel_capacity#Formal_definition"><strong>capacity</strong></a> of $g$, defined as</p>
<div class="kdmath">$$
\begin{align}
C[g] &= \sup_{p_X(x)} I[\rX, \rY] \\
&= \sup_{p_X(x)} \E_{y\sim p_{g(x)}, x \sim p_X}[i(x, y)] \\
&= \sup_{p_X(x)} \E_{y\sim p_{Y \mid X=x}, x \sim p_X}[h(y) - h(y \mid x)]\,.
\end{align}
$$</div>
<p>In other words, if you don’t have a preference for $p(x)$, choose $p(x)$ which maximizes $I[\rX, \rY]$.</p>
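<p>As a toy illustration (my own, not from the post), the capacity of a binary symmetric channel can be found by brute-force search over input distributions; the standard closed form $C = 1 - H_b(\epsilon)$ gives us something to check against:</p>

```python
import math

def h_b(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mi_bsc(p, eps):
    """I[X, Y] for a binary symmetric channel with input P(X=1)=p and flip probability eps."""
    q = p * (1 - eps) + (1 - p) * eps  # P(Y = 1)
    return h_b(q) - h_b(eps)           # H[Y] - H[Y|X]

eps = 0.1
# Capacity: maximize I over the input distribution (brute-force grid search over p).
cap = max(mi_bsc(p / 1000, eps) for p in range(1001))
print(round(cap, 4), round(1 - h_b(eps), 4))  # both ≈ 0.531, achieved at p = 1/2
```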
<h1 id="shannon-information-for-continuous-distributions"><a class="header-anchor" href="#shannon-information-for-continuous-distributions">Shannon Information For Continuous Distributions</a></h1>
<p>Up to now we’ve only considered discrete distributions. Describing the information content in continuous distributions and their events is tricky business, and a bit more nuanced than usually portrayed. Let’s explore this.</p>
<p>For this discussion, let’s consider a random variable $\rX$ with <a href="https://en.wikipedia.org/wiki/Support_(mathematics)#Support_of_a_distribution">support</a> over $\R$. Let $f(x)$ be the probability density function (pdf) of $\rX$.</p>
<p>Elementary events $x \in \R$ <span class="marginnote-outer"><span class="marginnote-ref">do not have probabilities per se</span><label for="81ebfb9082ef06c6091bdf79ef229fe6555037fd" class="margin-toggle"> ⊕</label><input type="checkbox" id="81ebfb9082ef06c6091bdf79ef229fe6555037fd" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">you could say their probability mass is 0 in the limit</span></span></span>. Self-information is a function of probability mass, so we should instead compute self-info of events that are intervals (or measurable sets) over $\R$. For example,</p>
<div class="kdmath">$$
\begin{align}
h(a < x < b) &= -\log_2\,p(a < \rX < b)\\
&= -\log_2\left(\int_a^b f(x) \diff x\right)
\end{align}
$$</div>
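<p>For instance, if $\rX$ is standard normal, the probability mass of an interval is computable from the CDF, and its self-information follows. A small Python sketch (the choice of distribution and intervals is mine):</p>

```python
import math

def normal_cdf(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def interval_self_info(a, b):
    """h(a < X < b) in bits for X ~ N(0, 1)."""
    mass = normal_cdf(b) - normal_cdf(a)
    return -math.log2(mass)

print(interval_self_info(-1, 1))  # ≈ 0.55 bits: a likely interval carries little info
print(interval_self_info(3, 4))   # ≈ 9.57 bits: a rare interval carries much more
```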
<p>Conjecture: The entropy of any distribution with uncountable support is infinite. This should make sense, as we now have uncountably many possible outcomes. One observation rules out infinitely many alternatives, so it should contain infinite information. We can see this clearly because the entropy of a uniform distribution over $N$ possibilities is $\log_2 N$ which grows to infinity as $N$ does. On the other hand, a one-hot distribution over $N$ possibilities has 0 entropy, because you will <span class="marginnote-outer"><span class="marginnote-ref">always observe</span><label for="2b312968d2ec3765f14ce161dd47e1212e3f03cc" class="margin-toggle"> ⊕</label><input type="checkbox" id="2b312968d2ec3765f14ce161dd47e1212e3f03cc" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Unless you observe an impossible outcome, in which case you gain infinite information!</span></span></span> the probability-1 outcome and gain 0 information. So we expect the Dirac-delta distribution to have 0 entropy.</p>
<p>But wait, the Gaussian distribution is a <a href="https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution">maximum-entropy</a> distribution. That people can say “a continuous distribution has maximum entropy” implies their entropies can be numerically compared! And frankly, people talk about entropy of continuous distributions all the time, and they are very much finite! It turns out, what people normally call entropy for continuous distributions is actually <a href="https://en.wikipedia.org/wiki/Differential_entropy">differential entropy</a>, which is not the same thing as the $H$ we’ve been working with.</p>
<p>I’ll show that $H[\rX]$ is infinite when the distribution has continuous support, <span class="marginnote-outer"><span class="marginnote-ref">following a similar proof in <a href="https://www.crmarsh.com/static/pdf/Charles_Marsh_Continuous_Entropy.pdf">Introduction to Continuous Entropy</a></span><label for="6a33d3d476c15c73c5fec235f81ff2c69eec7113" class="margin-toggle"> ⊕</label><input type="checkbox" id="6a33d3d476c15c73c5fec235f81ff2c69eec7113" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">And also in Elements of Information Theory section 8.3.</span></span></span>. To do that, let’s take a <a href="https://en.wikipedia.org/wiki/Riemann_sum">Riemann sum</a> of $f(x)$. Let $\{x_i\}_{i=-\infty}^\infty$ be a set of points equally spaced by intervals of $\Delta$.</p>
<div class="kdmath">$$
% \def\u{\Delta x}
\def\u{\Delta}
\begin{align}
H[\rX] &= -\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(f(x_i) \u\right) \\
&= -\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(f(x_i)\right) - \lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(\u\right)\,.
\end{align}
$$</div>
<p>The left term is just the negative Riemann integral of $f(x)\log_2(f(x))$, which I will define as <span class="marginnote-outer"><span class="marginnote-ref"><strong>differential entropy</strong></span><label for="642b79d7e276fc9865eef223339f4eb4b1a72b14" class="margin-toggle"> ⊕</label><input type="checkbox" id="642b79d7e276fc9865eef223339f4eb4b1a72b14" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Typically $h$ is used to denote differential entropy, but I’ve already used it for self-information, so I’m using $\eta$ instead.</span></span></span>:</p>
<div class="kdmath">$$
\eta[f] := -\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(f(x_i)\right) = -\int_{-\infty}^\infty f(x) \log_2\left(f(x)\right) \diff{x}\,.
$$</div>
<p>The right term can be simplified using the <a href="https://tutorial.math.lamar.edu/Classes/CalcI/LimitsProperties.aspx">limit product rule</a>:</p>
<div class="kdmath">$$
-\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(\u\right) = -\left(\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u\right)\cdot\left(\lim\limits_{\u \to 0}\log_2\left(\u\right)\right)\,.
$$</div>
<p>Note that</p>
<p><span class="kdmath">$\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u = \int_{-\infty}^\infty f(x) \diff{x} = 1\,,$</span><br />
because $f(x)$ is a p.d.f.</p>
<p>Putting it all together we have</p>
<div class="kdmath">$$
H[\rX] = \eta[f] - \lim\limits_{\u \to 0}\log_2\left(\u\right)\,.
$$</div>
<p>$\log_2(\u) \to -\infty$ as $\u \to 0$, so $H[\rX]$ explodes to infinity when $\eta[f]$ is finite, which it is for most well-behaved functions.</p>
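<p>We can watch this blow-up numerically. The sketch below (my own, working in bits) bins a standard normal into cells of width $\Delta$ and shows the binned entropy tracking $\eta[f] - \log_2 \Delta$ as $\Delta$ shrinks:</p>

```python
import math

def gaussian_pdf(x):
    """Density of the standard normal."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def discretized_entropy(delta, lo=-10.0, hi=10.0):
    """Entropy in bits of N(0,1) binned into cells of width delta (midpoint rule)."""
    n = int((hi - lo) / delta)
    masses = [gaussian_pdf(lo + (i + 0.5) * delta) * delta for i in range(n)]
    return -sum(m * math.log2(m) for m in masses if m > 0)

# Differential entropy of N(0,1) in bits: (1/2) log2(2*pi*e) ≈ 2.047.
eta = 0.5 * math.log2(2 * math.pi * math.e)
for delta in (0.5, 0.1, 0.02):
    # Binned entropy ≈ eta - log2(delta): it grows without bound as delta -> 0.
    print(delta, round(discretized_entropy(delta), 3), round(eta - math.log2(delta), 3))
```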
<p>By contrast, $H$ is finite for any distribution supported on a finite set: the sum defining $H$ then has only finitely many non-zero terms, each of which is finite.</p>
<p>Differential entropy is very different from entropy. It can be unboundedly negative. For example, the differential entropy of a Gaussian distribution with variance $\sigma^2$ is $\frac{1}{2}\ln(2\pi e \sigma^2)$. Taking the limit as $\sigma \to 0$, we see the differential entropy of the <span class="marginnote-outer"><span class="marginnote-ref">Dirac-delta distribution is $-\infty$</span><label for="db0f66f8ebc12d35b9801d084e4ee4afbd893dc5" class="margin-toggle"> ⊕</label><input type="checkbox" id="db0f66f8ebc12d35b9801d084e4ee4afbd893dc5" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Plugging $\eta[f] = -\infty$ into our relation $H[\rX] = \eta[f] - \lim\limits_{\u \to 0}\log_2\left(\u\right)$, we see why entropy of $\delta(x)$ would be 0.</span></span></span>. A notable problem with differential entropy is that it’s not invariant under a change of coordinates, and there is a proposed fix for that: <a href="https://en.wikipedia.org/wiki/Limiting_density_of_discrete_points">https://en.wikipedia.org/wiki/Limiting_density_of_discrete_points</a>.</p>
<h2 id="proof-that-mi-is-fininte-for-continuous-distributions"><a class="header-anchor" href="#proof-that-mi-is-fininte-for-continuous-distributions">Proof that MI is finite for continuous distributions</a></h2>
<p>A very nice result is that expected mutual information is finite where entropy would be infinite, so long as there is some amount of noise between the two random variables. This implies that even if physical processes are continuous and contain infinite information, we can only get finite information out of them, because measurement requires establishing a statistical relation between the measurement device and that system which is always noisy in reality. MI is agnostic to discrete or continuous universes! As long as there is some amount of noise in between a system and your measurement, your measurement will contain finite information about the system.</p>
<p>The proof follows the same Riemann sum approach from the previous section. I will show that mutual information and differential mutual information are equivalent. Since differential mutual information is finite for well-behaved densities, so is mutual information!</p>
<div class="kdmath">$$
\begin{align}
I[\rX, \rY] &= \lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty \sum\limits_{j=-\infty}^\infty f_{XY}(x_i, y_j) \u^2 \log_2\left(\frac{f_{XY}(x_i, y_j)\u^2}{f_X(x_i)\u\, f_Y(y_j)\u} \right) \\
&= \lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty \sum\limits_{j=-\infty}^\infty f_{XY}(x_i, y_j) \u^2 \log_2\left(\frac{f_{XY}(x_i, y_j)}{f_X(x_i)f_Y(y_j)} \right) \\
&= \int_{-\infty}^\infty \int_{-\infty}^\infty f_{XY}(x, y) \log_2\left(\frac{f_{XY}(x, y)}{f_X(x)f_Y(y)} \right) \diff{y}\diff{x}\,
\end{align}
$$</div>
<p>because the $\Delta$s cancel inside the log.</p>
<p>If $p(\rY \mid \rX = x)$ is a Dirac-delta for all $x$, and $p(\rY)$ has continuous support, then $I[\rX, \rY]= H[\rY] - H[\rY \mid \rX] = \infty$ because $H[\rY]=\infty$ and $H[\rY \mid \rX]=0$. Thus some noise between $\rX$ and $\rY$ is required to make the MI finite. It follows that $I[\rX, \rX] = H[\rX] = \infty$ when $\rX$ has continuous support.</p>
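<p>Here is a numerical sketch of the claim (the jointly Gaussian example, $\rY = \rX + \text{noise}$ with unit variances, is my choice; its continuous MI is known to be $\frac{1}{2}\log_2 2 = 0.5$ bits). The discretized MI stays near 0.5 as the grid is refined, even though the discretized entropies diverge:</p>

```python
import math

def gauss(x, var):
    """Density of a zero-mean Gaussian with the given variance."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def discretized_mi(delta, lim=8.0):
    """MI in bits between X ~ N(0,1) and Y = X + N(0,1), on a grid of cell width delta."""
    n = int(2 * lim / delta)
    pts = [-lim + (i + 0.5) * delta for i in range(n)]
    # Joint mass per cell: f_X(x) * f_noise(y - x) * delta^2 (midpoint rule).
    joint = [[gauss(x, 1.0) * gauss(y - x, 1.0) * delta * delta for y in pts]
             for x in pts]
    px = [sum(row) for row in joint]                # marginal masses of X
    py = [sum(col) for col in zip(*joint)]          # marginal masses of Y
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

# True continuous MI is 0.5 bits; the discretization stays near it as delta shrinks,
# while the corresponding discretized entropies grow without bound.
for delta in (0.5, 0.25):
    print(delta, round(discretized_mi(delta), 3))
```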
<h1 id="problems-with-shannon-information"><a class="header-anchor" href="#problems-with-shannon-information">Problems With Shannon Information</a></h1>
<p><strong>Question:</strong> Do the concepts just outlined capture our colloquial understanding of information? Are there situations where they behave differently from how we expect information to behave? I’ll go through some fairly immediate objections to Shannon’s definition of information, and some remedies.</p>
<h2 id="1-tv-static-problem"><a class="header-anchor" href="#1-tv-static-problem">1. TV Static Problem</a></h2>
<p>Imagine a TV displaying static noise. If we assume a fairly uniform distribution over all “static noise” images, we know that the entropy of the TV visuals will be high, because probability mass is spread fairly evenly across all possible images. Each image on average has a very low probability of occurring. According to Shannon, each image then contains a large amount of information.</p>
<p>That may sound absurd. <a href="https://en.wikipedia.org/wiki/Noise_(signal_processing)">Noise</a>, by some definitions, carries no useful information. Noise is uninformative. To a human looking at TV static, the information gained is that the TV is not displaying anything. This is a very high level piece of information, but much less than the supposedly high information content of the static itself.</p>
<figure><img src="/assets/posts/primer-shannon-information/tv-static.png" alt="" width="100%" /><figcaption></figcaption></figure>
<p>The resolution here is to define what it means for a human to obtain information. I propose looking at the mutual information between the TV and the viewer’s brain. Let $\rX$ be a random variable over TV images, and $\rZ$ be a random variable over the viewer’s brain states. The support of $\rX$ is the space of all possible TV screens, so static and SpongeBob are just different distributions over the same space. Now, the state of the viewer’s brain is causally connected to what is on the TV screen, but the nature of their visual encoder (visual cortex) determines $p(\rZ \mid \rX)$, and thus $p(\rZ, \rX)$. I would guess that any person who says TV static is uninformative does not retain much detail about the patterns in the static. Basically, that person would just remember that they saw static. What we have here is a region of large fan-in. Many static images are collapsed to a single output for their visual encoder, namely the label “TV noise”. So the information contained in TV static is low to a human, because $I[\rX, \rZ]$ is low when $\rX$ is the distribution of TV static.</p>
<p>Note that the signal, “TV noise”, is still rather informative, if you consider the space of all possible labels you could assign to the TV screen, e.g. “SpongeBob” or “sitcom”. Further, that you are looking at a TV and not anything else is information.</p>
<h2 id="2-shannon-information-is-blind-to-scrambling"><a class="header-anchor" href="#2-shannon-information-is-blind-to-scrambling">2. Shannon Information is Blind to Scrambling</a></h2>
<p>Encryption scrambles information to make it inaccessible to prying eyes. Encryption is usually lossless, meaning the original message is fully recoverable. If $\rX$ is a distribution over messages, then the encryption function Enc should preserve that distribution. To Shannon information, $\rX$ and $\text{Enc}(\rX)$ contain the same information. Shannon information is therefore blind to operations like scrambling which do something interesting to the information present, i.e. like making it accessible or inaccessible.</p>
<p>The resolution is again mutual information. While permuting message space (or any bijective transformation) does not change information content under Shannon, it changes the useful information content. A human looking at (or otherwise perceiving) a message is creating a causal link between the message and a representation in the brain. This link has mutual information. Likewise, any measurement apparatus establishes a link between physical state and a representation of that state (the measurement result), again establishing mutual information.</p>
<p>Information in a message becomes inaccessible or useless when the representation of the message cannot distinguish between two messages. Encryption maps the part of message space that human brains can discriminate, i.e. meaningful English sentences (or other such meaningful content) to a part of message space that humans cannot discriminate, i.e. apparently arbitrary character strings. These arbitrary strings appear to be meaningless because they are all mapped to the same or similar representation in our heads, namely the “junk text” label. In short, mutual information between plain text and brain states is much higher than mutual information between encrypted text and brain states.</p>
<h2 id="3-deterministic-information"><a class="header-anchor" href="#3-deterministic-information">3. Deterministic information</a></h2>
<p>How does data on a disk contain information if it is fixed and known? How does the output of a deterministic computer program contain information? How do math proofs contain information? All these things do not have an inherent probability distribution. If there is uncertainty, we might call it <span class="marginnote-outer"><span class="marginnote-ref">logical uncertainty</span><label for="0e18784002a64f3be42ab79f74a4282e0d11c58f" class="margin-toggle"> ⊕</label><input type="checkbox" id="0e18784002a64f3be42ab79f74a4282e0d11c58f" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">See <a href="https://intelligence.org/2016/04/21/two-new-papers-uniform/">New papers dividing logical uncertainty into two subproblems</a><br />and <a href="https://golem.ph.utexas.edu/category/2016/09/logical_uncertainty_and_logica.html">Logical Uncertainty and Logical Induction</a></span></span></span>. It is an open question whether logical uncertainty and empirical uncertainty should be conflated, and both brought under the umbrella of probability theory.</p>
<p>This is similar to asking, how does Shannon information account for what I already know? When I observe a message I didn’t already know, it is informative; but what about the information contained in messages I currently have? It is also an open question whether probability should be considered objective or subjective, and whether quantities of information are objective or subjective. Perhaps you regard a message you have to be informative, because you are implicitly modeling its information w.r.t. some other receiver who has not yet received it.</p>
<h2 id="4-if-the-universe-is-continuous-everything-contains-infinite-information"><a class="header-anchor" href="#4-if-the-universe-is-continuous-everything-contains-infinite-information">4. If the universe is continuous everything contains infinite information</a></h2>
<p>This one is resolved by the discussion above about mutual information of continuous distributions being finite, so long as there is noise between the two random variables. Thus, in a universe where all measurements are noisy, mutual information is always finite regardless of the underlying meta-physics (whether objects contain finite or infinite information in an absolute sense).</p>
<h2 id="5-shannon-information-ignores-the-meaning-of-messages"><a class="header-anchor" href="#5-shannon-information-ignores-the-meaning-of-messages">5. Shannon information ignores the meaning of messages</a></h2>
<p>There is a competing information theory, <a href="https://en.wikipedia.org/wiki/Algorithmic_information_theory">algorithmic information theory</a>, which uses the length of the shortest program that can output a message $x$ as the information measure of $x$, called <a href="https://en.wikipedia.org/wiki/Kolmogorov_complexity">Kolmogorov complexity</a>. If $x$ is less compressible, it contains more information. This is analogous to low $p_X(x)$ leading to its optimal <a href="https://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding">Shannon-Fano</a> code being longer, and thus containing more information.</p>
<p>Algorithmic information theory addresses the criticism that $h(x)$ depends only on the probability of $x$, rather than the meaning of $x$. If $x$ is a word, sentence, or even a book, the information content of $x$ supposedly does not depend on what the text is! Algorithmic information theory defines information as a property of the content of $x$ as a string, and drops the dependency on probability.</p>
<p>I think this criticism does not consider what <em>meaning</em> is. A steel-man’ed Shannon information at least seems self-consistent to me. Again, the right approach is to use mutual information. I propose that the meaning of a piece of text is ultimately due to the brain state it invokes in you when you read it. Your <a href="https://www.deeplearningbook.org/contents/representation.html">representation</a> of the text shares information with the text. So while yes the probability of $x$ in the void may be meaningless, the joint probability of $(x, z)$ where $z$ is your brain state is what gives $x$ meaning. Shannon information being blind to what we are calling the contents of a message can be seen as a virtue. In other words, Shannon is blind to <em>preconceived</em> meaning. While statistical variance cares about the Euclidean distance between points in $\mathbb{R}^n$, entropy does not and should not if the mathematical representation of these points as vectors is not important. Shannon does not care what you label your points! Their meaning comes solely from their co-occurrence with other random variables.</p>
<p>I think condensing a string of text, like a book, into one random variable $\rX$ is very misleading, because this distribution factors! A book is a single outcome from a distribution over all strings of characters, and we write this distribution as $p(\rC_i \mid \rC_{i-1}, \ldots, \rC_2, \rC_1)$ where $\rC_i$ is the random variable for the $i$-th character in the book. In this way, each character position contains semantic information in its probability distribution conditioned on the previous character choices. The premise of <a href="https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf">language modeling</a> in machine learning is that the statistical relationships between words (their frequencies of co-occurrence) in a corpus of text <span class="marginnote-outer"><span class="marginnote-ref">determine their meaning</span><label for="420e4613ec913eb3ebc7f105b4ba7df0378bbd7b" class="margin-toggle"> ⊕</label><input type="checkbox" id="420e4613ec913eb3ebc7f105b4ba7df0378bbd7b" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The theory goes that a computer which can estimate frequencies of words very precisely would implicitly have to create internal representations of those words which encode their meaning, and so beefed up language modeling is all that is needed for intelligence.</span></span></span>.</p>
<h2 id="6-probability-distributions-are-not-objective"><a class="header-anchor" href="#6-probability-distributions-are-not-objective">6. Probability distributions are not objective</a></h2>
<p>I touched on this already. Probability has two interpretations: frequentist (objective) and Bayesian (subjective). It is unclear if frequentist probability is an objective property of matter. For repeatable controlled experiments, a frequentist description is reasonable, like in games of chance, and in statistical mechanics and quantum mechanics. When probability is extended to systems that don’t repeat in any meaningful sense, like the stock market or historical events, the objectiveness is dubious. There is a camp that argues probability should reflect the state of belief of an observer, and is more a measurement of the brain doing the observing than the thing being observed.</p>
<p>So then this leads to an interesting question: is Shannon information a property of a system being observed, or a property of the observer in relation to it (or both together)? Is information objective in the sense that multiple independent parties can do measurements to verify a quantity of information, or is it subjective in the sense that it depends on the beliefs of the person doing the calculating? I am not aware of any answer or consensus on this question for information in general.</p>
<h1 id="appendix"><a class="header-anchor" href="#appendix">Appendix</a></h1>
<h2 id="properties-of-conditional-entropy"><a class="header-anchor" href="#properties-of-conditional-entropy">Properties of Conditional Entropy</a></h2>
<p>Source: <a href="https://en.wikipedia.org/wiki/Conditional_entropy#Properties">https://en.wikipedia.org/wiki/Conditional_entropy#Properties</a></p>
<p>$H[\rY \mid \rX] = H[(\rX, \rY)] - H[\rX]$</p>
<p>Bayes’ rule of conditional entropy:<br />
$H[\rY \mid \rX] = H[\rX \mid \rY] - H[\rX] + H[\rY]$</p>
<p>Minimum value:<br />
$H[\rY \mid \rX] = 0$ when $p(y \mid x)$ is always deterministic, i.e. one-hot, i.e. $p(y \mid x) \in \{0, 1\}$ for all $(x, y) \in X \times Y$.</p>
<p>Maximum value:<br />
$H[\rY \mid \rX] = H[\rY]$ when $\rX, \rY$ are independent.</p>
<h2 id="bayes-rule"><a class="header-anchor" href="#bayes-rule">Bayes’ Rule</a></h2>
<div class="kdmath">$$
p(y \mid x) = p(x \mid y)p(y)/p(x)
$$</div>
<p>can be rewritten in terms of self-information:</p>
<div class="kdmath">$$
h(y \mid x) = h(x \mid y) + h(y) - h(x)\,.
$$</div>
<p>The information contained in $y$ given $x$ equals the information contained in $x$ given $y$, plus the information contained in $y$, minus the information contained in $x$. This is just Bayes’ rule in log-space, but makes it a bit easier to reason about what Bayes’ rule is doing. Whether $y$ is likely in its own right and whether $x$ is likely given $y$ both contribute to the total information.</p>
<h2 id="cross-entropy-and-kl-divergence"><a class="header-anchor" href="#cross-entropy-and-kl-divergence">Cross Entropy and KL-Divergence</a></h2>
<p>Unlike everything we’ve seen so far, these are necessarily functions of probability functions, rather than random variables. Further, these are both comparisons of probability functions over the same support.</p>
<div class="kdmath">$$
H[P,Q] = -\sum_x P(x)\log Q(x)
$$</div>
<div class="kdmath">$$
\kl{P}{Q} = \sum_{x} P(x)\log
{\frac{P(x)}{Q(x)}}
$$</div>
<div class="kdmath">$$
\kl{P}{Q} = H[P,Q] - H[P]
$$</div>
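<p>These three quantities are easy to compute and check against each other. A minimal Python sketch (the distributions are hypothetical):</p>

```python
import math

def cross_entropy(p, q):
    """H[P, Q] in bits; diverges if Q assigns zero mass where P does not."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

def entropy(p):
    """H[P] is the cross entropy of P with itself."""
    return cross_entropy(p, p)

def kl(p, q):
    """KL divergence D(P || Q) in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

P = {'a': 0.5, 'b': 0.25, 'c': 0.25}
Q = {'a': 0.25, 'b': 0.25, 'c': 0.5}

# Identity check: D(P || Q) = H[P, Q] - H[P].
assert abs(kl(P, Q) - (cross_entropy(P, Q) - entropy(P))) < 1e-12
print(round(kl(P, Q), 4))  # 0.25
```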
<p>Sources:</p>
<ul>
<li><a href="https://stats.stackexchange.com/questions/111445/analysis-of-kullback-leibler-divergence">https://stats.stackexchange.com/questions/111445/analysis-of-kullback-leibler-divergence</a></li>
<li><a href="https://stats.stackexchange.com/questions/357963/what-is-the-difference-cross-entropy-and-kl-divergence">https://stats.stackexchange.com/questions/357963/what-is-the-difference-cross-entropy-and-kl-divergence</a></li>
</ul>
<p>Mutual information can be <a href="https://en.wikipedia.org/wiki/Mutual_information#Relation_to_Kullback%E2%80%93Leibler_divergence">written in terms of KL-divergence</a>:</p>
<div class="kdmath">$$
I[\rX, \rY] = \kl{p_{X,Y}}{p_X \cdot p_Y} = \E_{x \sim \rX}\left[\kl{p_{Y\mid X}}{p_Y}\right]\,,
$$</div>
<p>where $(p_X \cdot p_Y)(x, y) = p_X(x) \cdot p_Y(y)$ and $p_{Y\mid X}(y \mid x) = p_{X,Y}(x,y)/p_X(x)$.</p>
<h1 id="acknowledgments"><a class="header-anchor" href="#acknowledgments">Acknowledgments</a></h1>
<p>I would like to thank John Chung for extensive and Aneesh Mulye for excruciating feedback on the structure and language of this post.</p>
Tue, 09 Jun 2020 00:00:00 -0700
danabo.github.io/zhat/articles/primer-shannon-information