Z-Hat: A (we)blog devoted to finding better representations
pragmanym.github.io/zhat/
Fri, 20 Nov 2020 18:35:03 -0800
Primer to Probability Theory and Its Philosophy
<p>Probability is a measure defined on events, which are sets of primitive outcomes. Probability theory mostly comes down to constructing events and measuring them. A measure is a generalization of size that corresponds to length, area, and volume (rather than the bijective-mapping definition of cardinality).</p>
<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#definitions" id="markdown-toc-definitions">Definitions</a> <ul>
<li><a href="#beginner" id="markdown-toc-beginner">Beginner</a></li>
<li><a href="#full-definition" id="markdown-toc-full-definition">Full Definition</a></li>
<li><a href="#kolmogorov-axioms-of-probability" id="markdown-toc-kolmogorov-axioms-of-probability">Kolmogorov axioms of probability</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a></li>
<li><a href="#pmfs-and-pdfs-and-measures-oh-my" id="markdown-toc-pmfs-and-pdfs-and-measures-oh-my">PMFs and PDFs and measures, oh my!</a></li>
<li><a href="#events-vs-samples" id="markdown-toc-events-vs-samples">Events vs samples</a></li>
</ul>
</li>
<li><a href="#constructing-events" id="markdown-toc-constructing-events">Constructing events</a> <ul>
<li><a href="#random-variables" id="markdown-toc-random-variables">Random variables</a> <ul>
<li><a href="#motivation-1-information-hiding" id="markdown-toc-motivation-1-information-hiding">Motivation 1: Information hiding</a> <ul>
<li><a href="#examples-1" id="markdown-toc-examples-1">Examples</a></li>
</ul>
</li>
<li><a href="#motivation-2-syntactic-sugar" id="markdown-toc-motivation-2-syntactic-sugar">Motivation 2: Syntactic sugar</a> <ul>
<li><a href="#probability-distribution-of-a-random-variable" id="markdown-toc-probability-distribution-of-a-random-variable">Probability distribution of a random variable</a></li>
<li><a href="#notational-confusion" id="markdown-toc-notational-confusion">Notational confusion</a></li>
</ul>
</li>
<li><a href="#motivation-3-construct-events-that-are-guaranteed-measurable" id="markdown-toc-motivation-3-construct-events-that-are-guaranteed-measurable">Motivation 3: Construct events that are guaranteed measurable</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#almost-surely" id="markdown-toc-almost-surely">Almost surely</a> <ul>
<li><a href="#throwing-darts" id="markdown-toc-throwing-darts">Throwing darts</a></li>
<li><a href="#borels-law-of-large-numbers" id="markdown-toc-borels-law-of-large-numbers">Borel’s law of large numbers</a></li>
</ul>
</li>
<li><a href="#primer-to-measure-theory" id="markdown-toc-primer-to-measure-theory">Primer to measure theory</a></li>
</ul>
<script type="math/tex; mode=display">\newcommand{\bin}{\mathbb{B}}
\newcommand{\nat}{\mathbb{N}}
\newcommand{\real}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\d}{\mathrm{d}}
\newcommand{\len}[1]{\ell\left(#1\right)}
\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\bigmid}{\;\middle\vert\;}</script>
<p>Sections:</p>
<ol>
<li><a href="#definitions">Definitions</a> - explain the definition of probability.</li>
<li><a href="#constructing-events">Constructing events</a> - explain random variable notation.</li>
<li><a href="#almost-surely">Almost surely</a> - a philosophical excursion into the interpretation of probability.</li>
<li><a href="#primer-to-measure-theory">Primer to measure theory</a> - a brief introduction to measure theory.</li>
</ol>
<p>Main references:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Probability_axioms#Axioms">https://en.wikipedia.org/wiki/Probability_axioms#Axioms</a></li>
<li><a href="https://en.wikipedia.org/wiki/Measure_space">https://en.wikipedia.org/wiki/Measure_space</a></li>
<li><a href="https://en.wikipedia.org/wiki/Random_variable#Measure-theoretic_definition">https://en.wikipedia.org/wiki/Random_variable#Measure-theoretic_definition</a></li>
<li><a href="http://statweb.stanford.edu/~souravc/stat310a-lecture-notes.pdf">http://statweb.stanford.edu/~souravc/stat310a-lecture-notes.pdf</a></li>
<li><a href="https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf">https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf</a></li>
</ul>
<p>The first half of this article is ostensibly devoted to explaining the definition of probability, but that is not my priority. I’m most interested in providing a useful conceptual map, asking and discussing interesting questions, and developing intuition. I provide many links to technical details and further readings. My opening exposition on definitions is brief. If it does not all make sense, please look at other resources. Hopefully this article at least makes those other sources easier to use.</p>
<p>This post is also a pedagogical experiment. I structured this article to be read twice. The first pass is without measure theory, and the second pass is with measure theory. Measure-theory content is hidden by default, e.g. <span class="advanced outer hidden"><span class="advanced inner hidden">like this</span></span>. Simply ignore <span class="advanced outer hidden"><span class="advanced inner hidden">purple text</span></span> the <span class="marginnote-outer"><span class="marginnote-ref">first time</span><label for="796b5f2304883e2b8a1737199bea4fcbd95c2af7" class="margin-toggle"> ⊕</label><input type="checkbox" id="796b5f2304883e2b8a1737199bea4fcbd95c2af7" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Unless you are already acquainted with measure theory, but then you can just look at <a href="https://en.wikipedia.org/wiki/Probability_axioms#Axioms">Wikipedia’s definition of probability</a> to understand the gist of probability theory.</span></span></span> you read this post. Then in the <a href="#primer-to-measure-theory">measure theory section</a> at the end of this post you will see a button to show all the hidden text (and you can just click on <span class="advanced outer hidden"><span class="advanced inner hidden">purple text</span></span> anywhere in the post to show it).</p>
<p>Why? Because I see plenty of introductions to probability that leave out measure theory entirely. The problem with them is that a lot of the common probability notation, e.g. random variables, only really makes sense when you understand measures. On the other hand, if you crack open a rigorous text on probability theory (e.g. <a href="https://www.goodreads.com/book/show/383472.Statistical_Inference">Casella &amp; Berger</a> or <a href="https://www.springer.com/gp/book/9780387953823">Shao</a>), it may not be obvious why all this extra complexity with events, sigma-algebras and measure spaces is necessary.</p>
<p>When I was learning probability and measure theory myself, I wished for a resource that provides both precise definitions and intuitions for why those definitions are the way they are, without wading through a lot of extraneous details. I don’t know if I’ve succeeded in that here, but this is my attempt.</p>
<h1 id="definitions"><a class="header-anchor" href="#definitions">Definitions</a></h1>
<h2 id="beginner"><a class="header-anchor" href="#beginner">Beginner</a></h2>
<p>The full definition of probability is below, but to avoid overwhelm, you may first look at this <em>attempt</em> at defining probability. Many people intuitively think of probability this way. Notably, I’ve left out the event space.</p>
<p><strong>Sample set</strong> $\Omega$ is a set of all possible <span class="marginnote-outer"><span class="marginnote-ref">samples</span><label for="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle"> ⊕</label><input type="checkbox" id="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Sample is synonymous with <a href="https://en.wikipedia.org/wiki/Outcome_(probability)">outcome</a>.</span></span></span> $\omega\in\Omega$. A sample is a possible state of the world, e.g. the outcomes for all coins that will be tossed or all dice that will be thrown, or the ordering of cards in a deck.</p>
<p><strong>Probability function</strong> <span class="marginnote-outer"><span class="marginnote-ref">$P : 2^\Omega \to [0, 1]$</span><label for="0574240032f5e0595cbf5977a6d3da50278d4847" class="margin-toggle"> ⊕</label><input type="checkbox" id="0574240032f5e0595cbf5977a6d3da50278d4847" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">$2^\Omega$ is the <a href="https://en.wikipedia.org/wiki/Power_set">power set</a> of $\Omega$. The notation $2^{(\cdot)}$ is just a shorthand, though set exponentiation could be <a href="https://math.stackexchange.com/a/901742">defined in general</a>, e.g. $A^B$ is the set of all functions $f : B \to A$, and $n^A$, where $n$ is a natural number, is the set of all $n$-valued functions <script type="math/tex">f : A \to \{0, 1, 2, \ldots, n-1\}</script>. Then $2^A$ gives us all indicator functions <script type="math/tex">A \to \{0,1\}</script>, which select the elements of every subset of $A$.</span></span></span> gives the probability of a <span class="marginnote-outer"><span class="marginnote-ref">set of samples</span><label for="ad8b3af703fafd99caa4527f4b15b75a1975d039" class="margin-toggle"> ⊕</label><input type="checkbox" id="ad8b3af703fafd99caa4527f4b15b75a1975d039" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">subset of $\Omega$</span></span></span>. A set of samples $\{\omega_1, \omega_2, \ldots\}$ is called an <strong>event</strong>, which is a set of possible states the world could be in, read as “$\omega_1$ is the case or $\omega_2$ is the case, etc. …”</p>
<p>$P$ satisfies:</p>
<ul>
<li><strong>Non-negativity</strong>: <script type="math/tex">P(e) \geq 0,\ \forall e \in 2^\Omega</script>.</li>
<li><strong>Null empty set</strong>: <script type="math/tex">P(\emptyset) = 0</script>.</li>
<li><strong>Unit sample set</strong>: <script type="math/tex">P(\Omega) = 1</script>.</li>
<li><strong>Additivity</strong>: For all disjoint events <script type="math/tex">e_1, e_2 \in 2^\Omega,\ P(e_1 \cup e_2) = P(e_1) + P(e_2)</script></li>
</ul>
<p>The probability of a single sample (outcome) $\omega\in\Omega$ is <script type="math/tex">P(\{\omega\})</script>.</p>
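<p>As a sanity check, here is a minimal Python sketch of this beginner definition (the three-outcome sample set and its PMF are illustrative choices of mine, not from the post): a PMF on a finite sample set induces $P$ on the full power set, and the four properties above can then be verified by brute force.</p>

```python
from itertools import chain, combinations

# Illustrative finite sample set and PMF (a biased three-sided "die").
omega = ["a", "b", "c"]
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}

def powerset(s):
    """All subsets of s, i.e. the event space E = 2^Omega."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def P(event):
    """Probability function: sum the PMF over the samples in the event."""
    return sum(pmf[w] for w in event)

E = powerset(omega)

# Non-negativity and null empty set.
assert all(P(e) >= 0 for e in E)
assert P(frozenset()) == 0
# Unit sample set.
assert abs(P(frozenset(omega)) - 1) < 1e-12
# Additivity on disjoint events.
for e1 in E:
    for e2 in E:
        if not (e1 & e2):
            assert abs(P(e1 | e2) - (P(e1) + P(e2))) < 1e-12
```

Note that `P` takes a set, never a bare sample, mirroring the point below that <script type="math/tex">P(\{\omega\})</script> rather than $P(\omega)$ is what is defined.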
<h2 id="full-definition"><a class="header-anchor" href="#full-definition">Full Definition</a></h2>
<p>The beginner definition above does not define an event space. This is actually a problem when working with uncountable sample spaces, because not all subsets of an uncountable space can be measured. If that statement confuses you, don’t worry about it and read through this post. Then read my <a href="#primer-to-measure-theory">primer to measure theory</a> at the end which outlines why not every set can be measured. Though this may seem like a minor technicality, specifying what sets can be measured allows probability theory to be <span class="marginnote-outer"><span class="marginnote-ref">a lot more general than it otherwise could be,</span><label for="a0359426bd6a552c43a72d1b7b5e568b41b95f0d" class="margin-toggle"> ⊕</label><input type="checkbox" id="a0359426bd6a552c43a72d1b7b5e568b41b95f0d" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">This is Kolmogorov’s achievement. Definitions of probability like my beginner definition had been around for hundreds of years prior.</span></span></span> specifically when dealing with real numbers.</p>
<p>Here is a compact but complete definition of probability:</p>
<ul>
<li><strong>Sample set</strong> $\Omega$ is a set of all possible <span class="marginnote-outer"><span class="marginnote-ref">samples</span><label for="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle"> ⊕</label><input type="checkbox" id="46f0a5784f51b77c385f44317a48bc352dcfb439" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Sample is synonymous with <a href="https://en.wikipedia.org/wiki/Outcome_(probability)">outcome</a>.</span></span></span>.
<ul>
<li><strong>Sample</strong> $\omega \in \Omega$ (i.e. primitive outcome) is a possible state of the world. Samples are disjoint, meaning only one sample can be the case at a time. Samples can be any kind of mathematical object.</li>
</ul>
</li>
<li><strong>Event space</strong> $E \subseteq 2^\Omega$ is the set of subsets of $\Omega$ for which we are <span class="marginnote-outer"><span class="marginnote-ref">allowed to assign probability.</span><label for="f8b43cebd8d878851c96f6b83253241ae0914c18" class="margin-toggle"> ⊕</label><input type="checkbox" id="f8b43cebd8d878851c96f6b83253241ae0914c18" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The sets omitted from $E$ are not measurable, but again if you are not familiar with measure theory don’t worry about why some sets cannot be measured until the end of this post.</span></span></span> We require that $\emptyset, \Omega \in E$ <span class="advanced outer hidden"><span class="advanced inner hidden">and $E$ is required to be a <a href="#sigma-algebra">$\sigma$-algebra</a> that contains the measurable subsets of $\Omega$. The tuple $(\Omega, E)$ is a <a href="https://en.wikipedia.org/wiki/Measurable_space">measurable space</a>.</span></span>
<ul>
<li><strong>Event</strong> $e \in E$ is a <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> set of samples. Samples $\omega \in e$ are <span class="marginnote-outer"><span class="marginnote-ref">considered identical</span><label for="b69039dc8767a7b19268d3134cd1eccd7a91f7de" class="margin-toggle"> ⊕</label><input type="checkbox" id="b69039dc8767a7b19268d3134cd1eccd7a91f7de" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Different samples in $\Omega$ are indeed distinct objects, but their difference does not matter in the context of event $e$.</span></span></span> w.r.t. $e$.</li>
</ul>
</li>
<li><strong>Probability measure</strong> <span class="marginnote-outer"><span class="marginnote-ref">$P : E \to [0, 1]$</span><label for="4ad9ab16708369ac059ab042be9927eff00636f9" class="margin-toggle"> ⊕</label><input type="checkbox" id="4ad9ab16708369ac059ab042be9927eff00636f9" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In general a measure is a function $Q : E \to \real_{\geq 0}$, but I’m including the restriction of the co-domain to the unit interval $[0, 1]$ in the definition of <script type="math/tex">P</script>, because we are only talking about probability measures here, and there’s no reason to be more general.</span></span></span> is a function that maps allowed subsets of $\Omega$ to the real unit interval. $P$ is a <strong>measure</strong>, which means it satisfies certain properties that make it behave analogously to length, area, volume, etc. in Euclidean space. Essentially, a measure is a generalization of size that satisfies the following properties:
<ul>
<li><span class="advanced outer hidden"><span class="advanced inner hidden"><strong>Measurable domain</strong>: $E$ is a $\sigma$-algebra of measurable sets.</span></span></li>
<li><strong>Non-negativity</strong>: <script type="math/tex">P(e) \geq 0,\ \forall e \in E</script>.</li>
<li><strong>Null empty set</strong>: <script type="math/tex">P(\emptyset) = 0</script>.</li>
<li><strong>Unit sample set</strong>: <script type="math/tex">P(\Omega) = 1</script>.</li>
<li><strong>Countable additivity</strong>: For any countable set of events <span class="marginnote-outer"><span class="marginnote-ref"><script type="math/tex">A \subseteq E</script></span><label for="1c4c2f199fe96ef43d1f0b6cc492af4deb8af6af" class="margin-toggle"> ⊕</label><input type="checkbox" id="1c4c2f199fe96ef43d1f0b6cc492af4deb8af6af" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Remember that $E$ is the event space, and $A$ is a set of events.</span></span></span> whose events are pairwise disjoint, <script type="math/tex">P(\bigcup A) = \sum_{e\in A} P(e)</script>.</li>
</ul>
</li>
</ul>
<p>The triple $(\Omega, E, P)$ defines a <a href="https://en.wikipedia.org/wiki/Probability_space">probability space</a> <span class="advanced outer hidden"><span class="advanced inner hidden">which is also a <a href="https://en.wikipedia.org/wiki/Measure_space">measure space</a>.</span></span> These three objects are all we need to do probability calculations.</p>
<h2 id="kolmogorov-axioms-of-probability"><a class="header-anchor" href="#kolmogorov-axioms-of-probability">Kolmogorov axioms of probability</a></h2>
<p>You may have heard of the <a href="https://en.wikipedia.org/wiki/Probability_axioms#Axioms">Kolmogorov axioms of probability</a>. Kolmogorov formalized probability as a special case of measure theory. Essentially a probability measure is a normalized measure, i.e. assigns 1 to the entire sample space $\Omega$. Above, I’ve merged the axioms of measure theory with Kolmogorov’s axioms. For reference, here are Kolmogorov’s axioms given separately:</p>
<ol>
<li>$P(e) \in [0, 1], \forall e \in E$, where $[0, 1] \subset \real$.</li>
<li>$P(\Omega) = 1$, i.e. probability of anything happening is 1.</li>
<li><span class="advanced outer hidden"><span class="advanced inner hidden"><a href="https://en.wikipedia.org/wiki/Sigma_additivity">$\sigma$-additivity</a> on $E$.</span></span></li>
</ol>
<p><span class="advanced outer hidden"><span class="advanced inner hidden">Given the axioms of measure theory, we can define probability succinctly by simply stating that $(\Omega, E, P)$ is a measure space where $P(\Omega) = 1$ (see <a href="https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf">Terence Tao’s Introduction to Measure Theory</a>).</span></span></p>
<h2 id="examples"><a class="header-anchor" href="#examples">Examples</a></h2>
<p><strong>Finite</strong>: Dice rolls</p>
<p><script type="math/tex">\Omega = \{⚀,⚁,⚂,⚃,⚄,⚅\}</script>,<br />
<script type="math/tex">E=2^\Omega</script>,<br />
<script type="math/tex">P(\{⚀\})= P(\{⚁\}) = \ldots = P(\{⚅\}) =1/6</script>.</p>
<p>Note that <script type="math/tex">P(⚀)</script> is not defined. $P$ measures the “size” of sets. <script type="math/tex">\{⚀\}</script> is the set containing one sample. We can also compute the probability of larger sets, e.g.<br />
<script type="math/tex">P(\{⚀,⚅\}) = 1/3</script>,<br />
<script type="math/tex">P(\{⚁,⚃,⚅\}) = 1/2</script>,<br />
<script type="math/tex">P(\{⚀,⚁,⚂,⚃,⚄,⚅\}) = 1</script>.</p>
<p><strong>Countable</strong> (event set): Variable length binary sequences</p>
<p><script type="math/tex">\bin = \{0, 1\}</script> is the binary alphabet.<br />
Let $x \in \bin^n$ be a binary sequence of any length $n$, and <script type="math/tex">\len{x} := n</script> returns the length of $x$.</p>
<p>The sample set is all infinite binary sequences, <script type="math/tex">\Omega = \mathbb{B}^\infty</script>.<br />
This lets us make an event for each finite-length sequence $x$.<br />
Let <script type="math/tex">\Gamma_x = \left\{\omega \in \Omega \bigmid x = \omega_{1:\len{x}}\right\}</script>, where <script type="math/tex">\omega_{1:\len{x}}</script> is the length $\len{x}$ prefix of $\omega$.<br />
The event set is <script type="math/tex">E=\left\{\Gamma_x \bigmid x \in \mathbb{B}^n, n \in \mathbb{N}\cup\{0\}\right\}</script>.</p>
<p>Then <script type="math/tex">P(\Gamma_x)</script> is the probability of $x$, and <script type="math/tex">P(\Gamma_{x_1} \cup \Gamma_{x_2} \cup \ldots)</script> is the probability of the set <script type="math/tex">\{x_1, x_2, \ldots\}</script>.<br />
Note that the probability of a finite sequence is always a marginal probability, in the sense that <script type="math/tex">P(\Gamma_x) = P(\Gamma_{x`0}) + P(\Gamma_{x`1})</script> where <script type="math/tex">x`0</script> and <script type="math/tex">x`1</script> are the concatenations of <script type="math/tex">x</script> with 0 or 1.</p>
<p>An example of such a measure is the uniform measure, <script type="math/tex">P(\Gamma_x) = 2^{-\len{x}}</script>.</p>
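<p>The marginalization identity and the uniform measure can be checked directly in code. A minimal sketch (mine, not from the post), representing each cylinder event $\Gamma_x$ by its finite prefix $x$:</p>

```python
# A cylinder set Gamma_x is identified by its finite prefix x (a string of
# "0"/"1" characters), so events can be represented by prefixes alone.
def P(x):
    """Uniform measure on cylinder sets: P(Gamma_x) = 2^(-len(x))."""
    return 2.0 ** -len(x)

# The empty prefix gives Gamma_"" = Omega, the whole sample set.
assert P("") == 1.0

# Marginalization: the event "prefix x" splits into "prefix x0" or "prefix x1".
for x in ["", "0", "1", "0110", "10101"]:
    assert P(x) == P(x + "0") + P(x + "1")
```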
<p><strong>Uncountable</strong>: The reals</p>
<p><script type="math/tex">\Omega=\real</script>,<br />
<script type="math/tex">E \subset 2^\real</script> contains sets of reals formed by countable union, intersection, and complement of the open intervals. <span class="advanced outer hidden"><span class="advanced inner hidden">This particular choice of $E$ is called the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel algebra</a>, and is a standard $\sigma$-algebra for $\real$. The reason we don’t use $E = 2^\real$ as our event space is that some subsets of $\real$ are not measurable.</span></span></p>
<p>We only need to define $P$ on single intervals, and because of additivity of probability we can derive $P$ on every set in $E$. <span class="advanced outer hidden"><span class="advanced inner hidden">A measure $P$ defined on intervals is called a <a href="https://en.wikipedia.org/wiki/Borel_measure#On_the_real_line">Borel measure</a>.</span></span> Let</p>
<script type="math/tex; mode=display">P((a,b]) = \int_a^b \frac{1}{\sqrt{2 \pi }} e^{-\frac{x^2}{2}} \d x\,.</script>
<p>Note that it does not matter if we define $P$ on open intervals, closed intervals, or half-open intervals, because the value of the integral is identical between these cases. <span class="advanced outer hidden"><span class="advanced inner hidden">Specifically, we are performing a Lebesgue integral, which is invariant to removing a measure 0 subset from the integral domain. See the <a href="https://en.wikipedia.org/wiki/Lebesgue_integration#Basic_theorems_of_the_Lebesgue_integral">equality almost-everywhere</a> property.</span></span></p>
<p>In this particular example, $\frac{\d}{\d x} P((c, x])$, for any fixed constant $c$ less than $x$, is the <a href="https://en.wikipedia.org/wiki/Normal_distribution">standard normal</a> (i.e. Gaussian) <a href="https://en.wikipedia.org/wiki/Probability_density_function">probability density function (PDF)</a>. It is common, when working with probability on the reals, to provide a PDF which can be integrated over to derive the <span class="marginnote-outer"><span class="marginnote-ref">probability measure</span><label for="4bf04e5f74a071ce2f80b754d2578481d612a33e" class="margin-toggle"> ⊕</label><input type="checkbox" id="4bf04e5f74a071ce2f80b754d2578481d612a33e" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The output of the probability measure is called <em>probability mass</em>, to distinguish it from the output of the PDF, which is called <em>probability density</em>.</span></span></span>. In other words, a PDF $f(x)$ is a function that when integrated produces a probability measure: $P((a, b]) = \int_a^b f(x) \d x$.</p>
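<p>As a hedged sketch (mine, not from the post), this measure can be computed in closed form with Python’s <code>math.erf</code>, since the standard normal CDF is <script type="math/tex">\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))</script>; additivity over adjacent intervals and total mass 1 can then be checked numerically.</p>

```python
import math

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def P(a, b):
    """P((a, b]) for the standard normal measure."""
    return Phi(b) - Phi(a)

# Additivity on adjacent (disjoint) intervals: (a, c] = (a, b] U (b, c].
assert abs(P(-1.0, 1.0) - (P(-1.0, 0.25) + P(0.25, 1.0))) < 1e-12

# The measure of (-n, n] approaches P(Omega) = 1 as n grows.
assert abs(P(-8.0, 8.0) - 1.0) < 1e-9

# The familiar "68%" rule: P((-1, 1]) is about 0.6827.
assert abs(P(-1.0, 1.0) - 0.6827) < 1e-3
```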
<h2 id="pmfs-and-pdfs-and-measures-oh-my"><a class="header-anchor" href="#pmfs-and-pdfs-and-measures-oh-my">PMFs and PDFs and measures, oh my!</a></h2>
<p>In standard probability textbooks and courses (largely for non-theoreticians), you are told about probability mass functions (PMFs) and probability density functions (PDFs), and their cumulative counterparts: cumulative mass functions (CMFs) and cumulative distribution functions (CDFs). So you may be wondering where these fit into the definition of probability above. I’ve been talking about probability measures, and have only mentioned PDFs in the real-line example above.</p>
<p>For finite and countable sample sets, PMFs, CMFs and measures are equivalent, meaning you can derive one from the others. We can convert between PMF $m : \Omega \to [0,1]$ and measure $P: E \to [0,1]$ with the following relations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
m(\omega) &= P(\{\omega\}) \\
P(e) &= \sum_{\omega\in e} m(\omega)\,.
\end{aligned} %]]></script>
<p>For continuous sample sets with differentiable CDFs <span class="advanced outer hidden"><span class="advanced inner hidden">where $E$ is the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel algebra</a></span></span> (e.g. the reals), PDFs, CDFs and measures are equivalent, meaning you can derive one from the others. We can convert between PDF $f : \Omega \to \real$ and measure $P : E \to [0,1]$ with the following relations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(x) &= \frac{\d}{\d x} P((c, x]) \\
P((a, b]) &= \int_a^b f(x) \d x\,,
\end{aligned} %]]></script>
<p>for some constant $c\in\Omega$ less than $x$ (changing $c$ only shifts $P((c, x])$ by a constant, so the derivative is unaffected).</p>
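<p>Continuing the standard normal example, the PDF/measure relation can be checked numerically: a central finite difference of $P((c, x])$ in $x$ should recover $f(x)$ for any fixed $c$. This sketch (mine, not from the post) again uses <code>math.erf</code> for the normal CDF.</p>

```python
import math

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(x):
    """Standard normal PDF."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

c = -10.0   # fixed left endpoint of the interval (c, x]
h = 1e-6    # finite-difference step

for x in [-1.5, 0.0, 0.7, 2.0]:
    # d/dx P((c, x]) = d/dx (Phi(x) - Phi(c)), approximated centrally.
    deriv = ((Phi(x + h) - Phi(c)) - (Phi(x - h) - Phi(c))) / (2.0 * h)
    assert abs(deriv - f(x)) < 1e-6
```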
<p>The measure-theoretic definition of probability unifies the discrete and continuous cases, and can handle exotic cases, e.g. non-differentiable uncountable sample sets.</p>
<h2 id="events-vs-samples"><a class="header-anchor" href="#events-vs-samples">Events vs samples</a></h2>
<p><strong>Question:</strong> Why provide event space $E$? Isn’t this redundant with $\Omega$?</p>
<p>You may be thinking that given just $\Omega$, we can define $P : 2^\Omega \to [0,1]$ which satisfies the properties of a measure listed earlier, and it is sufficient to define <script type="math/tex">P(\{\omega\})</script> for each $\omega \in \Omega$. That is true for countable $\Omega$ (e.g. the dice example above). The technical reason for basing probability theory on measure theory is that for uncountable $\Omega$, some subsets are not measurable. $E$ tells us which subsets of $\Omega$ are measurable and safe to compute the probability of. Perhaps the real reason is to simplify the definition of probability down to one constraint, $P(\Omega) = 1$. The apparent redundancy of $\Omega$ and $E$ is then inherited from measure theory. This kind of information redundancy <span class="marginnote-outer"><span class="marginnote-ref">in mathematical constructions is quite common</span><label for="fae4f9d36286bf3dcf9d3f2d43145ea41509157a" class="margin-toggle"> ⊕</label><input type="checkbox" id="fae4f9d36286bf3dcf9d3f2d43145ea41509157a" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For example, a group is defined as $(G, +)$ where $G$ is a set of objects and $+ : G \times G \to G$ is some binary operator defined over $G$. The definition of $+$ already includes $G$, so technically providing $G$ is not necessary. A group is defined as a tuple $(G, +)$ to distinguish it from the set $G$ and the operator $+$. Another example is a topological space defined as the tuple $(X, \tau)$ where $X$ is a set of objects and $\tau$ is a set of subsets of $X$ which contains $X$. Since $X = \bigcup \tau$, technically we don’t need to provide $X$, but again we want to distinguish the topological space from $X$ and $\tau$ (where $\tau$ is just called the topology).</span></span></span>, and is merely a particular notational style. Redundancy is not a high cost to pay for notational clarity.</p>
<p><strong>Question:</strong> Why do I care about events containing multiple samples? Only one sample ever happens at a time.</p>
<ol>
<li>We want to be able to calculate the probability of “one or the other thing” happening. Let <script type="math/tex">\omega_1, \omega_2 \in \Omega</script>. <script type="math/tex">\{\omega_1\}, \{\omega_2\} \in E</script> are the events corresponding to exactly one thing happening. <script type="math/tex">\{\omega_1, \omega_2\} \in E</script> is the event corresponding to either <script type="math/tex">\omega_1</script> or <script type="math/tex">\omega_2</script> happening.</li>
<li>We want to be able to calculate the probability of something not happening. Not-<script type="math/tex">\omega_1</script> is the event <script type="math/tex">\{\omega \in \Omega \mid \omega \neq \omega_1\}</script>.</li>
</ol>
<p><strong>Question:</strong> But what about the probability of “one <strong>AND</strong> the other thing” happening?</p>
<p>Samples in <script type="math/tex">\Omega</script> each represent exactly one unique state of the world. To say the world is in state $\omega_i$ AND $\omega_j$ simultaneously is a contradiction, since each state on its own is complete, in the sense that it specifies everything. However, it may be the case that the world-state can be decomposed into two independent parts. Then your sample set is the Cartesian product of sets for each independent sub-state, i.e. <script type="math/tex">\Omega = \Lambda_1 \times \Lambda_2</script> and <script type="math/tex">\omega = (\lambda_1, \lambda_2) \in \Lambda_1 \times \Lambda_2</script>. Thus each sample <script type="math/tex">\omega</script> already represents the “and” of two states if you want it to.</p>
<h1 id="constructing-events"><a class="header-anchor" href="#constructing-events">Constructing events</a></h1>
<p>A primitive event is a <span class="marginnote-outer"><span class="marginnote-ref">singleton set</span><label for="d110d830c28b4b739bdd1217694def459e015af9" class="margin-toggle"> ⊕</label><input type="checkbox" id="d110d830c28b4b739bdd1217694def459e015af9" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The set containing one sample, i.e. <script type="math/tex">e =\{\omega\}</script> where <script type="math/tex">\omega \in \Omega</script></span></span></span>. Events are <span class="marginnote-outer"><span class="marginnote-ref">what get observed, not samples</span><label for="0b2310df56ec7cd2cb22ca9570daf625df84da24" class="margin-toggle"> ⊕</label><input type="checkbox" id="0b2310df56ec7cd2cb22ca9570daf625df84da24" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">See the <a href="#throwing-darts">dart-throwing discussion</a> below for a good reason why this should be the case.</span></span></span>. If an event contains many samples, you don’t know which of them is the case, but only one can be the case since they are disjoint.</p>
<p>Probability theory has specialized notation that revolves around turning the “define my event and measure its probability” process into one concise notational step. Random variables (RV)s are central to this notation. But before introducing random variables, let’s look at how we would construct events and measure their probability without RVs:</p>
<ul>
<li><strong>Construct event:</strong> <script type="math/tex">e = \{\omega \in \Omega \mid \mathrm{condition}(\omega)\}</script>, where <script type="math/tex">\mathrm{condition}(\omega)</script> is some boolean valued proposition on $\omega$.</li>
<li><strong>Measure probability:</strong> $P(e)$. So long as <script type="math/tex">e \in E</script>, then <script type="math/tex">P(e)</script> is defined.</li>
</ul>
<p>Combined we have,</p>
<script type="math/tex; mode=display">P(\{\omega \in \Omega \mid \mathrm{condition}(\omega)\})\,.</script>
<p>For example, if $\Omega = \nat$ and we wanted to compute the probability of getting an even number, then <script type="math/tex">e = \{n \in \nat \mid \mathrm{Remainder}(n/2) = 0\}</script> and <script type="math/tex">P(\{n \in \nat \mid \mathrm{Remainder}(n/2) = 0\})</script> is the probability.</p>
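<p>As a sketch of this construct-then-measure pattern, assume an illustrative geometric measure <script type="math/tex">P(\{n\}) = 2^{-(n+1)}</script> on $\Omega = \nat$ (my choice of measure, not from the post); the event is built from a boolean condition and measured by summing the (truncated) series.</p>

```python
# Illustrative measure on Omega = N: geometric PMF m(n) = 2^(-(n+1)).
def m(n):
    return 2.0 ** -(n + 1)

def P(condition, n_max=200):
    """Measure the event {n in N | condition(n)} by truncating the series."""
    return sum(m(n) for n in range(n_max) if condition(n))

# Construct the event of even outcomes and measure it.
p_even = P(lambda n: n % 2 == 0)
# Geometric series: sum over even n of 2^(-(n+1)) = (1/2)/(1 - 1/4) = 2/3.
assert abs(p_even - 2.0 / 3.0) < 1e-12

# Complement: P(odd) = 1 - P(even).
assert abs(P(lambda n: n % 2 == 1) - 1.0 / 3.0) < 1e-12
```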
<h2 id="random-variables"><a class="header-anchor" href="#random-variables">Random variables</a></h2>
<p>Random variables are devices for constructing events. That is their purpose. Contrary to their name, there is <span class="marginnote-outer"><span class="marginnote-ref">nothing random about them.</span><label for="250337d20f972049cc956351c6be818ab040f059" class="margin-toggle"> ⊕</label><input type="checkbox" id="250337d20f972049cc956351c6be818ab040f059" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">A random variable is a deterministic function. The word <em><strong>random</strong></em> is due to it being a function of samples which are randomly chosen.</span></span></span></p>
<p>A random variable is a <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> function <script type="math/tex">X : \Omega \to F</script>, <span class="advanced outer hidden"><span class="advanced inner hidden">where $(F, \mathcal{F})$ is a <a href="https://en.wikipedia.org/wiki/Measurable_space">measurable space</a> with $\sigma$-algebra $\mathcal{F}$ (specifies measurable subsets of $F$),</span></span> and the elements of $F$ can be any type of object.</p>
<p>There are three main motivations for the random variable formalism…</p>
<h3 id="motivation-1-information-hiding"><a class="header-anchor" href="#motivation-1-information-hiding">Motivation 1: Information hiding</a></h3>
<p>I briefly mentioned <a href="#events-vs-samples">above</a> that samples (world state) can be treated as containing sub-samples (sub-state), e.g. $\omega = (\lambda_1, \lambda_2) \in \Lambda_1 \times \Lambda_2 = \Omega$. Random variables are convenient for dealing with just one sub-sample in isolation, and they allow you to avoid committing to a particular way to divide up $\omega$, e.g. $\omega = (\lambda_1, \lambda_2) = (\kappa_1, \kappa_2, \kappa_3)$ might be two different and incompatible but semantically meaningful ways to divide sample $\omega$ into sub-samples.</p>
<p>A random variable $X : \Omega \to F$ <em>hides information</em> contained in $\omega \in \Omega$ by appropriate choice of $F$. E.g. let $\Omega = \Lambda_1 \times \Lambda_2$ and let <script type="math/tex">X_1 : \Omega \to \Lambda_1 : (\lambda_1, \lambda_2) \mapsto \lambda_1</script> and <script type="math/tex">X_2 : \Omega \to \Lambda_2 : (\lambda_1, \lambda_2) \mapsto \lambda_2</script> be two random variables. $X_1(\Omega) = \Lambda_1$ and $X_2(\Omega)=\Lambda_2$ are smaller sample spaces than $\Omega$, each of which hides sub-samples.</p>
<p>When multiple random variables are invoked in the same context, they are assumed to be <span class="marginnote-outer"><span class="marginnote-ref">over the same sample space $\Omega$.</span><label for="2d1ff964981aff5411e5b2e1dc946fe1bd3dfccd" class="margin-toggle"> ⊕</label><input type="checkbox" id="2d1ff964981aff5411e5b2e1dc946fe1bd3dfccd" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For RVs $X_1, X_2, \ldots$ it is assumed there is a joint probability distribution <script type="math/tex">P_{X_1, X_2, \ldots}</script>. See the definition of joint distribution <a href="#probability-distribution-of-a-random-variable">below</a>.</span></span></span></p>
<h4 id="examples-1"><a class="header-anchor" href="#examples-1">Examples</a></h4>
<p><strong>Toss two coins</strong></p>
<p><script type="math/tex">\Omega = \Lambda_1 \times \Lambda_2</script>. <script type="math/tex">(\lambda_1, \lambda_2) \in \Omega</script>. <script type="math/tex">\Lambda_1 = \Lambda_2 = \{H, T\}</script>. <br />
Define <script type="math/tex">X_1 : (\lambda_1, \lambda_2) \mapsto \lambda_1</script> and <script type="math/tex">X_2 : (\lambda_1, \lambda_2) \mapsto \lambda_2</script>.<br />
<script type="math/tex">X_1</script> isolates the state of the first coin. <script type="math/tex">X_2</script> isolates the state of the second coin.<br />
$P(X_1=H) = P(\{\omega \in \Omega \mid X_1(\omega) = H\}) = P(\{(H,H), (H,T)\})$</p>
<p><strong>Toss two dice</strong></p>
<p><script type="math/tex">\Omega = \Lambda_1 \times \Lambda_2</script>. <script type="math/tex">(\lambda_1, \lambda_2) \in \Omega</script>. <script type="math/tex">\Lambda_1 = \Lambda_2 = \{1,2,3,4,5,6\}</script>. <br />
Define <script type="math/tex">S : (\lambda_1, \lambda_2) \mapsto \lambda_1 + \lambda_2</script>.<br />
<script type="math/tex">S</script> returns the sum of the two die outcomes. <br />
The image of <script type="math/tex">S</script> is <script type="math/tex">\{2, 3, \ldots, 11, 12\}</script>.<br />
<script type="math/tex">P(S=4) = P(\{\omega \in \Omega \mid S(\omega) = 4\}) = P(\{(1,3), (2,2), (3, 1)\})</script></p>
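<p>Both examples are small enough to check by brute enumeration. A Python sketch (the code is mine) for the two-dice case; the coin case is analogous:</p>

```python
from fractions import Fraction
from itertools import product

# Enumerate Omega = {1..6} x {1..6} and measure events uniformly.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

S = lambda w: w[0] + w[1]                 # random variable: sum of the dice

# P(S = 4) = P({w in Omega | S(w) = 4})
event = {w for w in omega if S(w) == 4}
print(sorted(event))                      # -> [(1, 3), (2, 2), (3, 1)]
print(P(event))                           # -> 1/12
```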
<p><strong>In the general case…</strong></p>
<p>we might want to represent any number of interacting observables and components in a system. How about modeling the weather or the stock market? Your primitive sample space might be astronomical, but you can identify all sorts of observables like the prices of AAPL and GOOG at time <script type="math/tex">t</script> or the temperatures of Florida and Vermont on Tuesday, which would be convenient to deal with separately. At the same time, you don’t want to lose the rich information about how one particular observable interacts with all the others. We would like to be able to ignore partial information contained in primitive samples (i.e. <a href="#probability-distribution-of-a-random-variable">marginalize</a>).</p>
<h3 id="motivation-2-syntactic-sugar"><a class="header-anchor" href="#motivation-2-syntactic-sugar">Motivation 2: Syntactic sugar</a></h3>
<p>We’ve seen how events can be constructed with set builder notation, i.e. <script type="math/tex">e = \{\omega \in \Omega \mid \mathrm{condition}(\omega)\}</script>, and we’ve seen how a random variable $X : \Omega \to F$ can be used to build events, e.g. <script type="math/tex">e = \{\omega \in \Omega \mid X(\omega) = f\}</script> where $f \in F$ is some object.</p>
<p>There is a shorthand notation for writing <script type="math/tex">P(\{\omega \in \Omega \mid X(\omega) = f\})</script>, which is</p>
<script type="math/tex; mode=display">P(X=f)\,.</script>
<p>The general case of this notation is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& P(\mathrm{condition}(X_1, X_2, \ldots)) \\
& \quad = P(\{\omega \in \Omega : \mathrm{condition}(X_1(\omega), X_2(\omega), \ldots)\})\,,
\end{align} %]]></script>
<p>where $X_1 : \Omega \to F_1,\ \ X_2 : \Omega \to F_2, \ \ \ldots$ are random variables, and <script type="math/tex">\mathrm{condition}(f_1, f_2, \ldots)</script> is some boolean function of inputs <script type="math/tex">f_1 \in F_1, f_2 \in F_2, \ldots</script> <span class="advanced outer hidden"><span class="advanced inner hidden">with measurable spaces $(F_1, \mathcal{F}_1), (F_2, \mathcal{F}_2), \ldots$</span></span></p>
<p><strong>Examples:</strong></p>
<ul>
<li><script type="math/tex">P(X = Y) = P(\{\omega \in \Omega \mid X(\omega) = Y(\omega)\})</script>, where <script type="math/tex">Y : \Omega \to F</script> is a random variable.</li>
<li><script type="math/tex">P(X=f, Y=g) = P(\{\omega \in \Omega \mid X(\omega)=f, Y(\omega)=g\})</script> where <script type="math/tex">Y:\Omega \to G</script> and <script type="math/tex">g \in G</script>.</li>
<li><script type="math/tex">P(X \in A) = P(\{\omega \in \Omega \mid X(\omega) \in A\})</script>, for <script type="math/tex">A \subseteq F</script> <span class="advanced outer hidden"><span class="advanced inner hidden">(and $A \in \mathcal{F}$ is measurable).</span></span></li>
<li>$P(X > f) = P(\{\omega \in \Omega \mid X(\omega) > f\})$.</li>
<li>$P(X > Y) = P(\{\omega \in \Omega \mid X(\omega) > Y(\omega)\})$.</li>
<li>Arbitrary algebraic expressions of random variables, e.g. <script type="math/tex">P(c_0 + c_1 X + c_2 X^2 + c_3 X^3 + \ldots = k) = P(\{\omega \in \Omega \mid c_0 + c_1 X(\omega) + c_2 X(\omega)^2 + c_3 X(\omega)^3 + \ldots = k\})</script> or <script type="math/tex">P(\exp(X) = \log(Y)) = P(\{\omega \in \Omega \mid \exp(X(\omega)) = \log(Y(\omega))\})</script>.</li>
</ul>
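<p>This sugar can be implemented literally: a condition on random variables becomes the event of all $\omega$ satisfying it. A Python sketch (entirely mine; <code>Pr</code> is not standard notation) that performs the desugaring:</p>

```python
from fractions import Fraction
from itertools import product

# Two dice; X is the first die, Y the second.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

X = lambda w: w[0]
Y = lambda w: w[1]

def Pr(condition, *rvs):
    """Desugar P(condition(X1, X2, ...)) into the measure of
    {w in Omega | condition(X1(w), X2(w), ...)}."""
    return P({w for w in omega if condition(*(rv(w) for rv in rvs))})

print(Pr(lambda x, y: x == y, X, Y))      # P(X = Y) -> 1/6
print(Pr(lambda x, y: x > y, X, Y))       # P(X > Y) -> 5/12
```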
<p>A standard notational convention is that calling a function on a random variable generates a new random variable, i.e. <script type="math/tex">h(X) = h \circ X</script>, so that <script type="math/tex">P(h(X) = c)</script> can be parsed either as <script type="math/tex">P(Y = c)</script> where random variable <script type="math/tex">Y = h \circ X</script>, or as <script type="math/tex">P(\mathrm{condition}(X))</script> where <script type="math/tex">\mathrm{condition}(x)</script> is the expression <script type="math/tex">h(x) = c</script>.</p>
<h4 id="probability-distribution-of-a-random-variable"><a class="header-anchor" href="#probability-distribution-of-a-random-variable">Probability distribution of a random variable</a></h4>
<p>Any random variable $X : \Omega \to F$ <span class="advanced outer hidden"><span class="advanced inner hidden">to measurable space $(F, \mathcal{F})$</span></span> induces a unique probability measure with $F$ as the sample set, rather than $\Omega$. We call it the <strong>marginal distribution</strong> w.r.t. $X$, defined as <script type="math/tex">P_X: F \to [0, 1]</script>:</p>
<script type="math/tex; mode=display">P_X(A) := P(X \in A) = P(\{\omega \in \Omega \mid X(\omega) \in A\})\,,</script>
<p>for <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> $A \subseteq F$. Thus $(F, \mathcal{F}, P_X)$ is the probability space for the marginal distribution of $X$. Note that <script type="math/tex">P(X=f) = P_X(\{f\})</script>, <script type="math/tex">% <![CDATA[
P(X < f) = P_X(\{f' \in F \mid f' < f\}) %]]></script>, etc.</p>
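<p>As a sanity check on the pushforward definition $P_X(A) := P(\{\omega \in \Omega \mid X(\omega) \in A\})$, here is a Python sketch (mine) of the marginal distribution of the dice-sum RV $S$:</p>

```python
from fractions import Fraction
from itertools import product

# Two dice; S pushes the uniform measure on Omega forward to {2,...,12}.
omega = list(product(range(1, 7), repeat=2))
S = lambda w: w[0] + w[1]

def P_S(A):
    """Marginal distribution of S: P_S(A) = P({w | S(w) in A})."""
    return Fraction(sum(1 for w in omega if S(w) in A), len(omega))

print(P_S({4}))                           # P(S = 4)  -> 1/12
print(P_S({2, 3, 4}))                     # P(S <= 4) -> 1/6
```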
<p>We often have more than one random variable of interest. With $X$ defined above and $Y : \Omega \to G$ <span class="advanced outer hidden"><span class="advanced inner hidden">to measurable space $(G, \mathcal{G})$</span></span>, we have the marginal distributions $P_X$ and $P_Y$, and also the <strong>joint distribution</strong> w.r.t. $X$ and $Y$, defined as $P_{X,Y} : F \times G \to [0, 1]$:</p>
<script type="math/tex; mode=display">P_{X,Y}(A, B) := P(X \in A \wedge Y \in B) = P(\{\omega \in \Omega \mid X(\omega) \in A \wedge Y(\omega) \in B\})</script>
<p>for <span class="advanced outer hidden"><span class="advanced inner hidden">measurable</span></span> $A \subseteq F, B \subseteq G$. Thus <span class="marginnote-outer"><span class="marginnote-ref">$(F \times G, \mathcal{F} \otimes \mathcal{G}, P_{X,Y})$</span><label for="0a155670fb530d756f4f07e993143bde44206b9e" class="margin-toggle"> ⊕</label><input type="checkbox" id="0a155670fb530d756f4f07e993143bde44206b9e" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">$\mathcal{F} \otimes \mathcal{G}$ is the $\sigma$-algebra generated by the rectangles $\{A \times B \mid A \in \mathcal{F}, B \in \mathcal{G}\}$.</span></span></span> is the probability space for the joint distribution of $X$ and $Y$.</p>
<p>In general, for RVs $X_1 : \Omega \to F_1,\ \ X_2 : \Omega \to F_2,\ \ \ldots$, we have the joint distribution $P_{X_1,X_2,\ldots} : F_1 \times F_2 \times \ldots \to [0, 1]$:</p>
<script type="math/tex; mode=display">P_{X_1,X_2,\ldots}(A_1, A_2, \ldots) := P(X_1 \in A_1 \wedge X_2 \in A_2 \wedge \ldots)\,.</script>
<p>A joint distribution may also be a marginal distribution. For example, if I have RVs $X_1, \ldots, X_{10}$, then $P_{X_3,X_5,X_7}$ is both the joint distribution of $X_3, X_5, X_7$ and a marginal of the full joint $P_{X_1,\ldots,X_{10}}$.</p>
<p>RVs in a joint distribution need not be created from cartesian products of sample sets, i.e. the output of one RV may partially determine the output of another. Taking the two dice example, the sample space is <script type="math/tex">\Omega = \{1, \ldots, 6\} \times \{1, \ldots, 6\}</script>. The random variable for the outcome of die 1 is $D_1 : (n, m) \mapsto n$, and the random variable for the sum of dice is $S : (n, m) \mapsto n + m$. Choosing $\omega \in \Omega$ to determine $D_1$ may also constrain $S$, and vice versa. If I want $S(\omega) = 2$ then $\omega = (1, 1)$ and $D_1(\omega) = 1$ is fully determined. Likewise if we choose $\omega$ so that $D_1(\omega) = 6$ then the possible values of $S(\omega)$ are restricted to $7, 8, 9, 10, 11, 12$. Nevertheless, $P_{D_1, S}$ is a perfectly fine joint distribution.</p>
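<p>This dependence is easy to check numerically. A Python sketch (mine): the joint $P_{D_1,S}$ assigns probability 0 to the combination $D_1 = 6, S = 2$, while the product of the marginals does not, so $D_1$ and $S$ are not independent:</p>

```python
from fractions import Fraction
from itertools import product

# Two dice; D1 is the first die, S the sum. They share information.
omega = list(product(range(1, 7), repeat=2))
D1 = lambda w: w[0]
S = lambda w: w[0] + w[1]

def P(event):
    return Fraction(len(event), len(omega))

# Joint probability P(D1 = 6 and S = 2): impossible, hence 0.
joint = P({w for w in omega if D1(w) == 6 and S(w) == 2})
print(joint)       # -> 0

# Product of marginals P(D1 = 6) * P(S = 2): non-zero.
pm = P({w for w in omega if D1(w) == 6}) * P({w for w in omega if S(w) == 2})
print(pm)          # -> 1/216, so D1 and S are not independent
```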
<p>Keeping track of all these probability functions can be confusing, e.g. marginals $P_X$ and $P_Y$ and joint $P_{X,Y}$ are in a sense derived from a single probability function $P$, where $P(X=x)$ and $P(Y=y)$ are equivalent to <script type="math/tex">P_X(\{x\})</script> and <script type="math/tex">P_Y(\{y\})</script>. However, it is possible to have two different underlying probability measures that reuse the same random variables, e.g. $Q : \Omega \to [0, 1]$ with expressions like $Q(X=x)$ and $Q(Y=y)$ being possible, and marginals $Q_X$ and $Q_Y$ and joint $Q_{X,Y}$. Keep in mind that calculations with $P$-related and $Q$-related probability functions do not necessarily have anything to do with each other.</p>
<h4 id="notational-confusion"><a class="header-anchor" href="#notational-confusion">Notational confusion</a></h4>
<p>The language of probability may seem simple enough, but notationally it can be quite cumbersome. When it comes to applications in statistics, machine learning and physics just to name a few, there can be a large quantity of random variables and complicated probability distributions. Authors of academic texts tend to take shortcuts for ease of readability, but they pay the price of ambiguity, which especially hurts readers who are not already familiar with the domain. This is not the fault of authors, but a symptom of clunky notation. I will outline a few common shortcuts and notational difficulties. I hope to write a separate post delving deeper into examples where ambiguity occurs in the wild and how to avoid it.</p>
<p>In texts there is often ambiguity between PMFs, PDFs, and measures, and between samples, events, and random variables.</p>
<p>For example, you may see any of $P(X), p(X), P(x)$ or $p(x)$, where it is not made clear whether $P$ or $p$ is a measure or a PMF/PDF, and whether $X$ or $x$ is a sample, event, or random variable. There is no universal convention on uppercase vs lowercase. Uppercase $X$ can mean a vector or matrix in many contexts, as can bold $\boldsymbol{X}$. The same ambiguity applies to marginals, e.g. $p_X(x)$ is common.</p>
<p>When there are many random variables to juggle, you may see different ways to denote marginal distributions, e.g. $P(X,Y)$ and $P_{X,Y}$. This becomes important when you want to do algebra with probability, e.g.</p>
<ol>
<li>$P(W) = P(X, Y, Z)/Q(Y,Z)$</li>
<li>$P_W = P_{X, Y,Z}/Q_{Y,Z}$</li>
<li>$P(W=w) = P(X=f(w), Y=g(w), Z=h(w))/Q(Y=g(w),Z=h(w))$</li>
</ol>
<p>The problem with the first case is that it depends on position for variable identity, but the reader expects identity by name, i.e. $P(X, Y, Z)$ is intended to be the same as $P(Y, Z, X)$. The second case fixes this problem because it cleanly separates the meaning of each argument from its value, e.g. $P_{X,Y,Z}(Z,X,Y)$ reads “plug in $Z$ for $X$, $X$ for $Y$, and $Y$ for $Z$.” The last case is equivalent to the second, and much like <a href="https://www.w3schools.com/python/gloss_python_function_keyword_arguments.asp">keyword argument syntax in Python</a>, but with the benefit of being notationally primitive rather than relying on the <em>function factory</em> convention $f_{X_1,X_2,X_3,\ldots}(x_1, x_2, x_3, \ldots) = f(X_1=x_1, X_2=x_2, X_3=x_3,\ldots)$.</p>
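<p>The keyword-argument analogy can be taken literally. A Python sketch (entirely mine, just to illustrate the point) where variable identity is carried by name rather than position, so argument order is irrelevant:</p>

```python
from fractions import Fraction
from itertools import product

# Two dice; random variables are registered by name.
omega = list(product(range(1, 7), repeat=2))
rvs = {"X": lambda w: w[0], "Y": lambda w: w[1], "S": lambda w: sum(w)}

def P(**fixed):
    """P(X=x, Y=y, ...) with variables identified by keyword, not position."""
    hits = [w for w in omega
            if all(rvs[name](w) == v for name, v in fixed.items())]
    return Fraction(len(hits), len(omega))

# Argument order carries no meaning; only the names do:
print(P(X=1, S=4))     # -> 1/36
print(P(S=4, X=1))     # -> 1/36, the same event
```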
<p>It is worth noting that probability notation can be used correctly without too much trouble. Theoretical statistics and mathematics texts tend to have good examples of correct usage.</p>
<h3 id="motivation-3-construct-events-that-are-guaranteed-measurable"><a class="header-anchor" href="#motivation-3-construct-events-that-are-guaranteed-measurable">Motivation 3: Construct events that are guaranteed measurable</a></h3>
<p>Using random variable $X : \Omega \to F$ inside set-builder notation will guarantee that the result is an event, i.e. an element of $E$. For example, <script type="math/tex">\{\omega \in \Omega \mid X(\omega) \in A\} \in E</script> as long as $X^{-1}(A) \in E$. We specified in the definition of random variable that it be a <em>measurable</em> function, which is a fancy way of saying that we restrict ourselves to such $A \subseteq F$ where $X^{-1}(A) \in E$ holds.</p>
<p><span class="advanced outer hidden"><span class="advanced inner hidden">
The definition of random variable specifies that the function $X : \Omega \to F$ is <em>measurable</em>. That means for measurable spaces $(\Omega, E)$ and $(F, \mathcal{F})$, it is the case that <span class="marginnote-outer"><span class="marginnote-ref"><script type="math/tex">X^{-1}(A) \in E,\ \forall A \in \mathcal{F}</script>.</span><label for="f5bfb7f5110efa973669d06b6cf8443929676dd1" class="margin-toggle"> ⊕</label><input type="checkbox" id="f5bfb7f5110efa973669d06b6cf8443929676dd1" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Where <script type="math/tex">X^{-1}(A) = \{\omega \in \Omega \mid X(\omega) \in A\}</script> is the pre-image of <script type="math/tex">X</script> on <script type="math/tex">A</script>.</span></span></span> In other words, $X$ <span class="marginnote-outer"><span class="marginnote-ref">never pulls back a measurable subset</span><label for="09f5e06fa264194f74f5ebadcfbdeb1baba486aa" class="margin-toggle"> ⊕</label><input type="checkbox" id="09f5e06fa264194f74f5ebadcfbdeb1baba486aa" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">However, $X$ could map a measurable subset of $\Omega$ to a non-measurable subset of $F$.</span></span></span> of $F$ to a non-measurable subset of $\Omega$. Thus every set of the form <script type="math/tex">\{\omega \in \Omega \mid X(\omega) \in A\} = X^{-1}(A)</script> for measurable <script type="math/tex">A \in \mathcal{F}</script> is guaranteed to be measurable.</span></span></p>
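<p>On a finite sample set, measurability can be checked exhaustively. A Python sketch (the setup is mine): take a $\sigma$-algebra $E$ that only resolves parity, so the parity map is measurable with respect to $E$ but the identity map is not:</p>

```python
# Finite sample set, with an event set E that is a genuine
# sub-sigma-algebra, not the full power set: it only resolves parity.
omega = frozenset(range(1, 7))
E = {frozenset(), frozenset({2, 4, 6}), frozenset({1, 3, 5}), omega}

parity = lambda w: w % 2                      # X : Omega -> {0, 1}
X_inv = lambda A: frozenset(w for w in omega if parity(w) in A)

# Every pre-image of parity lands in E, so parity is measurable w.r.t. E:
print(all(X_inv(A) in E for A in [set(), {0}, {1}, {0, 1}]))   # -> True

# The identity RV is NOT measurable w.r.t. E: the pre-image of {1} is
# the singleton {1}, which E cannot measure.
ident_inv = frozenset(w for w in omega if w in {1})
print(ident_inv in E)                                          # -> False
```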
<p><span class="advanced outer hidden"><span class="advanced inner hidden"><strong>Question:</strong> Are arbitrary expressions of random variables, i.e. $\mathrm{condition}(X_1, X_2, \ldots)$, guaranteed measurable?</span></span></p>
<h1 id="almost-surely"><a class="header-anchor" href="#almost-surely">Almost surely</a></h1>
<p>We know that $P(\emptyset) = 0$. It is possible (and common) to have non-empty events which have probability zero. Since we call $P$ a <em>measure</em> of probability (analogous to the size of a set), we say that a set $e$ where $P(e) = 0$ has measure 0. Such an event is said to occur <strong>almost never</strong>.</p>
<p>We also know that $P(\Omega) = 1$. When a non-empty event has measure 0, its complement is a non-$\Omega$ event with measure 1, by additivity of the probability measure. Such events are said to occur <a href="https://en.wikipedia.org/wiki/Almost_surely"><strong>almost surely</strong></a>.</p>
<p>There is nothing strange about non-empty sets of measure 0. Probability measure is not measuring the number of samples in an event (that would be set cardinality). If $P(e) = 0$, then for any sub-event $e' \subset e$ we have $P(e') = 0$ by additivity and non-negativity of probability measure. So if $\omega \in e$, then <script type="math/tex">P(\{\omega\}) = 0</script>. We could say informally that sample $\omega$ <span class="marginnote-outer"><span class="marginnote-ref">has</span><label for="8b6a113f6e09785a059e28fd0bd407e1177231b2" class="margin-toggle"> ⊕</label><input type="checkbox" id="8b6a113f6e09785a059e28fd0bd407e1177231b2" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">While recognizing that formally samples don’t have probability, and it is the event <script type="math/tex">\{\omega\}</script> which has probability 0.</span></span></span> 0 probability.</p>
<p><strong>Question:</strong> What does <script type="math/tex">P(\{\omega\}) = 0</script> imply about $\omega$? Does it mean that $\omega$ can never be the case, i.e. can never be a state of the world?</p>
<p>This is a question about the interpretation of probability, i.e. how probability theory interfaces with reality, and there is no universally agreed upon answer. The mathematical construction of probability theory is agnostic on the matter.</p>
<p>I think there are two follow up questions that naturally fall out of the original:</p>
<ol>
<li>For what reason would we define a probability measure $P$ such that <script type="math/tex">P(\{\omega\}) = 0</script> for some $\omega \in \Omega$?</li>
<li>If we are told $P$ describes some physical process and <script type="math/tex">P(\{\omega\}) = 0</script>, what will we observe?</li>
</ol>
<p>Naive answers to both are that we may assign measure 0 to events which can never be observed to occur, and if we believe an event has measure 0 then we will never observe it occurring. There are some who will say that nothing is impossible, merely improbable, and all events should be assigned non-zero probability. Clearly “no confirmation ⟹ impossible” is the <span class="marginnote-outer"><span class="marginnote-ref"><a href="https://en.wikipedia.org/wiki/Black_swan_theory">black swan fallacy</a></span><label for="b1edfaa9f6795859151d1b1f2a83d8d9aa8f7daa" class="margin-toggle"> ⊕</label><input type="checkbox" id="b1edfaa9f6795859151d1b1f2a83d8d9aa8f7daa" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Black swans were presumed to not exist by Europeans before the 16th century because only white swans had been observed. “However, in 1697, Dutch explorers led by Willem de Vlamingh became the first Europeans to see black swans, in Western Australia.” The fallacy is that lack of confirmation of something being true does not rule out the possibility that it is true. This fallacy amounts to mistaking ‘I have not found $x$ s.t. $\mathrm{proposition}(x)$’ for ‘$\not\exists x$ s.t. $\mathrm{proposition}(x)$’.</span></span></span>. You cannot know something is impossible by lack of observation, so you should not assign 0 probability because of lack of data. However, something may be logically impossible, or you may know something is impossible via other means.</p>
<p>Question #1 is a special case of the <a href="https://en.wikipedia.org/wiki/Inverse_probability">inverse probability problem</a>, which is the problem of determining the probability measure (distribution) that best describes some physical process (e.g. a game, physical experiment, stock market). Is there a 1-to-1 mapping between physical processes and probability distributions? In other words, is the distribution that best describes a physical process objective and unique, i.e. <span class="marginnote-outer"><span class="marginnote-ref">independently verifiable.</span><label for="2bc94f5ae10cf6a370cb757081fe65bb96d517c0" class="margin-toggle"> ⊕</label><input type="checkbox" id="2bc94f5ae10cf6a370cb757081fe65bb96d517c0" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In the same way that scientific experiments can be reproduced and verified by independent parties. If the reason for selecting measure $P_1$ over measure $P_2$ to describe a physical process is not dogmatic, then that choice should be independently arrived at from first principles by multiple parties.</span></span></span></p>
<p>There is at this time no good answer to the inverse probability problem. Kolmogorov developed his definition of probability to match the mathematical intuitions on probability of his predecessors going back to the <span class="marginnote-outer"><span class="marginnote-ref">17th century.</span><label for="47efecfe1101c6112ef47dad13c7a5d56659bc89" class="margin-toggle"> ⊕</label><input type="checkbox" id="47efecfe1101c6112ef47dad13c7a5d56659bc89" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Famously the <a href="https://en.wikipedia.org/wiki/Problem_of_points">problem of points</a> is an example of early probability calculation.</span></span></span> But what gave rise to this persistent intuition that the whole world should be described with probability, and that probability values should represent randomness and unpredictability? That I do not have an answer to, but I found Ian Hacking’s <a href="https://en.wikipedia.org/wiki/The_Emergence_of_Probability">The Emergence of Probability</a> to give a good account of the historical emergence of probability theory.</p>
<p>Not only is probability theory agnostic on the meaning of 0 probability, it doesn’t actually have anything to say about what it means for an outcome to be likely or unlikely, or expected or unexpected in the colloquial sense, at least not in a non-circular way. If we observe 100 coin tosses all come up heads, I might say it was a fair coin and the tosser just got lucky/unlucky, and you might say the coin tosses were rigged and the probability of this outcome was clearly close to 1. Who’s to say which probabilistic description of the physical setup is correct, unless there is some theory to tell us what probability distributions describe what physical systems, and thus what experiment we could do to see <span class="marginnote-outer"><span class="marginnote-ref">who is correct</span><label for="d98b8bb83b2ec1cfb46d3d878644310d083fb613" class="margin-toggle"> ⊕</label><input type="checkbox" id="d98b8bb83b2ec1cfb46d3d878644310d083fb613" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We do hold a lot of intuitions about this correspondence between the physical realm and probability. For example, symmetries should correspond to equiprobable outcomes. Most people will agree that if the coin were asymmetric in some way that could be a cause for it to come up one way more often. But how much more often? This is where things get fuzzy. In general, how do you determine the precise probability of heads from a model of coin tossing?</span></span></span>? This is out of scope of probability theory. Kolmogorov’s axioms merely ensure that probability is self-consistent within the realm of mathematics.</p>
<p>Kolmogorov himself tried to fix this shortcoming which led to the development of <a href="http://www.scholarpedia.org/article/Algorithmic_information_theory">algorithmic information theory</a>. In <a href="https://www.sciencedirect.com/science/article/pii/S0304397598000759?via%3Dihub">On tables of random numbers</a> he writes:</p>
<blockquote>
<p>… for a long time I had the following views:<br />
(1) The frequency concept based on the notion of limiting frequency as the number of trials increases to infinity, does not contribute anything to substantiate the applicability of the results of probability theory to real practical problems where we have always to deal with a finite number of trials.<br />
(2) The frequency concept applied to a large but finite number of trials does not admit a rigorous formal exposition within the framework of pure mathematics.</p>
</blockquote>
<h2 id="throwing-darts"><a class="header-anchor" href="#throwing-darts">Throwing darts</a></h2>
<p><a href="#examples">Above</a> I gave the reals as an example of a sample set. It is not hard to show that <a href="https://proofwiki.org/wiki/Countable_Sets_Have_Measure_Zero">every countable subset of the reals must have measure 0</a>. This gives rise to the classic conundrum that any particular number sampled from the real line (under, say, a Gaussian pdf) will have 0 probability of occurring. Or <span class="marginnote-outer"><span class="marginnote-ref">more poetically</span><label for="2b21c5bb409b3889130ada4524bd9d8231827510" class="margin-toggle"> ⊕</label><input type="checkbox" id="2b21c5bb409b3889130ada4524bd9d8231827510" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">This is just the same thought experiment but in $\real^2$.</span></span></span>, throw a dart at a dart board, and wherever it lands there is 0 probability of it doing so.</p>
<p>My response is two-fold. In the case of the dart board, since we are invoking a physical process, I argue that there are only finitely many distinguishable places the dart can land, limited by the precision of our measurement apparatus (e.g. a camera). I assert that we can only ever have finite precision on measurements (see my <a href="http://zhat.io/articles/primer-shannon-information#proof-that-mi-is-fininte-for-continuous-distributions">discussion on mutual information</a>). For this reason, event sets for physical processes are functionally finite, even if the sample set is infinite.</p>
<p>Probability theory gives us an elegant way to model a physical process with continuous state while simulating measurements of finite precision. This brings me to the real line example. Assuming we have a probability density function with <a href="https://en.wikipedia.org/wiki/Support_(mathematics)">support everywhere</a>, for both the dart board and real line, the measure of intervals that are not just points will be non-zero, because the density integrates to a positive value over any interval of positive length. So choosing event intervals which correspond to measurement error bounds will produce events with non-zero probability. In short, you are taking the probability of a physical measurement outcome, not a <span class="marginnote-outer"><span class="marginnote-ref">state of the world!</span><label for="ccb568173abda2dfc430f6520595779078f92bbc" class="margin-toggle"> ⊕</label><input type="checkbox" id="ccb568173abda2dfc430f6520595779078f92bbc" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We could say states of the world are not directly accessible, but are only indirectly observable through finite measurement precision.</span></span></span> <span class="marginnote-outer"><span class="marginnote-ref">Singleton events</span><label for="707da6d71c44e8f065de71fff0ac04b56c5c26e2" class="margin-toggle"> ⊕</label><input type="checkbox" id="707da6d71c44e8f065de71fff0ac04b56c5c26e2" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Really any event containing finite or countably many samples in a sense is an infinite precision measurement, and conveys infinite information.</span></span></span> on $\real$ have essentially infinite precision, and you are in a sense <span class="marginnote-outer"><span class="marginnote-ref">“paying for” more precision</span><label for="502417c6d8b295ed75b255687e16dbd4f8d0cea7" class="margin-toggle"> ⊕</label><input type="checkbox" id="502417c6d8b295ed75b255687e16dbd4f8d0cea7" class="margin-toggle" /><span 
class="marginnote"><span class="marginnote-inner">There is a direct connection between precision and information. More precision means more bits. Infinite precision means infinite information, and 0 probability. This is why the <a href="http://zhat.io/articles/primer-shannon-information#shannon-information-for-continuous-distributions">entropy of most distributions on $\real$ is infinite</a>.</span></span></span> in your events with increasingly small probabilities. At the limit, you pay for infinite precision with 0 probability.</p>
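<p>The precision-vs-probability tradeoff is easy to see numerically. A Python sketch (mine), using the standard normal CDF expressed via <code>math.erf</code> to measure ever-narrower intervals around a single point:</p>

```python
import math

# Probability of a measurement interval around 0 under a standard
# Gaussian. Finite-precision events keep non-zero probability; the
# infinite-precision point event is the limit, with probability 0.
def gaussian_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def P_interval(center, half_width):
    """P(center - h < X < center + h) for X ~ N(0, 1)."""
    return gaussian_cdf(center + half_width) - gaussian_cdf(center - half_width)

for h in [1.0, 0.1, 0.01, 0.001]:
    print(f"half-width {h:>6}: P = {P_interval(0.0, h):.6f}")
# Each extra digit of precision (a ~10x narrower interval) costs roughly
# a 10x smaller probability; at infinite precision, P -> 0.
```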
<h2 id="borels-law-of-large-numbers"><a class="header-anchor" href="#borels-law-of-large-numbers">Borel’s law of large numbers</a></h2>
<p>A classical interpretation of probability is that it represents the frequency of occurrence of some event in a repeatable process as the number of repetitions goes to infinity. This is sometimes called the <strong>frequentist</strong> interpretation of probability.</p>
<p><em>Repeatable</em>, in the language of probability theory, means <strong>independently and identically distributed</strong> (i.i.d.). That is, for RVs $X_1, X_2, \ldots$ their marginals are equal, $P_{X_1} = P_{X_2} = \ldots$ (i.e. identical), and their joint distribution is the product of marginals, $P_{X_1, X_2, \ldots}(A_1, A_2, \ldots) = P_{X_1}(A_1)\cdot P_{X_2}(A_2) \cdot \ldots$ (i.e. <a href="https://en.wikipedia.org/wiki/Independence_(probability_theory)">independent</a>).</p>
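<p>For a finite toy case, both i.i.d. conditions can be verified by enumeration. A Python sketch (mine) for the two coordinates of a fair pair of dice: their marginals are identical, and the joint factors into the product of marginals:</p>

```python
from fractions import Fraction
from itertools import product

# Two fair dice; D1 and D2 are the coordinate projections.
omega = list(product(range(1, 7), repeat=2))
D1, D2 = (lambda w: w[0]), (lambda w: w[1])

def P(event):
    return Fraction(len(event), len(omega))

# Identically distributed: the marginals agree on every outcome.
marg1 = {d: P({w for w in omega if D1(w) == d}) for d in range(1, 7)}
marg2 = {d: P({w for w in omega if D2(w) == d}) for d in range(1, 7)}
print(marg1 == marg2)   # -> True

# Independent: the joint factors into the product of marginals.
factorizes = all(
    P({w for w in omega if D1(w) == a and D2(w) == b}) == marg1[a] * marg2[b]
    for a in range(1, 7) for b in range(1, 7)
)
print(factorizes)       # -> True
```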
<p>We have two problems:</p>
<ol>
<li>What does it mean for a physical process to be i.i.d.?</li>
<li>What does it mean to draw from a probability distribution more than once?</li>
</ol>
<p>The first is an open question. E.T. Jaynes in his <a href="https://www.cambridge.org/core/books/probability-theory/9CA08E224FF30123304E6D8935CF1A99">Logic of Science</a> argues that i.i.d. is never a reasonable description of physical systems:</p>
<blockquote>
<p>Such a belief is almost never justified, even for the fairly well-controlled measurements of the physicist or engineer, not only because of unknown systematic error, but because successive measurements lack the logical independence required for these limit theorems to apply.</p>
</blockquote>
<p>Consider two coin tosses. What makes them independent outcomes? We have an intuition that they are not causally connected and therefor they don’t share information, i.e. you cannot predict the outcome of one coin any better given the outcome of the other. There is a sort of paradox at the heart of probability theory, where an event with probability between 0 and 1 necessarily implies lack of understanding of the process behind that event. If you knew completely how a process gives rise to any particular outcome, then you could just <span class="marginnote-outer"><span class="marginnote-ref">model that process without probability</span><label for="292fc0dd509009c0a60ec63bb6c57ba411f69970" class="margin-toggle"> ⊕</label><input type="checkbox" id="292fc0dd509009c0a60ec63bb6c57ba411f69970" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For example, these papers modeling coin tossing:<br />‣ <a href="https://statweb.stanford.edu/~susan/papers/headswithJ.pdf">DYNAMICAL BIAS IN THE COIN TOSS</a><br />‣ <a href="https://arxiv.org/pdf/1008.4559.pdf">Probability, geometry, and dynamics in the toss of a thick coin</a><br />which move the probabilistic component of the model onto the initial conditions.</span></span></span>. So then, any model of the two coins that demonstrates why they do not share information would need to reveal their inner workings, thus going inside the physical black box delineated by probability. To understand why they are independent is to make their outcomes determined from a physicist’s “god-like perspective”, and in a sense non-probabilistic.</p>
<p>Regardless of the physical reality of i.i.d. processes, there is the mathematical question of how to represent i.i.d. repetitions of an experiment. Given $(\Omega, E, P)$ for our experiment and identity RV $X : \omega \mapsto \omega$, we can derive a larger distribution representing $n$ trials by taking the Cartesian product of the sample space $n$ times, i.e. our probability space is $(\Omega_n, E_n, P_n)$ where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Omega_n &:= \underbrace{\Omega \times \Omega \times \ldots \times \Omega}_{n\ \mathrm{times}} \\
E_n &:= \underbrace{E \otimes E \otimes \ldots \otimes E}_{n\ \mathrm{times}} \\
P_n &: (e_1, \ldots, e_n) \mapsto \prod_{i=1}^n P(e_i)\,.
\end{align} %]]></script>
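<p>As a concrete sketch of the product measure, here is a minimal hypothetical fair-coin example (the names are mine, and for simplicity the product probability is applied to outcome tuples rather than events):</p>

```python
from itertools import product

# Hypothetical single-trial distribution over Omega = {"H", "T"}.
P = {"H": 0.5, "T": 0.5}

def P_n(outcomes):
    """Product-measure probability of a tuple of i.i.d. trial outcomes."""
    prob = 1.0
    for omega in outcomes:
        prob *= P[omega]  # independence: probabilities multiply
    return prob

print(P_n(("H", "H", "T")))                           # 0.125
# Sanity check: the probabilities of all tuples in Omega_3 sum to 1.
print(sum(P_n(seq) for seq in product(P, repeat=3)))  # 1.0
```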
<p>Ignoring the mathematical difficulties involved, let’s invoke the sample set over infinite trials, $\Omega_\infty$. Let’s also create a random variable for the outcome of each trial <script type="math/tex">t \in \nat\setminus\{0\}</script> in the infinite series:</p>
<script type="math/tex; mode=display">X_t : \Omega_\infty \to \Omega : (\omega_1, \omega_2, \ldots, \omega_t, \ldots) \mapsto \omega_t\,.</script>
<p>The idea of probability representing the outcome frequency of infinite i.i.d. trials is formally captured by <span class="marginnote-outer"><span class="marginnote-ref"><a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Strong_law">Borel’s law of large numbers (BLLN)</a></span><label for="f8c0e2c3419ca4b7592f9b45eebbe3fc48accdc3" class="margin-toggle"> ⊕</label><input type="checkbox" id="f8c0e2c3419ca4b7592f9b45eebbe3fc48accdc3" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">This is a special case of the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Strong_law">strong law of large numbers</a>. There are a few variants of the law of large numbers (LLN), e.g. <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Weak_law">weak law</a>, but I feel BLLN most straightforwardly expresses the insight I wish to convey.</span></span></span>. Given any single-trial event $e \in E$, we have:</p>
<p><script type="math/tex">P_\infty\left(\left\{\omega_\infty \in \Omega_\infty \bigmid \lim_{n \to \infty} \frac{1}{n} \sum\limits_{i=1}^n 𝟙[X_i(\omega_\infty) \in e] = P(e)\right\}\right) = 1\,,</script><br />
where $𝟙[\mathrm{expr}]$ casts boolean $\mathrm{expr}$ to an integer (1 if true, 0 otherwise). The sum</p>
<script type="math/tex; mode=display">\sum\limits_{i=1}^n 𝟙[X_i(\omega_\infty) \in e]</script>
<p>computes a count: the number of times event $e$ occurs in the first $n$ trials, where $\omega_\infty$ is the infinite sequence of trial samples. Dividing by $n$ gives the frequency, i.e. fraction of times $e$ appears out of the first $n$ trials.</p>
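<p>This frequency-to-probability convergence is easy to watch numerically. A hypothetical fair-coin simulation (not anything from the formal development above):</p>

```python
import random

random.seed(0)  # fix the "sample" omega_infty for reproducibility

def frequency(n, p_heads=0.5):
    """Fraction of the first n fair-coin trials landing in e = {heads},
    i.e. (1/n) times the count of trials with 1[X_i in e] = 1."""
    hits = sum(1 for _ in range(n) if random.random() < p_heads)
    return hits / n

for n in (100, 10_000, 1_000_000):
    print(n, frequency(n))  # approaches P(e) = 0.5 as n grows
```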
<p>Borel’s law of large numbers (BLLN) can then be written more concisely using our fun RV notation:</p>
<script type="math/tex; mode=display">P_\infty\left(\lim_{n \to \infty} \frac{1}{n} \sum\limits_{i=1}^n 𝟙[X_i \in e] = P(e)\right) = 1\,,</script>
<p>or using <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables#Almost_sure_convergence">almost sure convergence notation</a>:</p>
<script type="math/tex; mode=display">\frac{1}{n} \sum\limits_{i=1}^n 𝟙[X_i \in e] \overset{\mathrm{a.s.}}{\longrightarrow} P(e)\,,</script>
<p>though the latter does not make it clear that $P_\infty$ is our measure.</p>
<p>This equation is very intriguing, as it directly relates samples from $P_\infty$ to measure $P$. In short, BLLN states that there is a measure 1 set of infinite sequences of i.i.d. trials s.t. the limiting number of occurrences of event $e \in E$ as a fraction of the total number of trials is exactly $P(e)$. The implication is that <span class="marginnote-outer"><span class="marginnote-ref">almost surely</span><label for="808c2858de3d6d2f40c660f42492fb70f8369082" class="margin-toggle"> ⊕</label><input type="checkbox" id="808c2858de3d6d2f40c660f42492fb70f8369082" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">For a measure 1 subset of samples in $\Omega_\infty$, of which each sample is itself an infinite sequence of single-trial samples.</span></span></span> we can infer $P$ from <span class="marginnote-outer"><span class="marginnote-ref">just one sample</span><label for="298b90b7dd6823ffb37aa5cdbc6eeb109ba14086" class="margin-toggle"> ⊕</label><input type="checkbox" id="298b90b7dd6823ffb37aa5cdbc6eeb109ba14086" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Technically the singleton event containing just that sample.</span></span></span> of an infinite sequence of trials, thus apparently solving the inverse probability problem (almost surely) for the i.i.d. case.</p>
<p>As I mentioned earlier, countable events of real numbers are always measure 0 (<a href="https://proofwiki.org/wiki/Countable_Sets_Have_Measure_Zero">proof</a>) for probability measures defined on the reals. Sample set $\Omega_\infty$ has the cardinality of $\real$, and there is a <span class="marginnote-outer"><span class="marginnote-ref">natural bijection to the unit interval</span><label for="3ab93fbd33cfedf7956143fb287aa3dbb5c5101c" class="margin-toggle"> ⊕</label><input type="checkbox" id="3ab93fbd33cfedf7956143fb287aa3dbb5c5101c" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">If the sample space $\Omega$ of each trial is finite, we can think of a sequence $(\omega_1, \omega_2, \ldots)$ as the decimal expansion of a number between 0 and 1 in base $\abs{\Omega}$.</span></span></span>. Therefore there are potentially infinitely many (indeed, countably many) events in $\Omega_\infty$ for which BLLN does not hold. As before we may ask a similar question: can these BLLN-violating events happen?</p>
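<p>The bijection from the marginnote is easy to sketch on a finite prefix of a sequence (ignoring the measure-0 wrinkle that some numbers have two expansions):</p>

```python
def to_unit_interval(digits, base=2):
    """Map a (truncated) sequence of trial outcomes, encoded as digits in
    {0, ..., base-1}, to a point in [0, 1] via its base-`base` expansion."""
    x = 0.0
    for i, digit in enumerate(digits, start=1):
        x += digit / base ** i  # i-th digit contributes digit / base^i
    return x

print(to_unit_interval([1, 0, 1]))          # 0.625 = 1/2 + 0/4 + 1/8
print(to_unit_interval([0, 2], base=3))     # 2/9, for a three-outcome trial
```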
<p>Let’s step back and ask, what is so special about the BLLN anyway? Why should samples satisfy it? In fact, for any particular sample $\omega_\infty$, I can construct a measure 1 set <script type="math/tex">\Omega_\infty \setminus \{\omega_\infty\}</script> which does not contain it, simply because the singleton set <script type="math/tex">\{\omega_\infty\}</script> has measure 0. Thus it seems that for any sample, there is a “law” which states that it <em>almost surely</em> does not occur. In essence, all samples are special, or none are.</p>
<p>Ming Li and Paul Vitányi in their <a href="https://link.springer.com/book/10.1007%2F978-3-030-11298-1">An Introduction to Kolmogorov Complexity and Its Applications</a> summarize this conundrum quite well:</p>
<blockquote>
<p>We call a sequence ‘random’ if it is ‘typical.’ It is not ‘typical,’ say ‘special,’ if it has a particular distinguishing property. An example of such a property is that an infinite sequence contains only finitely many ones. There are infinitely many such sequences. But the probability that such a sequence occurs as the outcome of fair coin tosses is zero. ‘Typical’ infinite sequences will have the converse property, namely, they contain infinitely many ones.</p>
</blockquote>
<blockquote>
<p>In fact, one would like to say that ‘typical’ infinite sequences will have all converse properties of the properties that can be enjoyed by ‘special’ infinite sequences. This is formalized as follows: If a particular property, such as containing infinitely many occurrences of ones (or zeros), the law of large numbers, or the law of the iterated logarithm, has been shown to have probability one, then one calls this a law of randomness. A sequence is ‘typical,’ or ‘random,’ if it satisfies all laws of randomness.</p>
</blockquote>
<blockquote>
<p>But now we are in trouble. Since all complements of singleton sets in the sample space have probability one, it follows that the intersection of all sets of probability one is empty. Thus, there are no random infinite sequences!</p>
</blockquote>
<p>An elegant solution to this conundrum was discovered by <a href="http://www.nieuwarchief.nl/serie5/pdf/naw5-2018-19-1-044.pdf">Per Martin-Löf</a>, which <span class="marginnote-outer"><span class="marginnote-ref">restricts $P$ to be <a href="https://en.wikipedia.org/wiki/Computable_function">computable</a></span><label for="9d0d67e30e57d6a3fe027c7ed68c9f0b597b6bb7" class="margin-toggle"> ⊕</label><input type="checkbox" id="9d0d67e30e57d6a3fe027c7ed68c9f0b597b6bb7" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">It can be argued that all feasibly usable probability measures are necessarily computable, and so this is not really a restriction at all.</span></span></span>, but that is unfortunately out of scope for this post (I hope to write a future post on Martin-Löf’s solution).</p>
<h1 id="primer-to-measure-theory"><a class="header-anchor" href="#primer-to-measure-theory">Primer to measure theory</a></h1>
<p>Congratulations! You’ve reached the end of this post. <button class="advanced-button">Click here</button> (or on any <span class="advanced outer hidden"><span class="advanced inner hidden">purple block</span></span>) to unlock the <span class="advanced outer hidden"><span class="advanced inner hidden">purple text</span></span> on measure theory above. After reading this section, return to the earlier sections and take in the finer precision and details offered by your newfound understanding of measure theory.</p>
<p>Terence Tao, in <a href="https://terrytao.files.wordpress.com/2011/01/measure-book1.pdf">An Introduction to Measure Theory</a>, motivates measure theory, saying:</p>
<blockquote>
<p>One of the most fundamental concepts in Euclidean geometry is that of the measure $m(E)$ of a solid body $E$ in one or more dimensions. In one, two, and three dimensions, we refer to this measure as the length, area, or volume of $E$ respectively.<br />
… The physical intuition of defining the measure of a body $E$ to be the sum of the measure of its component “atoms” runs into an immediate problem: a typical solid body would <span class="marginnote-outer"><span class="marginnote-ref">consist of an infinite (and uncountable) number of points</span><label for="2eba793a1e3bccec99c46511d5bb89c632d569b3" class="margin-toggle"> ⊕</label><input type="checkbox" id="2eba793a1e3bccec99c46511d5bb89c632d569b3" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">He is referring to the mathematical ideal of a body being composed of a set of 0-dimensional points.</span></span></span>, each of which has a measure of zero; and the product $\infty \cdot 0$ is indeterminate. To make matters worse, two bodies that have exactly the same number of points, need not have the same measure. For instance, in one dimension, the intervals $A := [0, 1]$ and $B := [0, 2]$ are in one-to-one correspondence (using the bijection $x \mapsto 2x$ from $A$ to $B$), but of course $B$ is twice as long as $A$. So one can disassemble $A$ into an uncountable number of points and reassemble them to form a set of twice the length.</p>
</blockquote>
<p>Terence also mentions the <a href="https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox">Banach-Tarski paradox</a>, which shows that a sphere can be cut into finitely many pieces (only 5 are needed!) that can be rearranged into two spheres. The pieces involved in such paradoxes are always going to be pathological, so the solution is to disallow measurement of these pathological sets. We call those sets <em>non-measurable</em>. If you are curious what non-measurable sets are like, Terence talks about them in section 1.2.3. In the case of the Banach-Tarski paradox, these sets look like fuzzy balls with infinitely many holes in them. The <a href="https://www.youtube.com/watch?v=s86-Z-CbaHA">video on Banach–Tarski by Vsauce</a> gives a good visual depiction.</p>
<p>I will not go into how measurable sets can be defined. There are many approaches, the most common of which is due to <a href="https://en.wikipedia.org/wiki/Lebesgue_measure">Lebesgue</a> (Tao section 1.3). It suffices to say that you cannot have all subsets of $\real$ be measurable without giving up <a href="https://en.wikipedia.org/wiki/Non-measurable_set#Consistent_definitions_of_measure_and_probability">desirable properties of <em>measure</em></a>, e.g. that rearranging and rotating disjoint sets does not change their cumulative measure. In what follows, I’m going to assume that for some set $\Omega$ of any cardinality (finite, countable, uncountable, etc.), we just so happen to be in possession of a reasonable set of measurable sets $E \subseteq 2^\Omega$ and the associated measure $P$. Read Terry’s book for details on how to construct such things. I’m merely going to run through the important definitions and terminology pertaining to probability theory, using the naming conventions of probability theory rather than measure theory.</p>
<p>Let $\Omega$ be some set of any cardinality (finite, countable, uncountable, etc.). Assume we are in possession of the set of all measurable subsets $E \subseteq 2^\Omega$, and $P$ is a <strong>measure</strong>. The triple $(\Omega, E, P)$ is called a <strong>measure space</strong>. $(\Omega, E)$ is a <strong>measurable space</strong> (where no measure is specified). Any set $e \in E$ is called <strong>measurable</strong> and $e’ \notin E$ is called <strong>non-measurable</strong>. The signature of $P$ is $E \to \real$, and so it maps only measurable sets to real numbers representing the measures (sizes) of those sets.</p>
<p>There are a few requirements for $P$ that make it behave like a measure. Repeated from <a href="#definitions">above</a>, they are:</p>
<ul>
<li><strong>Non-negativity</strong>: <script type="math/tex">P(e) \geq 0,\ \forall e \in E</script>.</li>
<li><strong>Null empty set</strong>: <script type="math/tex">P(\emptyset) = 0</script>.</li>
<li><strong>Countable additivity</strong>: For any countable <script type="math/tex">A \subseteq E</script> whose elements are pairwise disjoint, <script type="math/tex">P(\bigcup A) = \sum_{e \in A} P(e)</script>.</li>
</ul>
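<p>For a finite $\Omega$ these requirements can be checked directly. A minimal sketch with a hypothetical fair die (the helper names are mine; exact rationals avoid floating-point noise):</p>

```python
from fractions import Fraction

Omega = frozenset(range(1, 7))
p = {w: Fraction(1, 6) for w in Omega}  # uniform point masses for a fair die

def P(e):
    """Measure of event e: the sum of the point masses it contains."""
    return sum(p[w] for w in e)

evens, odds = frozenset({2, 4, 6}), frozenset({1, 3, 5})
print(P(frozenset()))                         # 0 -- null empty set
print(P(evens | odds) == P(evens) + P(odds))  # True -- additivity on disjoint events
print(P(Omega))                               # 1 -- so P is a probability measure
```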
<p><a name="sigma-algebra"></a><span class="jump_to">
Further, $E$ is required to be a <span class="marginnote-outer"><span class="marginnote-ref"><strong>$\sigma$-algebra</strong></span><label for="c317d919018120cca3d580ae386d9ab852907363" class="margin-toggle"> ⊕</label><input type="checkbox" id="c317d919018120cca3d580ae386d9ab852907363" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Following Tao, section 1.4.2. For further information see <a href="https://en.wikipedia.org/wiki/%CE%A3-algebra">Wikipedia</a>.</span></span></span>, which means it satisfies:</span></p>
<ul>
<li><strong>Empty set</strong>: $\emptyset \in E$.</li>
<li><strong>Complement</strong>: If $e \in E$, then the complement $e^c := \Omega \setminus e$ is also in $E$.</li>
<li><strong>Countable unions</strong>: If $e_1, e_2, \ldots \in E$ then $\bigcup_{n=1}^\infty e_n \in E$.</li>
</ul>
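<p>In the finite case, where countable unions collapse to finite ones, these closure properties can be verified mechanically. A sketch (the function and example family are hypothetical):</p>

```python
from itertools import combinations

def is_sigma_algebra(Omega, E):
    """Check the sigma-algebra axioms for a finite family E of frozensets."""
    if frozenset() not in E:                               # empty set
        return False
    if any(Omega - e not in E for e in E):                 # complements
        return False
    return all(a | b in E for a, b in combinations(E, 2))  # (finite) unions

Omega = frozenset({1, 2, 3, 4})
# The sigma-algebra generated by the partition {{1, 2}, {3, 4}}:
E = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), Omega}
print(is_sigma_algebra(Omega, E))                                     # True
print(is_sigma_algebra(Omega, {frozenset(), frozenset({1}), Omega}))  # False: no {2, 3, 4}
```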
<p>What this all amounts to is that our measure is always non-negative, the empty set is measurable with a measure of 0, complements and countable unions of measurable sets are measurable, and measure is additive (i.e. the sum of the measures of disjoint sets equals the measure of their union).</p>
<p>There’s one more kind of object that probability theory makes heavy use of: the measurable function. Recounting the definition I gave <a href="#motivation-3-construct-events-that-are-guaranteed-measurable">earlier</a>, given two measurable spaces $(A, \mathcal{A})$ and $(B, \mathcal{B})$, a <strong>measurable function</strong> $X : A \to B$ satisfies</p>
<script type="math/tex; mode=display">X^{-1}(b) \in \mathcal{A},\ \forall b \in \mathcal{B}\,,</script>
<p>where <script type="math/tex">X^{-1}(b) = \{\alpha \in A \mid X(\alpha) \in b\}</script> is the pre-image of $X$ on $b \subseteq B$. Every measurable subset of $B$ pulls back to a measurable subset of $A$, but $X$ could still map a measurable subset of $A$ onto a non-measurable subset of $B$. We only care about the pre-image direction, and it becomes apparent why in the <a href="#motivation-3-construct-events-that-are-guaranteed-measurable">section on random variables</a>.</p>
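<p>A finite sketch of pre-images and the measurability check (a hypothetical die-parity example; with power sets as the $\sigma$-algebras, every function is trivially measurable):</p>

```python
from itertools import chain, combinations

def preimage(X, b, A):
    """X^{-1}(b) = {a in A : X(a) in b}."""
    return frozenset(a for a in A if X(a) in b)

def powerset(s):
    """All subsets of s, as frozensets."""
    return {frozenset(c)
            for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

def is_measurable(X, A, cal_A, cal_B):
    """X is measurable iff every b in cal_B pulls back into cal_A."""
    return all(preimage(X, b, A) in cal_A for b in cal_B)

A = frozenset(range(1, 7))                     # die outcomes
X = lambda a: "even" if a % 2 == 0 else "odd"  # RV mapping outcomes to parity

print(preimage(X, {"even"}, A))                # the even outcomes: {2, 4, 6}
print(is_measurable(X, A, powerset(A), powerset({"even", "odd"})))  # True
```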
<p>A <strong>probability measure</strong> is a measure s.t. $P(\Omega) = 1$, i.e. the measure of the entire set $\Omega$ is bounded and equals 1.</p>
Fri, 19 Jun 2020 00:00:00 -0700
pragmanym.github.io/zhat/articles/primer-probability-theory
pragmanym.github.io/zhat/articles/primer-probability-theorypostNotes: Probability & AI Curriculum<p>This is a snapshot of my curriculum for exploring the following questions:</p>
<ul>
<li>Is probability theory all you need to develop AI?
<ul>
<li>If not, what is missing?</li>
</ul>
</li>
<li>Should a theory of AI be expressed in the framework of probability theory at all?</li>
<li>Do brains use probability?</li>
</ul>
<!--more-->
<p>This reflects my current estimate of the landscape, and summarizes where my interests and aspirations have taken me so far. It is not set in stone. I may follow through on it, or I may diverge as I learn more. I primarily follow the current of my curiosity.</p>
<figure><img src="/assets/posts/probability-ai-curriculum/topic-tree.svg" alt="Visualization of topic tree. Nodes are organized hierarchically by level of abstraction, with dotted lines representing non-hierarchical associations. Colors designate hierarchy level." width="100%" /><figcaption>Visualization of topic tree. Nodes are organized hierarchically by level of abstraction, with dotted lines representing non-hierarchical associations. Colors designate hierarchy level. Made with <a href="https://www.yworks.com/yed-live/">https://www.yworks.com/yed-live/</a></figcaption></figure>
<h1 id="description-of-topics"><a class="header-anchor" href="#description-of-topics">Description of topics</a></h1>
<p>Here are the topics from the graph above, with descriptions to the extent that I understand them, and links to reference material.</p>
<ul>
<li>
<dl>
<dt><strong>Objective probability</strong></dt>
<dd>Is probability an objective property of physical systems in general (not just i.i.d.)? Objective, meaning independently arrived at by multiple parties, like a scientific experiment (just as mass and energy measurements can be independently verified) - i.e. not dependent on a particular brain with particular beliefs. If p(x) = θ, then this is true even if no humans are around at all to believe it. The main problem in making probability objective is figuring out how to uniquely determine the probability of something given observations. What needs to be measured in order to ascertain the objective probability of a system?</dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Solomonoff induction</strong></dt>
<dd>A Bayesian inference setup general enough to encompass general intelligence. The posterior converges to the true data posterior in the infinite-data limit (for any prior with support everywhere), possibly providing an objective notion of probability, at least for infinite sequences.<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a><br />
‣ <a href="https://arxiv.org/abs/cs/0305052">On the Existence and Convergence of Computable Universal Priors</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Approximations</strong></dt>
<dd>How can SI be implemented in practice? How would brains implement it?<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm#approx">http://www.hutter1.net/ai/uaibook.htm#approx</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Posterior convergence</strong></dt>
<dd>The sense in which Solomonoff induction is objective. The predicted posterior converges to the true data posterior with infinite observations, for any prior with support over all hypotheses.<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a>, Theorem 3.19</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Posterior consistency</strong></dt>
<dd>Solomonoff induction may not be consistent, meaning it may fail to distinguish between two hypotheses even given infinite data. Implications for objective probability.</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Prior with universally optimal convergence</strong></dt>
<dd>Solomonoff’s universally optimal prior.<br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a>, Theorem 3.70</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Convergence on individual sequences</strong></dt>
<dd>Convergence of Solomonoff induction is not guaranteed on a measure-0 set of sequences. Construction of such a sequence.<br />
‣ <a href="https://arxiv.org/abs/cs/0407057">Universal Convergence of Semimeasures on Individual Random Sequences</a>, Theorem 6 and Proposition 12</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>(Non-)Equivalence of Universal Priors</strong></dt>
<dd>A surprising equivalence between mixtures of deterministic programs and computable distributions.<br />
‣ <a href="https://arxiv.org/abs/1111.3854">(Non-)Equivalence of Universal Priors</a>, Theorem 14</dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Martin-Löf randomness</strong></dt>
<dd>What it means for an infinite sequence to be drawn from a probability distribution. Algorithmic definition of randomness (see AIT).<br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a></dd>
</dl>
<ul>
<li><strong>Definition in terms of universal probability</strong><br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a><br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a></li>
<li><strong>Can sequences be Martin-Löf random w.r.t. multiple probability measures?</strong></li>
</ul>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Bayesian epistemology</strong></dt>
<dd>Are priors and posteriors all that is needed for a complete theory of knowledge, and do they provide a sufficient framework for building an intelligent system? Bayesian epistemology repurposes probability as a property of the intelligent agent doing the observing, rather than the system being observed (or perhaps it characterizes their interaction), i.e. probability as belief.</dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Bayesian brain hypothesis</strong></dt>
<dd>Hypothesis in neuroscience that the brain is largely an approximate Bayesian inference engine.<br />
‣ <a href="https://pubmed.ncbi.nlm.nih.gov/15541511/">The Bayesian Brain: The Role of Uncertainty in Neural Coding and Computation</a><br />
‣ <a href="https://mitpress.mit.edu/books/bayesian-brain">Bayesian Brain: Probabilistic Approaches to Neural Coding</a><br />
‣ <a href="https://www.annualreviews.org/doi/full/10.1146/annurev.psych.55.090902.142005">Object Perception as Bayesian Inference</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Friston’s free energy principle</strong></dt>
<dd>A unified theory of biological intelligence from which Bayesian epistemology can be derived.<br />
‣ <a href="https://arxiv.org/abs/1901.07945">What does the free energy principle tell us about the brain?</a><br />
‣ <a href="https://www.fil.ion.ucl.ac.uk/~karl/The%20free-energy%20principle%20A%20unified%20brain%20theory.pdf">The free-energy principle: a unified brain theory?</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>How brains approximate Bayesian inference</strong></dt>
<dd>To make the Bayesian brain hypothesis falsifiable, a characterization of what counts as an approximation to Bayesian inference needs to be given. What approximate Bayesian computations in the brain have been found so far by neuroscientists? <em>Reference same sources listed under “Bayesian brain hypothesis”</em></dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Causal inference</strong></dt>
<dd>If Bayesian epistemology is not sufficient, then what is missing? Judea Pearl proposes causal inference.<br />
‣ <a href="http://bayes.cs.ucla.edu/BOOK-2K/">Causality</a>, chapters 3 and 7<br />
‣ <a href="https://arxiv.org/abs/1305.5506">Introduction to Judea Pearl’s Do-Calculus</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Bounded Rationality</strong></dt>
<dd>What would Bayesian epistemology theoretically look like with bounded resources? Is Bayesian epistemology no longer optimal given bounded resources?<br />
‣ <a href="https://stanford.edu/~icard/BBRA.pdf">Bayes, Bounds, and Rational Analysis</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Logical justifications</strong></dt>
<dd>Arguments from first principles that Bayesian epistemology is a necessary condition for rationality, and that a rational agent is necessarily a Bayesian agent (such an agent is likely performing Solomonoff induction, in order for it to be sufficiently general in its prediction ability).</dd>
</dl>
<ul>
<li><strong>Dutch book argument</strong></li>
<li><strong>Complete classes</strong></li>
<li><strong>Cox’s theorem</strong></li>
<li><strong>Von Neumann-Morgenstern utility theorem</strong></li>
</ul>
</li>
<li>
<dl>
<dt><strong>Motivation from decision theory</strong></dt>
<dd>Some say a theory is good because it is useful. Perhaps the question “what theory of uncertainty should I use?” is best answered by looking at what we want to do with it, namely decision making under uncertainty. Bayesian epistemology can be motivated by way of decision theory.<br />
‣ <a href="https://www.goodreads.com/book/show/1639056.The_Foundations_of_Statistics">The Foundations of Statistics</a>, chapter 3</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Unique priors</strong></dt>
<dd>How to choose a prior is one point of contention in Bayesian epistemology. There are some proposed methods for selecting a unique prior given what you already know, for example, the max-entropy principle.<br />
‣ <a href="https://arxiv.org/abs/1108.2120">Objective Priors: An Introduction for Frequentists</a><br />
‣ <a href="https://arxiv.org/pdf/0808.0012.pdf">LECTURES ON PROBABILITY, ENTROPY, AND STATISTICAL PHYSICS</a></dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Algorithmic information theory (AIT)</strong></dt>
<dd>An alternative to probability theory devised by Kolmogorov himself (and others) to address its shortcomings. Does AIT allow us to formalize the general learning problem of transferring knowledge out-of-distribution?<br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a><br />
‣ <a href="https://bookstore.ams.org/surv-220">Kolmogorov Complexity and Algorithmic Randomness</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Types of Kolmogorov complexity</strong></dt>
<dd>There is a constellation of algorithmic complexity functions that make up the foundation of AIT. <em>Reference same sources listed under “Algorithmic information theory”</em></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>Resource bounded complexities</strong></dt>
<dd>Kolmogorov complexity with bounded computation. Possible direction for computable-AIT.<br />
‣ <a href="https://www.springer.com/gp/book/9781489984456">An Introduction to Kolmogorov Complexity and Its Applications</a>, chapter 7</dd>
</dl>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>Algorithmic transfer learning</strong></dt>
<dd>How can the information shared by two datasets be defined? What is the objective of transfer learning?<br />
‣ <a href="http://users.cecs.anu.edu.au/~hassan/univTLTCS.pdf">On Universal Transfer Learning</a><br />
‣ <a href="https://papers.nips.cc/paper/3228-transfer-learning-using-kolmogorov-complexity-basic-theory-and-empirical-evaluations.pdf">Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations</a><br />
‣ <a href="https://arxiv.org/abs/1904.03292">The Information Complexity of Learning Tasks, their Structure and their Distance</a></dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>No free lunch theorem</strong></dt>
<dd>Theorem stating there is no universally best algorithm for all training-test dataset pairs.<br />
‣ <a href="https://www.cse.huji.ac.il/~shais/UnderstandingMachineLearning/">Understanding Machine Learning: From Theory to Algorithms</a>, Theorem 5.1</dd>
</dl>
</li>
</ul>
</li>
</ul>
</li>
<li>
<dl>
<dt><strong>AIXI</strong></dt>
<dd>A theory of optimal intelligence put forth by Marcus Hutter based on Solomonoff induction. <br />
‣ <a href="http://www.hutter1.net/ai/uaibook.htm">Universal Artificial Intelligence</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Data compression</strong></dt>
<dd>Lossless compression from the perspectives of Shannon’s information theory and AIT. Can they be unified? Can compression make probability objective? What is the relationship between compression and intelligence?<br />
‣ <a href="https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959">Elements of Information Theory</a><br />
‣ <a href="http://mattmahoney.net/dc/dce.html">Data Compression Explained</a></dd>
</dl>
</li>
<li>
<dl>
<dt><strong>Decision theory under ignorance</strong></dt>
<dd>Decision theory without probability. Pros and cons.<br />
‣ <a href="https://www.cambridge.org/core/books/an-introduction-to-decision-theory/B9EEB3DCE5D0CAFFB6F3F30B1D0A06A6">An Introduction to Decision Theory</a>, chapter 3</dd>
</dl>
</li>
<li>
<dl>
<dt><strong>The Fundamental Theorem of Statistical Learning (PAC)</strong></dt>
<dd>An introduction to PAC-learning theory. PAC is a probability-theory-based account of machine learning which AIT could replace.<br />
‣ <a href="https://www.cse.huji.ac.il/~shais/UnderstandingMachineLearning/">Understanding Machine Learning: From Theory to Algorithms</a>, Theorem 6.7</dd>
</dl>
<ul>
<li>
<dl>
<dt><strong>PAC account of transfer learning</strong></dt>
<dd>PAC analysis of transfer learning. However, assumptions about relatedness of tasks need to be made.<br />
‣ <a href="https://arxiv.org/abs/1106.0245">A Model of Inductive Bias Learning</a></dd>
</dl>
</li>
</ul>
</li>
</ul>
Wed, 17 Jun 2020 00:00:00 -0700
pragmanym.github.io/zhat/articles/probability-ai-curriculum
pragmanym.github.io/zhat/articles/probability-ai-curriculumnotesNotes: Dutch Book Argument<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#axioms-of-probability" id="markdown-toc-axioms-of-probability">Axioms of probability</a> <ul>
<li><a href="#axioms-of-probability-for-propositional-logic" id="markdown-toc-axioms-of-probability-for-propositional-logic">Axioms of probability for propositional logic</a></li>
</ul>
</li>
<li><a href="#i-if-not-bayesian--sure-loss-is-possible" id="markdown-toc-i-if-not-bayesian--sure-loss-is-possible">I. If not Bayesian ⟹ sure loss is possible</a> <ul>
<li><a href="#but-how-does-this-lead-to-bayes-rule" id="markdown-toc-but-how-does-this-lead-to-bayes-rule">But how does this lead to Bayes rule?</a></li>
<li><a href="#but-is-real-life-a-series-of-bets" id="markdown-toc-but-is-real-life-a-series-of-bets">But is real-life a series of bets?</a></li>
</ul>
</li>
<li><a href="#ii-if-bayesian--sure-loss-is-not-possible" id="markdown-toc-ii-if-bayesian--sure-loss-is-not-possible">II. If Bayesian ⟹ sure loss is not possible</a></li>
</ul>
<p>Main source: <a href="https://plato.stanford.edu/entries/dutch-book/">Dutch Book Arguments (SEP)</a></p>
<p>Dutch Book Theorem:</p>
<blockquote>
<p>Given a set of betting quotients that fails to satisfy the probability axioms, there is a set of bets with those quotients that guarantees a net loss to one side.</p>
</blockquote>
<p>Converse Dutch Book Theorem:</p>
<blockquote>
<p>For a set of betting quotients that obeys the probability axioms, there is no set of bets (with those quotients) that guarantees a sure loss (win) to one side.</p>
</blockquote>
<p><a href="https://www.stat.berkeley.edu/~census/dutchdef.pdf">https://www.stat.berkeley.edu/~census/dutchdef.pdf</a>:</p>
<blockquote>
<p>Dutch book cannot be made against a Bayesian bookie.</p>
</blockquote>
<p>I. If not Bayesian ⟹ sure loss is possible</p>
<ul>
<li><a href="https://link.springer.com/chapter/10.1007%2F978-1-4612-0919-5_10">Foresight: Its Logical Laws, Its Subjective Sources</a></li>
</ul>
<p>II. If Bayesian ⟹ sure loss is not possible</p>
<ul>
<li><a href="https://www.jstor.org/stable/2268221?seq=1">On Confirmation and Rational Betting</a></li>
<li><a href="https://www.jstor.org/stable/2268222">Fair Bets and Inductive Probabilities</a></li>
</ul>
<p>Counter-arguments:</p>
<ul>
<li><a href="https://link.springer.com/article/10.1023/A:1004996226545">Hidden Assumptions in the Dutch Book Argument</a></li>
</ul>
<h1 id="axioms-of-probability"><a class="header-anchor" href="#axioms-of-probability">Axioms of probability</a></h1>
<p><strong>The axioms of probability:</strong><br />
Let $(\Omega, \mathcal{E}, P)$ be a measure space, where $\Omega$ is the sample set (mutually exclusive outcomes), $\mathcal{E}$ is the event set (set of measurable subsets of $\Omega$), and $P$ is the probability measure ($P(E),\ \forall E \in \mathcal{E}$ is well defined).</p>
<ol>
<li>$P(E) \geq 0,\ \forall E \in \mathcal{E}$</li>
<li>$P(\Omega) = 1$</li>
<li>$P(E_1 \cup E_2) = P(E_1) + P(E_2) \iff E_1 \cap E_2 = \emptyset,\ \forall E_1,E_2 \in \mathcal{E}$</li>
</ol>
<p>Note that $P(E) \leq 1,\ \forall E \in \mathcal{E}$ follows directly from the axioms.</p>
<h2 id="axioms-of-probability-for-propositional-logic"><a class="header-anchor" href="#axioms-of-probability-for-propositional-logic">Axioms of probability for propositional logic</a></h2>
<p>We can define probability over propositional statements. The sample set $\Omega$ is the set of all truth values of the primitives. If $(A_1, A_2, \ldots)$ is the set of all primitive propositions, then $\Omega = \{(\mathrm{False}, \mathrm{False}, \ldots), (\mathrm{True}, \mathrm{False}, \ldots), (\mathrm{False}, \mathrm{True}, \ldots), (\mathrm{True}, \mathrm{True}, \ldots), \ldots\}$ is every possible truth assignment for $(A_1, A_2, \ldots)$. This is assuming that we don’t know the truth value of any primitive. The <a href="https://en.wikipedia.org/wiki/Logical_connective">logical connectives</a>, $\wedge, \vee, \neg,$ etc., are all shorthands for constructing events (sets of truth assignments for $(A_1, A_2, \ldots)$). In other words, $P(H)$ is shorthand for the probability that proposition $H$ is true, where $H$ denotes an event $E$ containing exactly every truth assignment for $(A_1, A_2, \ldots)$ which makes $H$ true.</p>
<p>Note that when there are finitely many $A_i$, there will be finitely many possible events. However, there are infinitely many logical propositions over finitely many primitives $A_i$, which is possible because most propositions are logically equivalent to others. In other words, we are forming equivalence classes over the set of propositions, grouping propositions by the set of primitive assignments that make them true. For finitely many primitives, the set of equivalence classes is finite.</p>
<p>Now the axioms for probability over propositional logic are just a special case of the general axioms:<br />
Let $\mathcal{H}$ be the set of all logical propositions.<br />
Let $\mathrm{True}$ be the proposition <em>True</em>, which is satisfied by all truth assignments of the primitives (i.e. the event containing all samples).</p>
<ol>
<li>$P(H)\geq 0,\ \forall H \in \mathcal{H}$</li>
<li>$P(\mathrm{True}) = 1$</li>
<li>$P(H_1 \vee H_2) = P(H_1) + P(H_2) \iff \neg (H_1 \wedge H_2),\ \forall H_1, H_2 \in \mathcal{H}$</li>
</ol>
<p>Axiom 3 states that the probability of $H_1$ or $H_2$ is the sum of their probabilities iff $H_1$ and $H_2$ cannot both be true at the same time. $H_1 \wedge H_2$ constructs the set of primitive assignments where both propositions are true, which is just the intersection of their respective events, $E_1 \cap E_2$. However, writing $H_1 \wedge H_2 = \mathrm{False}$, where $\mathrm{False}$ denotes the empty event, is unconventional, so instead we write $\neg (H_1 \wedge H_2)$, which is equivalent to $\overline{E_1 \cap E_2} = \Omega$ (complement). One could also write $P(H_1 \vee H_2) = P(H_1) + P(H_2) \iff P(H_1 \wedge H_2) = 0$.</p>
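<p>To make this concrete, here is a small Python sketch (my own, purely illustrative) that enumerates truth assignments for two primitives, builds events from propositions, and checks axiom 3 numerically:</p>

```python
from itertools import product

# Two primitive propositions (A1, A2) under a uniform measure over the
# four truth assignments (an assumed toy setup).
omega = list(product([False, True], repeat=2))  # sample set: all assignments
p = {w: 0.25 for w in omega}                    # P(Omega) = 1  (axiom 2)

def event(prop):
    """Event of a proposition: the set of assignments that satisfy it."""
    return frozenset(w for w in omega if prop(*w))

def P(E):
    return sum(p[w] for w in E)

H1 = event(lambda a1, a2: a1 and not a2)  # A1 AND NOT A2
H2 = event(lambda a1, a2: not a1)         # NOT A1
assert not (H1 & H2)                      # disjoint events: NOT (H1 AND H2)
assert abs(P(H1 | H2) - (P(H1) + P(H2))) < 1e-12  # axiom 3 holds
```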
<p>We could also define probability over 1st order logic. Now $A(x)$ is a proposition on $x$, where $x$ is a non-proposition type (e.g. number). Let’s say $x$ is a natural number, then we have infinitely many primitive propositions $A(x)$ for each $x \in \mathbb{N}$.</p>
<p>How does this inform the claims made in <a href="https://meaningness.com/probability-and-logic">https://meaningness.com/probability-and-logic</a> ?</p>
<h1 id="i-if-not-bayesian--sure-loss-is-possible"><a class="header-anchor" href="#i-if-not-bayesian--sure-loss-is-possible">I. If not Bayesian ⟹ sure loss is possible</a></h1>
<p>The Dutch book argument (DBA) uses traditional terminology around betting which I find to be confusing in this context, like <em>bookie</em>, <em>agent</em>, <em>making book</em>, etc., so I will take care to clarify the meaning of all these things.</p>
<p>Consider an asymmetric two-player betting game. Player 1 transacts with player 2 who sets prices.</p>
<p>Player 1:</p>
<ul>
<li>Called the <em>agent</em></li>
<li>Chooses what bets to buy or sell to/from the bookie at the bookie’s prices.</li>
<li>Chooses the stakes.</li>
<li>Accepts the bookie’s prices.</li>
</ul>
<p>Player 2:</p>
<ul>
<li>Called the <em>bookie</em></li>
<li>Chooses the prices for bets.</li>
<li>Must buy or sell any bets the agent requests, at whatever stakes the agent requests.</li>
</ul>
<p>The DBA shows that the agent can take advantage of the bookie (make book against the bookie) iff the bookie’s bet prices do not conform to the axioms of probability. Here, taking advantage means transacting (buying/selling) a set of bets with the bookie that guarantees the agent wins money off the bookie in every scenario.</p>
<p>We assume the bookie makes all prices on all possible bets known to the agent from the start, and the bookie cannot change these prices. The bookie wants to choose prices such that the agent cannot make book against him/her.</p>
<p>A bet is defined by a stake $S$, betting quotient $q$, and target event $E$. When the agent buys a bet (the bookie sells a bet) at stake $S$ with quotient $q$, the agent’s payoff is $S-qS$ if $E$ occurs, and $-qS$ otherwise. $S$ is only paid out to the agent (holder of the stake) when the target event $E$ occurs. $qS$ is paid to the seller regardless of outcome; it functions as a fee.</p>
<p>The agent can also sell a bet to the bookie, which negates the payoffs: the bookie pays the agent the fee, and the agent pays out the stake to the bookie if $E$ occurs.</p>
<p>We saw how probability over logical propositions is a special case. I think it is easier to reason about DBA if we instead consider an arbitrary probability distribution over events $E \in \mathcal{E}$. These events are the possible targets of bets. The bookie must choose $q(E)$ for each event. $q(E)$ will end up being a probability measure over $\mathcal{E}$. We will now show that if $q(E)$ violates any of the axioms, the agent can make book against the bookie.</p>
<p>Define a bet as a function $B : \Omega \to \mathbb{R}$ from samples to payoffs. A bet on event $E$ with stake $S$ and quotient $q$ has payoffs (w.r.t. buyer):</p>
<script type="math/tex; mode=display">% <![CDATA[
B_E(\omega) = \begin{cases}S-q(E)S & \omega \in E \\ -q(E)S & \omega \notin E\end{cases} %]]></script>
<p>This can be represented as a table:</p>
<table>
<thead>
<tr>
<th>Result</th>
<th>Payoff</th>
</tr>
</thead>
<tbody>
<tr>
<td>$E$</td>
<td>$S-q(E)S$</td>
</tr>
<tr>
<td>$\overline{E}$</td>
<td>$-q(E)S$</td>
</tr>
</tbody>
</table>
<p>Assuming the stake $S$ is always the same (the argument is invariant to the stake, as long as it is positive), a bet is represented by $B_E$. Since this game is zero-sum, the payoff from the seller’s perspective is $-B_E$; buying $-B_E$ is equivalent to selling $B_E$. We can also add bets like this
<script type="math/tex; mode=display">\left(B_{E_1} + B_{E_2}\right)(\omega) = B_{E_1}(\omega) + B_{E_2}(\omega)\,,</script>
<p>to construct a more complicated multi-outcome bet, denoted as $B_{E_1} + B_{E_2}$.</p>
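<p>This bet algebra is easy to sketch in Python (the sample names and quotients below are made up for illustration):</p>

```python
S = 1.0  # fixed stake

def bet(members, q):
    """Payoff function (buyer's perspective) of a bet on event `members`."""
    return lambda w: S - q * S if w in members else -q * S

def neg(b):
    """Selling B is buying -B (the game is zero-sum)."""
    return lambda w: -b(w)

def add(b1, b2):
    """Compound bet: payoffs add pointwise."""
    return lambda w: b1(w) + b2(w)

B1 = bet({"rain"}, 0.3)
B2 = bet({"snow"}, 0.2)
combo = add(B1, neg(B2))                 # buy B1, sell B2
assert abs(combo("rain") - 0.9) < 1e-9   # (1 - 0.3) + 0.2
```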
<p>Now I am ready to outline why bets should conform to the three axioms:</p>
<p><strong>Axiom 1:</strong> $P(E) \geq 0,\ \forall E \in \mathcal{E}$.<br />
Assume $q(E) < 0$ for some $E$.<br />
Then the agent will buy $B_E$, which has a positive payoff in all cases.</p>
<p><strong>Axiom 2:</strong> $P(\Omega) = 1$.<br />
Note that by definition event $\Omega$ always happens, and this is known to both the agent and bookie.<br />
Assume $q(\Omega) < 1$.<br />
Then the agent will buy $B_\Omega$ since the payoff is always $S-q(\Omega)S$ which is positive.<br />
Assume $q(\Omega) > 1$.<br />
Then the agent will buy $-B_\Omega$ (sell $B_\Omega$) since the payoff is always $-(S-q(\Omega)S)$, which is positive.</p>
<p><strong>Axiom 3:</strong> $P(E_1 \cup E_2) = P(E_1) + P(E_2) \iff E_1 \cap E_2 = \emptyset,\ \forall E_1,E_2 \in \mathcal{E}$.<br />
Note that $E_1 \cap E_2 = \emptyset$ means $E_1$ and $E_2$ cannot happen simultaneously (by definition of the empty event), and this is known to both the agent and bookie.<br />
Assume $E_1 \cap E_2 = \emptyset$ for some $E_1, E_2$.<br />
Assume $q(E_1 \cup E_2) > q(E_1) + q(E_2)$.<br />
Then the agent will buy $B_{E_1} + B_{E_2} - B_{E_1 \cup E_2}$ which has payoff table (w.r.t. agent):</p>
<table>
<thead>
<tr>
<th>Result</th>
<th>Payoff</th>
</tr>
</thead>
<tbody>
<tr>
<td>$E_1$</td>
<td>$-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$</td>
</tr>
<tr>
<td>$E_2$</td>
<td>$-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$</td>
</tr>
<tr>
<td>$\overline{E_1 \cup E_2}$</td>
<td>$-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$</td>
</tr>
</tbody>
</table>
<p>The payoff is the same in all cases ($E_1 \cap E_2$ never occurs), and by the assumption $q(E_1 \cup E_2) > q(E_1) + q(E_2)$, the quantity $-(q(E_1) + q(E_2) - q(E_1 \cup E_2))S$ is positive.<br />
Assume $q(E_1 \cup E_2) < q(E_1) + q(E_2)$.<br />
Then the agent buys $-B_{E_1} - B_{E_2} + B_{E_1 \cup E_2}$, which is easy to show wins money for the agent in every scenario.</p>
<p>Thus it is wise for the bookie to choose $q : \mathcal{E} \to \mathbb{R}$ s.t. it obeys the three axioms of probability.</p>
<p>The DBA is ingenious because it does not assume any a priori probabilities over outcomes (i.e. objective probability), and it holds for 1-shot events (i.e. does not assume the game is repeatable).</p>
<h2 id="but-how-does-this-lead-to-bayes-rule"><a class="header-anchor" href="#but-how-does-this-lead-to-bayes-rule">But how does this lead to Bayes rule?</a></h2>
<p>Bayesian epistemology centers around using Bayes rule to compute a posterior from a prior. Where is the prior and posterior here?</p>
<p>$q(E)$ is not a prior because $E$ is a datum, not a hypothesis. The DBA concludes that $q$ should be a valid probability measure. But how do we do all the fancy stuff that Bayesian inference requires like marginalizing over variables and computing conditional probabilities? To do that, we need at least two random variables, which we could define over our sample space. What would those two RVs be?</p>
<p>In the framing of DBA, the world starts out completely unknown, and at the conclusion of the betting becomes completely known. There is no reason for a prior or posterior distribution. $q(E)$ is a likelihood distribution conditioned on nothing, i.e. probability of data without regard to hypotheses. There is nothing Bayesian about this because we are <em>literally not using Bayes’ rule because we only have one random variable!</em></p>
<p>However, the DBA is clearly suggesting that $q(E)$ should encode the bookie’s beliefs about the outcomes, and is thus the prior. So then what is the posterior? We can get around this conundrum by supposing the bookie takes bets on the outcomes of some underlying process, i.e. time series, and updates $q(E)$ as time passes and outcomes are observed. Now we are computing a type of posterior: $q(E_t \mid E_{1:t-1})$ where $E_{1:t-1}$ is all previous observations. Hypotheses are technically not needed, but the bookie is free to secretly have a second RV over hypotheses under the hood (maybe the bookie is doing Solomonoff induction).</p>
<p><strong>Question:</strong> Are we confusing frequentist and Bayesian probability here?<br />
In the Bayesian paradigm, hypotheses are themselves usually probability distributions, i.e. $p(X \mid H=h) = p_h(X)$ where $p_h(X)$ is a hypothesis labeled with $h$. What is the meaning of the probabilities in $p_h(X)$? Are these probabilities objective? If not, what does it mean for a hypothesis to be satisfied by data? We could consider likelihood to be a score, rather than an objective quantity, and a better hypothesis has a better score by definition (rather than thinking of the likelihood of data under the hypothesis as a frequentist prediction that can be tested through repeated experiment).</p>
<h2 id="but-is-real-life-a-series-of-bets"><a class="header-anchor" href="#but-is-real-life-a-series-of-bets">But is real-life a series of bets?</a></h2>
<p>The setup of the game described above ends up being isomorphic to probability theory.</p>
<p><strong>Question:</strong> Why does this isomorphism exist? Is there something intrinsic about betting that makes it conform to the rules of probability, or is this an artifact of the particular betting payout definition we are using?<br />
This payout scheme is apparently economically justified and not arbitrarily chosen, i.e. bets (with predetermined payouts) traded on a market will have purchase prices that converge to the model above (assuming sufficient arbitrage). Note that real-world quotients are discretized and don’t sum exactly to 1, and there is a bid-ask spread which essentially adds a transaction cost to everything. Real world example: <a href="https://www.predictit.org/markets/detail/3698">https://www.predictit.org/markets/detail/3698</a>. In economics, decisions under uncertainty are modeled in the same betting form, e.g. insurance (the premium is the quotient, the payout is the stake).</p>
<p><strong>Question:</strong> During the course of everyday life, is the universe going to make book against us lest we conform to the rules of probability? Do the sorts of real-life bets we actually encounter and place have the same structure as the idealized betting above?</p>
<p>Who is the bookie and who is the agent? The DBA says that you are the bookie over the course of your life, and you want to prevent the universe (or adversarial actors) from taking advantage of you. The problem is that you are often the one making decisions, i.e. deciding which bets to place. This is a different game, where the bookie chooses the quotients and the bets together. Also, real life is not zero-sum. You will encounter win-win and lose-lose situations where you have to place a bet one way or the other and net win or lose. The universe is not an optimally rational agent either. I don’t expect the universe to spontaneously Dutch-book me. I don’t even expect people to Dutch-book me, because that would take work. In practice, not everyone acts optimally.</p>
<p><strong>Question:</strong> What if outcomes are not binary? i.e. $\omega \in E$ and $\omega \notin E$ are not the only possibilities (we don’t assume the law of the excluded middle, so a proof of $\omega \in E$ or $\omega \notin E$ must be constructed).<br />
For example, what if it is not always possible to determine whether an event occurred? This is the case in Solomonoff induction, which uses a semi-measure rather than a measure to get around this problem. In practice, with things like elections and trials there is a large vested interest in ensuring an outcome is determined. But over the course of everyday life, there are many ambiguities.</p>
<p>In real life, you are more like the agent. You choose your bets (take actions) with pre-defined payoffs (the downstream results of your actions are not usually in your control). These payoffs are not logically determined, but are the result of (often arbitrary) circumstance. It is very easy to Dutch book the universe! That’s generally how growth and progress happen.</p>
<p>Presumably in a formal betting scenario the bookie’s probabilities are well-tuned, so that the bookie is indifferent to whether someone buys or sells a given bet. In everyday life, the payoffs of your decisions usually do not match your preferred betting quotients, so that there is one or a few best bets. The whole point of betting is that you believe the outcomes don’t match the “true” betting quotients. The DBA assumes that someone else might give you a series of bets which are locally in agreement with your quotients but globally a guaranteed loss. The problem is, you may not be compelled to take bets that agree with your expectations, but only take bets where the expected return is positive, i.e. disagreement.</p>
<h1 id="ii-if-bayesian--sure-loss-is-not-possible"><a class="header-anchor" href="#ii-if-bayesian--sure-loss-is-not-possible">II. If Bayesian ⟹ sure loss is not possible</a></h1>
<p>TODO</p>
Thu, 11 Jun 2020 00:00:00 -0700
pragmanym.github.io/zhat/articles/notes-dutch-book-argument
pragmanym.github.io/zhat/articles/notes-dutch-book-argumentnotesNotes: Complete Class Theorems<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#results-to-understand-in-hoff" id="markdown-toc-results-to-understand-in-hoff">Results to understand in Hoff</a></li>
<li><a href="#notes" id="markdown-toc-notes">Notes</a> <ul>
<li><a href="#complete-class-theorem-i" id="markdown-toc-complete-class-theorem-i">Complete class theorem I</a></li>
<li><a href="#complete-class-theorem-ii" id="markdown-toc-complete-class-theorem-ii">Complete class theorem II</a></li>
<li><a href="#euclidean-parameter-spaces" id="markdown-toc-euclidean-parameter-spaces">Euclidean parameter spaces</a></li>
<li><a href="#complete-class-theorem-iii" id="markdown-toc-complete-class-theorem-iii">Complete class theorem III</a></li>
</ul>
</li>
<li><a href="#interpretation-and-implications" id="markdown-toc-interpretation-and-implications">Interpretation and implications</a> <ul>
<li><a href="#discussion" id="markdown-toc-discussion">Discussion</a></li>
</ul>
</li>
</ul>
<p><strong>Objective:</strong> I want to understand the complete class theorems because they are a common argument for Bayesian epistemology, a theory of knowledge that puts forward Bayesian posterior calculation as all you need. In order to properly evaluate whether “being Bayesian” is enough of a theoretical framework to build and explain intelligence, I need to understand arguments for Bayesian epistemology.</p>
<p>The argument boils down to:</p>
<blockquote>
<p>If you agree with expected utility as your objective, then you have to be Bayesian.</p>
</blockquote>
<p>In a nutshell: a strategy is inadmissible if there exists another strategy that is at least as good in all situations and strictly better in at least one. If you want your strategy to be admissible, it should be equivalent to a Bayes estimator.</p>
<p>Complete class theorems (roughly): every admissible strategy is Bayes, and Bayes strategies are admissible.</p>
<p>I’m mainly following <a href="https://www.stat.washington.edu/people/pdhoff/courses/581/LectureNotes/admiss.pdf">Admissibility and complete classes - Peter Hoff</a>.</p>
<p>Related study notes: <a href="https://docs.google.com/document/d/1fCseo1fsPwJfjnehauAzOr4bf1GHHRfRW6cHwNQTNu4/edit">Wald’s Complete Class Theorem(s) - study notes</a></p>
<h1 id="results-to-understand-in-hoff"><a class="header-anchor" href="#results-to-understand-in-hoff">Results to understand in <a href="https://www.stat.washington.edu/people/pdhoff/courses/581/LectureNotes/admiss.pdf">Hoff</a></a></h1>
<p><strong>Section 1</strong>:<br />
<img src="https://i.imgur.com/KSZ6PVb.png" alt="" /></p>
<p><strong>Section 2</strong>:<br />
<img src="https://i.imgur.com/F94ljVs.png" alt="" /><br />
<img src="https://i.imgur.com/2QW8pcP.png" alt="" /></p>
<p><strong>Section 3</strong>:<br />
<img src="https://i.imgur.com/U2npCDa.png" alt="" /></p>
<p><strong>Section 4</strong>:<br />
<img src="https://i.imgur.com/XPqhZ4E.png" alt="" /><br />
<img src="https://i.imgur.com/H8uda4H.png" alt="" /><br />
<img src="https://i.imgur.com/fAvOCcu.png" alt="" /></p>
<p><strong>Section 5</strong> covers similar results for infinite parameter spaces (so far results are for finite parameter spaces).</p>
<p><strong>Section 6</strong>:<br />
<img src="https://i.imgur.com/B9EDHSE.png" alt="" /></p>
<p><img src="https://i.imgur.com/TQjLvpT.png" alt="" /><br />
<img src="https://i.imgur.com/RRj8Mwi.png" alt="" /><br />
<img src="https://i.imgur.com/XVFp6DY.png" alt="" /></p>
<h1 id="notes"><a class="header-anchor" href="#notes">Notes</a></h1>
<script type="math/tex; mode=display">\newcommand{\bb}{\mathbb}
\newcommand{\mc}{\mathcal}
\newcommand{\d}{\delta}
\newcommand{\p}{\pi}
\newcommand{\t}{\theta}
\newcommand{\T}{\Theta}
\newcommand{\fa}{\forall}
\newcommand{\ex}{\exists}
\newcommand{\real}{\bb{R}}
\newcommand{\E}{\bb{E}}
\renewcommand{\D}[1]{\operatorname{d}\!{#1}}
\DeclareMathOperator*{\argmin}{argmin}</script>
<p>Let $(\mc{X}, \mc{A}, P_\t)$ be a probability space for all $\t \in \T$.<br />
$\mc{X}$ is the sample space.<br />
$\T$ is the parameter space.<br />
$\mc{P} = \{P_\t : \t \in \T\}$ is the <em>model</em>, i.e. the set of all probability measures specified by the parameter space.</p>
<p>We wish to estimate some unknown $g(\t)$ which depends in a known way on $\t$. The text does not tell us what type $g(\t)$ is, and it does not matter for the discussion since it will always be hidden behind our loss function. The text uses $g(\T)$ (the image of $g$) to denote the space of all such $g$, but I find it less confusing and more direct to use $G = g(\T)$.</p>
<p>A <strong>loss function</strong> is a function $L : \T \times G \to \real^+$ which is always 0 for equivalent inputs, i.e.<br />
<script type="math/tex">L(\t, g(\t)) = 0,\ \fa \t \in \T\,.</script><br />
Note that $L(\t_1, g(\t_2))$ may be 0 when $\t_1 \neq \t_2$.</p>
<p>A <strong>non-randomized estimator</strong> for $g(\t)$ is a function $\d : \mc{X} \to G$ s.t. $x \mapsto L(\t, \d(x))$ is a measurable function (of $x$) for all $\t \in \T$. A <a href="https://en.wikipedia.org/wiki/Measurable_function">function is measurable</a> if the preimage of any measurable set is measurable, i.e. it preserves measurability. Concretely in this case, $\{x : L(\t, \d(x)) \in B\} \in \mc{A}$ for all $B \in \mc{B}(\real)$, where $\mc{A}$ is our event space (set of all subsets of $\mc{X}$ which can be measured by $P_\t$), and $\mc{B}(\real)$ is the <a href="https://mathworld.wolfram.com/BorelSet.html">Borel $\sigma$-algebra</a> over the reals, which is a standard definition of measurable sets of reals (unions and intersections of closed and open intervals are measurable). Presumably $\d$ is non-randomized because it only depends on the ground truth $x$.</p>
<p>The <strong>risk function</strong> of estimator $\d$ is the expected loss:<br />
<script type="math/tex">R(\t, \d) = \E_{x \sim X}\left[L(\t, \d(x)) \mid \t\right] = \int_\mc{X} L(\t, \d(x))P_\t(x) \D{x}</script></p>
<p>A <strong>randomized estimator</strong> is a function $\d : \mc{X} \times [0, 1] \to G$ s.t. $(x, u) \mapsto L(\t, \d(x, u))$ is a measurable function (of $x$ and $u$) for all $\t \in \T$. It is just like a non-randomized estimator, except it receives noise from $U \sim \mathrm{uniform}([0, 1])$ as input. Non-randomized estimators are a special case (they ignore the random input). Conversely, a randomized estimator can be viewed as a distribution over non-randomized estimators (which are parametrized by $u \in [0, 1]$).</p>
<p>The risk function then integrates over $u$:<br />
<script type="math/tex">R(\t, \d) = \E_{x \sim X, u \sim U}\left[L(\t, \d(x, u)) \mid \t\right] = \int_0^1 \int_\mc{X} L(\t, \d(x, u))P_\t(x) \D{x} \D{u}</script></p>
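<p>A quick Monte Carlo sketch of this risk computation (the model, loss, and estimator below are arbitrary choices of mine):</p>

```python
import random

random.seed(0)
theta = 1.0  # fixed parameter; the model is X ~ Normal(theta, 1)

def loss(t, g):
    """Squared-error loss."""
    return (t - g) ** 2

def d(x, u):
    """A randomized estimator: perturbs x with the uniform noise u."""
    return x + 0.2 * (u - 0.5)

# R(theta, d) = E_{x,u}[L(theta, d(x, u))] ~ Var(X) + Var(noise) ~ 1.003
n = 100_000
risk = sum(loss(theta, d(random.gauss(theta, 1.0), random.random()))
           for _ in range(n)) / n
assert 0.95 < risk < 1.05
```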
<p>An estimator $\d_1$ <strong>dominates</strong> another estimator $\d_2$ iff<br />
\begin{align}
\fa \t \in \T,\ R(\t, \d_1) \leq R(\t, \d_2)\,, \\
\ex \t \in \T,\ R(\t, \d_1) < R(\t, \d_2)\,.
\end{align}<br />
$\d_1$ must be at least as good (same risk or less) as $\d_2$ in every situation, and must be strictly better (less risk) in at least one situation, for the descriptor <em>dominance</em> to apply.</p>
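<p>Over a finite parameter space, dominance can be checked directly from risk vectors (the numbers below are illustrative, not from the text):</p>

```python
# Summarize each estimator by its risk vector (R(theta, d) for theta in Theta).
risk = {
    "d1": [0.2, 0.5, 0.3],
    "d2": [0.2, 0.6, 0.3],  # weakly worse everywhere, strictly worse once
    "d3": [0.1, 0.9, 0.2],
}

def dominates(r1, r2):
    """r1 dominates r2: <= everywhere and < somewhere."""
    return (all(a <= b for a, b in zip(r1, r2))
            and any(a < b for a, b in zip(r1, r2)))

assert dominates(risk["d1"], risk["d2"])      # so d2 is inadmissible
assert not dominates(risk["d1"], risk["d3"])  # d1 and d3 are incomparable
assert not dominates(risk["d3"], risk["d1"])
```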
<p>An estimator $\d$ is <strong>admissible</strong> if it is not dominated by any estimator.<br />
Admissibility does not mean an estimator is any good; however, any inadmissible estimator can be automatically ruled out.</p>
<p>Let $\mc{D}$ be the set of all randomized estimators.<br />
A <strong>class</strong> (subset) of estimators $\mc{C} \subset \mc{D}$ is <strong>complete</strong> iff $\fa \d' \in \mc{C}^c,\ \ex \d \in \mc{C}$ that dominates $\d'$.<br />
Here $(\cdot)^c$ is the complement operator, i.e. $\mc{C}^c = \{\d' \in \mc{D} : \d' \notin \mc{C}\}$.</p>
<p>Let $\p$ be a probability measure on $\T$ and $\d$ be an estimator (from here on it does not matter whether $\d$ is randomized or not, because the risk integrates over the arguments of $\d$).</p>
<p>The <strong>Bayes risk</strong> of $\d$ w.r.t. $\p$ is</p>
<script type="math/tex; mode=display">R(\p, \d) = \E_{\p(\t)}[R(\t, \d)] = \int_\T R(\t, \d) \p(\t) \D{\t}\,.</script>
<p>This is the expected risk w.r.t. $\p(\t)$, which is called our <strong>prior</strong>.</p>
<p>Bayes risk allows us to compare estimators by comparing numbers rather than functions, but now we have a new problem, which is that we have to choose a prior.</p>
<p>$\d$ is a <strong>Bayes estimator</strong> w.r.t. $\p$ iff</p>
<script type="math/tex; mode=display">R(\p, \d) \leq R(\p, \d'),\ \fa \d' \in \mc{D}\,.</script>
<p>Note that a Bayes estimator $\d$ can be dominated if $\p$ assigns measure 0 to some subsets of $\T$. It is easy to show that if $\d$ is dominated by $\d'$, then $\d'$ is also Bayes and $R(\p, \d) = R(\p, \d')$.</p>
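<p>In the finite-parameter setting, the Bayes risk is just a prior-weighted average of the risk function, and a Bayes estimator minimizes it over the candidates; a small sketch with made-up numbers:</p>

```python
prior = [0.5, 0.25, 0.25]  # a prior over three parameter values
risk = {                   # risk vectors (R(theta, d) for theta in Theta)
    "d1": [0.2, 0.5, 0.3],
    "d3": [0.1, 0.9, 0.2],
}

def bayes_risk(pi, r):
    """Expected risk under the prior pi."""
    return sum(p * x for p, x in zip(pi, r))

best = min(risk, key=lambda name: bayes_risk(prior, risk[name]))
assert best == "d1"  # Bayes risk 0.3 vs 0.325 for d3
```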
<p><strong>Theorem 1</strong> (Bayes $\implies$ admissible): If prior $\pi(\theta)$ has exactly one Bayes estimator, then that estimator is admissible.</p>
<blockquote>
<p>Thus the only thing that can dominate a Bayes estimator is another Bayes estimator. If there is only one Bayes estimator for a given prior, then it must be admissible.</p>
</blockquote>
<p><strong>Question:</strong> Under what conditions is there more than one Bayes estimator for a given prior?</p>
<p><strong>Theorem 3</strong> (Bayes $\implies$ admissible):<br />
<img src="https://i.imgur.com/9H363wf.png" alt="" /></p>
<h2 id="complete-class-theorem-i"><a class="header-anchor" href="#complete-class-theorem-i">Complete class theorem I</a></h2>
<p>(admissible $\implies$ Bayes)</p>
<blockquote>
<p>If $\d$ is admissible and $\T$ is finite, then $\d$ is Bayes (w.r.t some prior distribution).</p>
</blockquote>
<h2 id="complete-class-theorem-ii"><a class="header-anchor" href="#complete-class-theorem-ii">Complete class theorem II</a></h2>
<p>Class of Bayes estimators is complete</p>
<blockquote>
<p>If $\T$ is finite and $\mc{S}$ is closed then the class of Bayes rules is complete and the admissible rules form a minimal complete class.</p>
</blockquote>
<h2 id="euclidean-parameter-spaces"><a class="header-anchor" href="#euclidean-parameter-spaces">Euclidean parameter spaces</a></h2>
<p>TODO: generalized Bayes estimator<br />
TODO: limiting Bayes estimator</p>
<p>Bayes $\implies$ Admissible<br />
<img src="https://i.imgur.com/BZJqBWn.png" alt="" /></p>
<p>Admissible $\implies$ Bayes<br />
<img src="https://i.imgur.com/CGYqfbF.png" alt="" /></p>
<h2 id="complete-class-theorem-iii"><a class="header-anchor" href="#complete-class-theorem-iii">Complete class theorem III</a></h2>
<p>Class of Bayes estimators is complete<br />
<img src="https://i.imgur.com/CFCCIMO.png" alt="" /></p>
<h1 id="interpretation-and-implications"><a class="header-anchor" href="#interpretation-and-implications">Interpretation and implications</a></h1>
<p><strong>Question:</strong> What is the connection between <a href="https://en.wikipedia.org/wiki/Bayes_estimator#Definition">Bayesian estimators</a> and Bayesian posteriors?</p>
<p>Answer: the Bayes estimator is the posterior mean for L2 loss and the posterior median for L1 loss. [credit: John Chung]</p>
<p><strong>Theorem</strong>:<br />
If $\p(\t)$ is a given prior, then a corresponding Bayes estimator $\d$ is</p>
<script type="math/tex; mode=display">\d(x) = \argmin_{\hat{\t}} \E_{\t \sim p_\p(\t \mid x)}\left[L(\t, \hat{\t})\right] = \argmin_{\hat{\t}} \int_{\T} L(\t, \hat{\t}) p_\pi(\t \mid x) \D{\t}\,,</script>
<p>where the posterior is $p_\pi(\t \mid x) = P_\t(x)\pi(\t)/p_\p(x)$ and marginal data distribution is $p_\p(x) = \int P_\t(x)\pi(\t) \D{\t}$.<br />
In words, the Bayes estimator minimizes the posterior expected loss for every $x$.</p>
<p><em>Proof:</em><br />
(This proof is my own)</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\min_{\hat{\d}}R(\p, \hat{\d}) &= \min_{\hat{\d}} \int_\mc{X}\int_\T L(\t, \hat{\d}(x)) P_\t(x)\p(\t) \D{\t}\D{x} \\
&= \min_{\hat{\d}} \int_\mc{X}\left(\int_\T L(\t, \hat{\d}(x)) p_\pi(\t \mid x) \D{\t}\right) p_\p(x) \D{x} \\
&= \int_\mc{X}\left(\min_{\hat{\d}_x} \int_\T L(\t, \hat{\d}_x) p_\pi(\t \mid x) \D{\t}\right) p_\p(x) \D{x} \\
&= \E_{x \sim p_\p(x)}\left[\min_{\hat{\d}_x} \int_\T L(\t, \hat{\d}_x) p_\pi(\t \mid x) \D{\t}\right] \\
&= \E_{x \sim p_\p(x)}\left[\min_{\hat{\d}_x} \E_{\t \sim p_\p(\t \mid x)}\left[L(\t, \hat{\d}_x)\right] \right]\,.
\end{align} %]]></script>
<p>So the minimum Bayes risk is the expectation (w.r.t. the data) of the minimum “posterior expected loss”.</p>
<p>Thus if we define $\d(x) := \d^*_x,\ \forall x \in \mc{X}$, where</p>
<script type="math/tex; mode=display">\d^*_x = \argmin_{\hat{\d}_x} \E_{\t \sim p_\p(\t \mid x)}\left[L(\t, \hat{\d}_x)\right]\,,</script>
<p>then $\d = \argmin_{\hat{\d}} R(\p, \hat{\d})\,.$<br />
<em>QED</em></p>
<p>The general form</p>
<script type="math/tex; mode=display">b^* = \argmin_b \E_A \left[L(A, b)\right]</script>
<p>is called the <em>systematic part</em> of random variable $A$. When $L$ is squared difference (i.e. $\ell^2$), then $b^*$ is the mean of $A$. When $L$ is absolute difference (i.e. $\ell^1$), then $b^*$ is the median of $A$. When $L$ is the indicator loss (i.e. $\ell^0$), then $b^*$ is the mode of $A$. There are also losses corresponding to other distribution statistics like quantile loss. See the definition of <em>systematic part</em> in my post on the <a href="http://zhat.io/articles/bias-variance#bias-variance-decomposition-for-any-loss">generalized bias-variance decomposition</a>.</p>
<p>$\d$ will be the mean, median, or mode of the posterior for $\ell^2$, $\ell^1$, $\ell^0$ losses respectively. To avoid confusion, here it is stated explicitly:</p>
<p>If $L(\t, \hat{\t}) = (\t - \hat{\t})^2$, then</p>
<script type="math/tex; mode=display">\d(x) = \mathrm{Mean}_{\t \sim p_\p(\t \mid x)}\left[\t\right] = \E_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,.</script>
<p>If $L(\t, \hat{\t}) = \lvert\t - \hat{\t}\rvert$, then</p>
<script type="math/tex; mode=display">\d(x) = \mathrm{Median}_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,.</script>
<p>If $L(\t, \hat{\t}) = (\t - \hat{\t})^0$, then</p>
<script type="math/tex; mode=display">\d(x) = \mathrm{Mode}_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,.</script>
<p>If <script type="math/tex">% <![CDATA[
L(\t, \hat{\t}) = \begin{cases}\tau\cdot(\t - \hat{\t}) & \t - \hat{\t} \geq 0 \\ (\tau-1)\cdot(\t - \hat{\t}) & \mathrm{otherwise}\end{cases}, %]]></script> then</p>
<script type="math/tex; mode=display">\d(x) = \mathrm{Quantile}\{\tau\}_{\t \sim p_\p(\t \mid x)}\left[\t\right]\,,</script>
<p>and $\tau=\frac{1}{2}$ gives the median.</p>
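<p>These claims can be checked numerically (my own sketch, using draws from a stand-in posterior and a grid search over point estimates):</p>

```python
import random
import statistics

random.seed(0)
theta = [random.gauss(2.0, 1.0) for _ in range(1000)]  # stand-in posterior draws
grid = [0.01 * i for i in range(401)]                  # candidates 0.0 .. 4.0

def expected_loss(loss, b):
    """Posterior expected loss of the point estimate b."""
    return sum(loss(t, b) for t in theta) / len(theta)

b_l2 = min(grid, key=lambda b: expected_loss(lambda t, g: (t - g) ** 2, b))
b_l1 = min(grid, key=lambda b: expected_loss(lambda t, g: abs(t - g), b))

assert abs(b_l2 - statistics.mean(theta)) < 0.02    # l2 minimizer = mean
assert abs(b_l1 - statistics.median(theta)) < 0.02  # l1 minimizer = median
```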
<h2 id="discussion"><a class="header-anchor" href="#discussion">Discussion</a></h2>
<p>Do the complete class theorems prove the necessity of Bayesian epistemology (assuming you wish to be rational)?</p>
<ol>
<li>Complete class theorems assume the data has a well defined probability distribution. If we use CCTs to justify Bayesian epistemology (i.e. the use of probability for outcomes which do not repeat, have no frequency of occurrence, and have no well defined objective notion of probability), then the argument is circular: it depends on frequentist probability being a thing, while Bayesian probability is enticing over frequentist probability precisely because the latter only makes sense in limited circumstances where events have well defined frequencies of occurrence.</li>
<li>Enforcing admissibility may be inconsequential. This framework is silent on how to define the hypothesis space and choose a prior, which matters quite a lot for one-shot prediction but not in the infinite data limit. In practice we do not care about the infinite data limit, and picking the wrong hypothesis space or a bad prior may impact your utility much more than being admissible does.</li>
<li>The result above shows that only the <em>systematic part</em> (e.g. mean) of the posterior matters for minimizing Bayes risk.</li>
</ol>
Thu, 11 Jun 2020 00:00:00 -0700
pragmanym.github.io/zhat/articles/notes-complete-class-theorems
Primer to Shannon's Information Theory<p>Shannon’s theory of information is usually just called <em>information theory</em>, but is it deserving of that title? Does Shannon’s theory completely capture every possible meaning of the word <em>information</em>? In the grand quests of creating AI and understanding the rules of the universe (i.e. a grand unified theory), information may be key. Intelligent agents search for information and manipulate it. Particle interactions in physics may be viewed as information transfer. The physics of information may be key to interpreting quantum mechanics and resolving the measurement problem.</p>
<p>If you endeavor to answer these hard questions, it is prudent to understand existing so-called theories of information so you can evaluate whether they are powerful enough and to take inspiration from them.</p>
<p>Shannon’s information theory is a hard nut to crack. Hopefully this primer gets you far enough along to be able to read a textbook like <em>Elements of Information Theory</em>. At the end I start to explore the question of whether Shannon’s theory is a complete theory of information, and where it might be lacking.</p>
<p>This post is long. That is because Shannon’s information theory is a framework of thought. That framework has a vocabulary which is needed to appreciate the whole. I attempt to gradually build up this vocabulary, stopping along the way to build intuition. With this vocabulary in hand, you will be ready to explore the big questions at the end of this post.</p>
<!--more-->
<ul class="toc" id="markdown-toc">
<li><a href="#self-information" id="markdown-toc-self-information">Self-Information</a> <ul>
<li><a href="#regarding-notation" id="markdown-toc-regarding-notation">Regarding notation</a></li>
<li><a href="#bits-not-bits" id="markdown-toc-bits-not-bits"><em>Bits</em>, not bits</a> <ul>
<li><a href="#recap" id="markdown-toc-recap">Recap</a></li>
</ul>
</li>
<li><a href="#stepping-back" id="markdown-toc-stepping-back">Stepping back</a></li>
</ul>
</li>
<li><a href="#entropy" id="markdown-toc-entropy">Entropy</a> <ul>
<li><a href="#regarding-notation-1" id="markdown-toc-regarding-notation-1">Regarding notation</a></li>
<li><a href="#conditional-entropy" id="markdown-toc-conditional-entropy">Conditional Entropy</a></li>
</ul>
</li>
<li><a href="#mutual-information" id="markdown-toc-mutual-information">Mutual Information</a> <ul>
<li><a href="#pointwise-mutual-information" id="markdown-toc-pointwise-mutual-information">Pointwise Mutual Information</a></li>
<li><a href="#properties-of-pmi" id="markdown-toc-properties-of-pmi">Properties of PMI</a> <ul>
<li><a href="#special-values" id="markdown-toc-special-values">Special Values</a></li>
</ul>
</li>
<li><a href="#expected-mutual-information" id="markdown-toc-expected-mutual-information">Expected Mutual Information</a> <ul>
<li><a href="#channel-capacity" id="markdown-toc-channel-capacity">Channel capacity</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#shannon-information-for-continuous-distributions" id="markdown-toc-shannon-information-for-continuous-distributions">Shannon Information For Continuous Distributions</a> <ul>
<li><a href="#proof-that-mi-is-fininte-for-continuous-distributions" id="markdown-toc-proof-that-mi-is-fininte-for-continuous-distributions">Proof that MI is finite for continuous distributions</a></li>
</ul>
</li>
<li><a href="#problems-with-shannon-information" id="markdown-toc-problems-with-shannon-information">Problems With Shannon Information</a> <ul>
<li><a href="#1-tv-static-problem" id="markdown-toc-1-tv-static-problem">1. TV Static Problem</a></li>
<li><a href="#2-shannon-information-is-blind-to-scrambling" id="markdown-toc-2-shannon-information-is-blind-to-scrambling">2. Shannon Information is Blind to Scrambling</a></li>
<li><a href="#3-deterministic-information" id="markdown-toc-3-deterministic-information">3. Deterministic information</a></li>
<li><a href="#4-if-the-universe-is-continuous-everything-contains-infinite-information" id="markdown-toc-4-if-the-universe-is-continuous-everything-contains-infinite-information">4. If the universe is continuous everything contains infinite information</a></li>
<li><a href="#5-shannon-information-ignores-the-meaning-of-messages" id="markdown-toc-5-shannon-information-ignores-the-meaning-of-messages">5. Shannon information ignores the meaning of messages</a></li>
<li><a href="#6-probability-distributions-are-not-objective" id="markdown-toc-6-probability-distributions-are-not-objective">6. Probability distributions are not objective</a></li>
</ul>
</li>
<li><a href="#appendix" id="markdown-toc-appendix">Appendix</a> <ul>
<li><a href="#properties-of-conditional-entropy" id="markdown-toc-properties-of-conditional-entropy">Properties of Conditional Entropy</a></li>
<li><a href="#bayes-rule" id="markdown-toc-bayes-rule">Bayes’ Rule</a></li>
<li><a href="#cross-entropy-and-kl-divergence" id="markdown-toc-cross-entropy-and-kl-divergence">Cross Entropy and KL-Divergence</a></li>
</ul>
</li>
<li><a href="#acknowledgments" id="markdown-toc-acknowledgments">Acknowledgments</a></li>
</ul>
<h1 id="self-information"><a class="header-anchor" href="#self-information">Self-Information</a></h1>
<script type="math/tex; mode=display">\newcommand{\and}{\wedge}
\newcommand{\or}{\vee}
\newcommand{\E}{\mathbb{E}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\bm}{\boldsymbol}
\newcommand{\rX}{\bm{X}}
\newcommand{\rY}{\bm{Y}}
\newcommand{\rZ}{\bm{Z}}
\newcommand{\rC}{\bm{C}}
\newcommand{\diff}[1]{\mathop{\mathrm{d}#1}}
\newcommand{\kl}[2]{K\left[#1\;\middle\|\;#2\right]}</script>
<p>I’m going to use non-standard notation which I believe avoids some confusion and ambiguities.</p>
<p>Shannon defines information indirectly by defining the quantity of information contained in a message/event. This is analogous to how physics defines mass and energy in terms of their quantities.</p>
<p>Let’s define $x$ to be any mathematical object from a set of possibilities $X$. We typically call $x$ a <em>message</em>, but it can also be referred to as an <em>outcome</em>, <em>state</em>, or <em>event</em> depending on the context.</p>
<p>Define <span class="marginnote-outer"><span class="marginnote-ref">$h(x)$</span><label for="ad226219b7413f6c19c43a404b5a9d36708aa40e" class="margin-toggle"> ⊕</label><input type="checkbox" id="ad226219b7413f6c19c43a404b5a9d36708aa40e" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The standard notation is $I(x)$, but this is easy to confuse with mutual information <a href="#expected-mutual-information">below</a>.</span></span></span> to be the <strong>self-information</strong> of $x$, which is the amount of information gained by <span class="marginnote-outer"><span class="marginnote-ref">receiving</span><label for="5aecb422be00894a86508ac09f6366924a33ad33" class="margin-toggle"> ⊕</label><input type="checkbox" id="5aecb422be00894a86508ac09f6366924a33ad33" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Receiving here can mean, (1) sampling an outcome from a distribution, (2) storing in memory <em>one</em> of its possible states, or (3) viewing with the mind or knowing to be the case one out of the possible cases.</span></span></span> $x$. We will see how a natural definition of $h(x)$ arises from combining these two principles:</p>
<ol>
<li>Quantity of information is a function only of probability of occurrence.</li>
<li>Quantity of information acts like quantity of bits when applied to computer memory.</li>
</ol>
<p>Principle (1) constrains $h$ to the form $h(x) = f(p_X(x))$, and we do not yet know what $f$ should be.</p>
<p>To see why, let’s unpack (1): it implies that messages/events must always come from a distribution, which is what provides the probabilities. Say you receive a message $x$ sampled from probability distribution (function) $p_X : X \to [0, 1]$ over a <span class="marginnote-outer"><span class="marginnote-ref">discrete</span><label for="36d6168a1655cd352523cce0e3a2ac045fc1621d" class="margin-toggle"> ⊕</label><input type="checkbox" id="36d6168a1655cd352523cce0e3a2ac045fc1621d" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Assume all distributions are discrete until the <a href="#shannon-information-for-continuous-distributions">continuous section</a>.</span></span></span> set $X$. Then (1) is saying that $h$ should only <em>look at</em> the probability $p_X(x)$ and not $x$ itself. This is a reasonable requirement, since we want to define information irrespective of the kind of object that $x$ is.</p>
<p>Principle (2) constrains what $f$ should be: <span class="marginnote-outer"><span class="marginnote-ref">$f(p) = -\log_2 p$</span><label for="ff9302aebed53145e61c362eae63cb172e24e20a" class="margin-toggle"> ⊕</label><input type="checkbox" id="ff9302aebed53145e61c362eae63cb172e24e20a" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Though we assume a uniform discrete probability distribution to derive this, we will use this definition of $f$ to generalize the same logic to all probability distributions, which is how we arrive at the final definition of $h$.</span></span></span>, where $p \in [0, 1]$ is a probability value.</p>
<p>To understand (2), consider computer memory. With $N$ bits of memory there are $2^N$ distinguishable states, and only one is the case at one time. Increasing the number of bits exponentially increases the <span class="marginnote-outer"><span class="marginnote-ref">number of counterfactual states</span><label for="b6ce5dab64f8e17412d6f1eecdbc565ff4506d08" class="margin-toggle"> ⊕</label><input type="checkbox" id="b6ce5dab64f8e17412d6f1eecdbc565ff4506d08" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Number of states you could have stored but didn’t.</span></span></span>. In memory terms, receiving a “message” of $N$ bits of memory simply means finding out the state those bits are in. Attaching equal weight to each possibility (i.e. memory state) gives us a <span class="marginnote-outer"><span class="marginnote-ref">special case of the probability distribution we used above</span><label for="89a786f47bc58cb026e252161192577d84784488" class="margin-toggle"> ⊕</label><input type="checkbox" id="89a786f47bc58cb026e252161192577d84784488" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">To see the equivalence between these two notions of information, i.e. more rare equals more informative vs number of counterfactual states (or memory capacity), it is useful to think of the probability distribution as a weighted possibility space, and of the memory states as possibilities.</span></span></span> to define $h$: the <em>uniform</em> distribution, where there are $2^N$ possible states and the weight of single state is $\frac{1}{2^N} = 2^{-N}$.</p>
<!--We intuitively think of the quantity of information stored in memory as the number of bits it has. We have $f(p)$ return $N$ when every possible state has an equal weight of $p=2^{-N}$, because we assume a uniform distribution over $2^N$ states, which is equivalent to how we conceive of computer memory with $N$ bits.-->
<p>Composing $f(p) = -\log_2 p$ with $h(x) = f(p_X(x))$ gives us the full definition of self-information:</p>
<script type="math/tex; mode=display">h(x) = -\log_2 p_X(x)\,.</script>
<!--*Now the magic happens.* Given that we defined self-information as $h(x) = f(p_X(x))$, and given that we've pinned down $f(p) = -\log_2 p$ for a special case, we've done all the work we need to do to define $h(x)$ for all probability distributions, because nothing in our definition of $f(p)$ actually depends on the particular distribution we used.-->
<h2 id="regarding-notation"><a class="header-anchor" href="#regarding-notation">Regarding notation</a></h2>
<p>From here on, I will use $h(x)$ as a function of message $x$, without specifying the type of $x$. It can be anything: a number, a binary sequence, a string, etc. $f(p)$ is a function of probabilities, rather than messages. So:</p>
<p style="text-align: center;">$h : X \to \R^+$ maps from messages to information,<br />
and $f : [0, 1] \to \R^+$ maps from probabilities to information;</p>
<p>and keep in mind that $h(x) = f(p_X(x))$, so <span class="marginnote-outer"><span class="marginnote-ref">$h$ implicitly assumes</span><label for="ac81d19ce662d5360601c37e5741410b398a15da" class="margin-toggle"> ⊕</label><input type="checkbox" id="ac81d19ce662d5360601c37e5741410b398a15da" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">I may sometimes write $h_X$ to make explicit the dependency of $h$ on $p_X$.</span></span></span> we have a probability distribution over $x$ defined somewhere.</p>
<p>In some places below <span class="marginnote-outer"><span class="marginnote-ref">I’ve written equations in terms of $f$ rather than $h$</span><label for="407f53214d4b89bb8df9485bfcc2f05a23b9bf92" class="margin-toggle"> ⊕</label><input type="checkbox" id="407f53214d4b89bb8df9485bfcc2f05a23b9bf92" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Allow me the slight verbosity now, as you’d probably have had to pore over verbose definitions if I hadn’t.</span></span></span> where I felt it would allow you to grasp things just by looking at the shape of the equation.</p>
<h2 id="bits-not-bits"><a class="header-anchor" href="#bits-not-bits"><em>Bits</em>, not bits</a></h2>
<p>You can get through the above exposition by thinking in terms of computer bits. Now we part ways from the computer bits intuition. Note that this departure occurs when $p_X(x)$ is not a (negative) integer power of two. $h(x)$ will be non-integer, and very likely irrational. What does it mean to have a fraction of a bit? From here on out, it’s better to think of <em>bits</em> as a unit quantifying information, like <em>Joules</em> for energy or <em>kilogram</em> for mass, rather than a count of <span class="marginnote-outer"><span class="marginnote-ref">physical objects</span><label for="9e6ffa3d72e725743de4a6c7caafdd2568ea4337" class="margin-toggle"> ⊕</label><input type="checkbox" id="9e6ffa3d72e725743de4a6c7caafdd2568ea4337" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Specifically a physical medium that stores two distinguishable states, usually labeled “0” and “1”.</span></span></span>. We will continue to call the unit of $h(x)$ a <em>bit</em> out of convention. Like the kilogram and Joule, this unit can be regarded as undefined in absolute terms, but its usage gives it semantic meaning.</p>
<p>So then how is $h$ to be understood? What is the intuition behind this quantity? In short, Shannon bits are an <a href="https://en.wikipedia.org/wiki/Analytic_continuation">analytic continuation</a> of computer bits. Just like how the <a href="https://en.wikipedia.org/wiki/Gamma_function">gamma function</a> extends factorial to continuous values, Shannon bits extend the computer bit to <strong>non-uniform distributions</strong> over a <strong>non-power-of-2</strong> number of counterfactuals. Let me explain these two phrases:</p>
<ul>
<li><strong>non-power-of-2</strong>: We have memory that can store one out of $M$ possibilities, where $M \neq 2^N$. For example, I draw a card from a deck of 52. That card holds $-\log_2 \frac{1}{52} = \log_2 52 \approx 5.70044\ldots$ bits of information. A fractional bit can represent a non-power-of-2 possibility space, and quantifies the log-base conversion factor into base $M$. In this case $-\log_{52} x = -\frac{\log_2 x}{\log_2 52}$. Note that it is actually common to use units of information other than base-2. For example a <a href="https://en.wikipedia.org/wiki/Nat_(unit)"><em>nat</em></a> is log-base-e, a <a href="https://en.wikipedia.org/wiki/Ternary_numeral_system"><em>trit</em></a> is base-3, and <a href="https://en.wikipedia.org/wiki/Hartley_(unit)"><em>dit</em> or <em>ban</em></a> is base-10.</li>
<li><strong>non-uniform distributions</strong>: Using the deck of cards example, let’s say we draw from a sub-deck containing all cards with the hearts suit. We’ve reduced the possibility space to a subset of a super-space, in this case size 13, and have reduced the information contained in a given card, $-\log_2 \frac{1}{13} \approx 3.70044\ldots$ bits. You can think of this as assigning a weight to each card: 0 for cards we exclude, and $\frac{1}{13}$ for cards we include. If we make the non-zero weights non-uniform, we now have an interpretational issue: what is the physical meaning of these weights? Thinking of this weight as a probability of occurrence is one way to recover physical meaning, but this is <span class="marginnote-outer"><span class="marginnote-ref">not a requirement</span><label for="02ee609808a0f8d62fa190ce247f9b35f70dd990" class="margin-toggle"> ⊕</label><input type="checkbox" id="02ee609808a0f8d62fa190ce247f9b35f70dd990" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">And probability may not even be an objective property of physical systems in general.</span></span></span>. However, I will <span class="marginnote-outer"><span class="marginnote-ref">call these weights probabilities</span><label for="03b6daf7c965c2773f771111cb006d8764629617" class="margin-toggle"> ⊕</label><input type="checkbox" id="03b6daf7c965c2773f771111cb006d8764629617" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The reason we wish to hold sum of weights fixed to 1 is so that we can consider the information contained in compound events which are sets of elementary events. In other words, think of the card drawn from the sub-deck of 13 as a card from <em>any suit</em>, i.e. the set of 4 cards with the same number. The card represents an equivalence class over card number.</span></span></span>, and the weighted-possibility-spaces distributions, as that is the convention. 
But keep in mind that these weights do not necessarily represent frequencies of occurrence nor uncertainties. The meaning of probability itself is a subject of debate.</li>
</ul>
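<p>The deck-of-cards numbers above are easy to reproduce. A quick sketch (mine, not part of the original derivation):</p>

```python
import math

def self_information(p):
    """h in bits for an outcome of probability p: f(p) = -log2(p)."""
    return -math.log2(p)

full_deck = self_information(1 / 52)    # ~5.70044 bits, as computed above
hearts_only = self_information(1 / 13)  # ~3.70044 bits

# Restricting 52 equally weighted cards to 13 removes exactly log2(4) = 2 bits.
assert math.isclose(full_deck - hearts_only, 2.0)

# Change of log base gives the other units: nats (base e) and dits (base 10).
nats = full_deck * math.log(2)    # = -ln(1/52)
dits = full_deck * math.log10(2)  # = -log10(1/52)
```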
<p>Let’s examine some of the properties of $h$ to build further intuition.</p>
<p>First notice that $f(1) = 0$. An event with a probability of 1 contains no information. If $x$ is certain to occur, $x$ is uninformative. Likewise, $f(p) \to \infty$ as $p \to 0$. If $x$ is impossible, it contains infinite information! In general, $h(x)$ goes up as $p_X(x)$ goes down. The less likely an event, the more information it contains. Hopefully this sounds to you like a reasonable property of information.</p>
<p>Next, we can be more specific about how $h$ goes up as $p_X$ goes down. Recall that $f(p) = -\log_2 p$ and $h(x) = f(p_X(x))$, then</p>
<script type="math/tex; mode=display">f(p/2) = f(p) + 1\,.</script>
<p>If we halve the probability of an event, we add one bit of information to it. That is a nice way to think about our new unit of information: the <em>bit</em> is a halving of probability. Other units can be defined in this way, e.g. the <em>nat</em> is a division of probability by Euler’s number $e$, the <em>trit</em> is a thirding of probability, etc.</p>
<p>Finally, notice that $f(pq) = f(p) + f(q)$. Or to write it another way: $h(x \and y) = h(x) + h(y)$ iff $x$ and $y$ are independent events, because</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(x \and y) &= -\log_2 p_{X,Y}(x \and y) \\
&= -\log_2 p_X(x)\cdot p_Y(y) \\
&= -\log_2 p_X(x) - \log_2 p_Y(y)\,,
\end{align} %]]></script>
<p>where $x \and y$ indicates the composite event <span class="marginnote-outer"><span class="marginnote-ref">“$x$ and $y$”</span><label for="03c34b59dc7fea1cd51e8dbb51bdcfc9754145fd" class="margin-toggle"> ⊕</label><input type="checkbox" id="03c34b59dc7fea1cd51e8dbb51bdcfc9754145fd" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We could either think of $x$ and $y$ as composite events themselves from the same distribution, i.e. $x$ and $y$ are sets of <a href="https://en.wikipedia.org/wiki/Elementary_event">elementary events</a>, or as elementary events from two different random variables which have a joint distribution, i.e, $(x, y) \sim (\rX, \rY)$. I will consider the latter case from here on out, because it is conceptually simpler.</span></span></span>. Hopefully this is also intuitive. If two events are dependent, i.e. they causally affect each other, it makes sense that they might contain redundant information, meaning that you can predict part of one from the other, and so their combined information is less than the sum of their individual information. You may be surprised to learn that the opposite can also be true. The combined information of two events can be greater than the sum of their individual information! This is called <a href="https://en.wikipedia.org/wiki/Interaction_information#Example_of_negative_interaction_information"><em>synergy</em></a>. More on that in the <a href="#pointwise-mutual-information">pointwise mutual information</a> section.</p>
<p>In short, we can derive $f(p) = -\log_2 p$ from (1) additivity of information, $f(pq) = f(p) + f(q)$, and (2) a choice of unit, $f(½) = 1$. <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)#Rationale">Proof</a>.</p>
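<p>These defining properties are quick to sanity-check numerically; a throwaway sketch (the probabilities are arbitrary):</p>

```python
import math

f = lambda p: -math.log2(p)  # self-information of a probability value, in bits

p, q = 0.3, 0.12  # arbitrary probabilities for illustration
assert math.isclose(f(p / 2), f(p) + 1)     # halving probability adds one bit
assert math.isclose(f(p * q), f(p) + f(q))  # additivity for independent events
assert f(1.0) == 0.0                        # a certain event is uninformative
assert f(0.5) == 1.0                        # the choice of unit: f(1/2) = 1 bit
```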
<h3 id="recap"><a class="header-anchor" href="#recap">Recap</a></h3>
<p>To make the full analogy: a weighting over possibilities is like a continuous relaxation of a set. An element is or is not in a set, while adding weights to elements (in a larger set) allows their membership to have degrees, i.e. the <em>“is element”</em> relation becomes a <span class="marginnote-outer"><span class="marginnote-ref">fuzzy value between 0 and 1</span><label for="9c3ecf3b9785719c56a55d9ff267adf0112b2c75" class="margin-toggle"> ⊕</label><input type="checkbox" id="9c3ecf3b9785719c56a55d9ff267adf0112b2c75" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">We recover regular sets by setting all weights to either 0 or uniform non-zero weights.</span></span></span>. With a weighted possibility space we have a lot more freedom to work with extra information beyond merely which possibilities are in the set. Probability distributions are more expressive than mere sets.</p>
<h2 id="stepping-back"><a class="header-anchor" href="#stepping-back">Stepping back</a></h2>
<p>The unit <em>bit</em> that we’ve defined is connected to computer bits only because they both convert multiplication to addition.</p>
<ul>
<li>Computer bits: $(2^N\cdot2^M)$ states $\Longrightarrow$ $(N+M)$ bits.</li>
<li>Shannon bits: $(p\cdot q)$ probability $\Longrightarrow$ $(-\log_2 p - \log_2 q)$ bits.</li>
</ul>
<p>The way I’ve motivated $h$ is a departure from Shannon’s original motivation for defining self-information, which was to describe the theoretically optimal lossless compression for messages being sent over a communication channel. Under this viewpoint, $h(x)$ quantifies the theoretical minimum length (in physical bits) needed to encode message $x$ in computer memory without loss of information: the asymptotic average bit-length for the optimal encoding of $x$ in an infinite sequence of messages drawn from $p_X$. Hence it makes sense for $h(x)$ to be a continuous value. For more details, see <a href="https://en.wikipedia.org/wiki/Arithmetic_coding#Connections_with_other_compression_methods">arithmetic coding</a>.</p>
<p>We are now flipping Shannon’s original motivation on its head, and using the theoretically optimal encoding length in bits as the definition of information content. In the following discussion, we don’t care how messages/events are actually represented physically. Our definition of information only cares about probability of occurrence, and is in fact <span class="marginnote-outer"><span class="marginnote-ref">blind to the contents of messages</span><label for="e4d483f3b81edaec6cd9e4f482319dd7556b1855" class="margin-toggle"> ⊕</label><input type="checkbox" id="e4d483f3b81edaec6cd9e4f482319dd7556b1855" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Something that could be seen as either a flaw or a virtue, which I discuss <a href="#5-shannon-information-ignores-the-meaning-of-messages">below</a>.</span></span></span>. The connection of probability to optimal physical encoding is one of the beautiful results that propelled Shannon’s framework into its lofty position as <em>information theory</em>. However, for our purposes, we simply care about defining quantity of information, and do not care at all about how best to compress or store data for practical purposes.</p>
<p>To be clear, when I talk about the self-information of a message, I am not saying anything about how the message is physically encoded or transmitted, and indeed it need not be encoded with an optimal number of computer bits. I am merely referring to a <span class="marginnote-outer"><span class="marginnote-ref">quantified</span><label for="a3caf3737e12ebd497b9d5b99c399b7cd6e84ec9" class="margin-toggle"> ⊕</label><input type="checkbox" id="a3caf3737e12ebd497b9d5b99c399b7cd6e84ec9" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Hopefully this quantity is objective and measurable in principle - something I discuss <a href="#6-probability-distributions-are-not-objective">below</a></span></span></span> property of the message, i.e. its information content. The number of computer bits a message is encoded with need not equal the <span class="marginnote-outer"><span class="marginnote-ref">number of Shannon bits it contains!</span><label for="a6072afeef1e2f90ee72e61877bd97c2b4450eeb" class="margin-toggle"> ⊕</label><input type="checkbox" id="a6072afeef1e2f90ee72e61877bd97c2b4450eeb" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In short, physical encoding length and probability of occurrence need not be linked.</span></span></span></p>
<h1 id="entropy"><a class="header-anchor" href="#entropy">Entropy</a></h1>
<p>In the last section I said that under the view of optimal lossless compression, $h(x)$ is the bit length of the optimal encoding for $x$ averaged over an infinite sample from random variable $\rX$, and <a href="https://en.wikipedia.org/wiki/Arithmetic_coding#Connections_with_other_compression_methods">arithmetic coding</a> can approach this limit. We could also consider the average bit length per message from $\rX$ (averaged across all messages). That is the <strong>entropy</strong> of random variable $\rX$, which is the expected self-information,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H[\rX] &= \E_{x\sim \rX}[h(x)] \\
&= \E_{x\sim \rX}[-\log_2\,p_X(x)]\,.
\end{align} %]]></script>
<p>In the quantifying information view, think of entropy $H[\rX]$ as the number of bits you expect to gain by observing an event sampled from $p_X(x)$. In that sense it is a measure of uncertainty, i.e. how much information I do not have, i.e. quantifying what is unknown.</p>
<p>Let’s build our intuition of entropy. A good way to view entropy is as a measure of how spread out a distribution is. Entropy is actually a type of <a href="https://en.wikipedia.org/wiki/Statistical_dispersion">statistical dispersion</a> of $p_X$, meaning you could use it as an <a href="http://zhat.io/articles/19/bias-variance#what-is-variance-anyway">alternative to statistical variance</a>.</p>
<figure><img src="/assets/posts/primer-shannon-information/bimodal.png" alt="" width="100%" /><figcaption></figcaption></figure>
<p>For example, a bi-modal distribution can have arbitrarily high variance by moving the modes far apart, but the overall spread-out-ness (entropy) will not necessarily change.</p>
<p>The more spread out a distribution is, the higher its entropy. For bounded <a href="https://en.wikipedia.org/wiki/Support_(mathematics)#Support_of_a_distribution">support</a>, the uniform distribution has highest entropy (<a href="https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Other_examples.">other max-entropy distributions</a>). The <span class="marginnote-outer"><span class="marginnote-ref">minimum possible entropy is 0</span><label for="61d3d043a10d74a21176f1255203a29c26b0f6f4" class="margin-toggle"> ⊕</label><input type="checkbox" id="61d3d043a10d74a21176f1255203a29c26b0f6f4" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Note that in the expectation, 0-probability outcomes have infinite self-information, so we have to use the convention that $p_X(x)\cdot h(x) = 0\cdot\infty = 0$.</span></span></span>, which indicates a deterministic distribution, i.e. $p_X(x) \in \{0, 1\}$ for all $x \in X$.</p>
<figure><img src="https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg" alt="Credit: <a href="https://en.wikipedia.org/wiki/Binary_entropy_function">https://en.wikipedia.org/wiki/Binary_entropy_function</a>" width="100%" /><figcaption>Credit: <a href="https://en.wikipedia.org/wiki/Binary_entropy_function">https://en.wikipedia.org/wiki/Binary_entropy_function</a></figcaption></figure>
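<p>To make the spread-out-ness intuition concrete, here is a small sketch (mine, with made-up four-outcome distributions) computing entropy directly from the definition:</p>

```python
import math

def entropy(p):
    """H[X] in bits, using the convention 0 * log2(0) = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]    # most spread out: 2 bits
peaked = [0.70, 0.10, 0.10, 0.10]     # strictly between 0 and 2 bits
deterministic = [1.0, 0.0, 0.0, 0.0]  # no uncertainty: 0 bits

assert entropy(uniform) == 2.0
assert entropy(deterministic) == 0.0
assert 0.0 < entropy(peaked) < 2.0
```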
<p>Though Shannon called his new quantity entropy, the connection to physical entropy is nontrivial; where a connection exists, it is somewhat coincidental. Apparently Shannon’s decision to call it entropy came from a suggestion by von Neumann at a party: <a href="http://www.eoht.info/page/Neumann-Shannon+anecdote">http://www.eoht.info/page/Neumann-Shannon+anecdote</a><br />
[credit: Mark Moon]</p>
<p>There are connections between information entropy and thermodynamic entropy (see <a href="https://plato.stanford.edu/entries/information-entropy/">https://plato.stanford.edu/entries/information-entropy/</a>), but I do not yet understand them well enough to give an overview here - perhaps in a future post. Some physicists consider information to have a physical nature, and even a <span class="marginnote-outer"><span class="marginnote-ref">conservation law</span><label for="4cd9a9099f83eeb0f5a534d111b0875861d2c3ec" class="margin-toggle"> ⊕</label><input type="checkbox" id="4cd9a9099f83eeb0f5a534d111b0875861d2c3ec" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">In the sense that requiring physics be time-symmetric is equivalent to requiring information to be conserved.</span></span></span>! Further reading: <a href="https://theoreticalminimum.com/courses/statistical-mechanics/2013/spring/lecture-1">The Theoretical Minimum - Entropy and conservation of information</a>, <a href="https://en.wikipedia.org/wiki/No-hiding_theorem">no-hiding theorem</a>.</p>
<p><strong>Question</strong>: Why expected self-information?<br />
We could have used median or something else. Expectation is a <span class="marginnote-outer"><span class="marginnote-ref">default go-to operation over distributions</span><label for="e3be61e72d21ae86483befb994084998b611b8df" class="margin-toggle"> ⊕</label><input type="checkbox" id="e3be61e72d21ae86483befb994084998b611b8df" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">See my previous post: <a href="http://zhat.io/articles/bias-variance#bias-variance-decomposition-for-any-loss">http://zhat.io/articles/bias-variance#bias-variance-decomposition-for-any-loss</a></span></span></span> because of its nice properties, but ultimately it is an arbitrary choice. However, as we will see, one huge benefit in our case is that expectation is linear.</p>
<h3 id="regarding-notation-1"><a class="header-anchor" href="#regarding-notation-1">Regarding notation</a></h3>
<p>From here on out, I will drop the subscript $X$ from $p_X(x)$ when $p(x)$ unambiguously refers to the probability of $x$. This is a common thing to do, but it can also lead to ambiguity if I want to write $p(0)$, the probability that $x$ is 0. A possible resolution is to use random variable notation, $p(\rX = 0)$, which I use in some places. The same issue arises for self-information, e.g. in the quantities $h(x), h(y), h(x\and y), h(y \mid x)$. I will add subscripts to $h$ when it would otherwise be ambiguous, for example $h_X(0), h_Y(0), h_{X,Y}(x\and y), h_{Y\mid X}(0 \mid 0)$.</p>
<h2 id="conditional-entropy"><a class="header-anchor" href="#conditional-entropy">Conditional Entropy</a></h2>
<p>Conditional self-information, defined as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(y \mid x) &= -\log_2\,p(y \mid x)\\
&= -\log_2(p(y \and x) / p(x)) \\
&= h(x \and y) - h(x)\,,
\end{align} %]]></script>
<p>is the information you stand to gain by observing $y$ given that you already observed $x$. I let $x \and y$ denote the observation of $x$ and $y$ together (I could write $(x, y)$, but then $p((y, x))$ would look awkward).</p>
<p>If $x$ and $y$ are independent events, $h(y \mid x) = h(y)$. Otherwise, $h(y \mid x)$ can be greater or less than $h(y)$. It may seem counterintuitive that $h(y \mid x) > h(y)$ can happen, because this implies you gain more from $y$ by just simply knowing something else, $x$. However, this reflects the fact that you are unlikely to see $x, y$ together. Likewise, if $h(y \mid x) < h(y)$ you are likely to see $x, y$ together. More on this in the next section.</p>
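<p>The claim that $h(y \mid x)$ can exceed $h(y)$ is easy to check numerically. A minimal sketch in Python (the joint table here is a hypothetical example chosen so that the two events rarely co-occur):</p>

```python
import math

def h(p):
    """Self-information in bits of an event with probability p."""
    return -math.log2(p)

# Hypothetical joint distribution over two binary events.
p_joint = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}  # marginal of the first coordinate
p_y = {0: 1/2, 1: 1/2}  # marginal of the second coordinate

# h(y | x) = h(x and y) - h(x)
h_y_given_x = h(p_joint[(0, 0)]) - h(p_x[0])  # 4 - 1 = 3 bits
h_y = h(p_y[0])                               # 1 bit
# Knowing x = 0 made observing y = 0 MORE surprising: they rarely co-occur.
```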
<p>Confusingly, conditional entropy can refer to two different things.</p>
<p>First is expected conditional self-information,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H[\rY \mid \rX = x] &= \E_{y\sim \rY \mid \rX=x}[h(y \mid x)] \\
&= \E_{y\sim \rY \mid \rX=x}[\log_2\left(\frac{p(x)}{p(x, y)}\right)] \\
&= \sum\limits_{y \in Y} p(y \mid x) \log_2\left(\frac{p(x)}{p(x, y)}\right)\,.
\end{align} %]]></script>
<p>The other is what is most often referred to as <strong>conditional entropy</strong>,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H[\rY \mid \rX] &= \E_{x,y \sim \rX,\rY}[h(y \mid x)] \\
&= \E_{x,y \sim \rX,\rY}[\log_2\left(\frac{p(x)}{p(x, y)}\right)] \\
&= \E_{x\sim \rX} H[\rY \mid \rX = x]\,.
\end{align} %]]></script>
<p>The intuition behind $H[\rY \mid \rX = x]$ will be the same as that of entropy, $H[\rY]$, which we covered in the last section. Let’s gain some intuition for $H[\rY \mid \rX]$. If $H[\rY]$ measures uncertainty of $\rY$, then $H[\rY \mid \rX = x]$ measures conditional uncertainty given $x$, and $H[\rY \mid \rX]$ measures average conditional uncertainty w.r.t. $\rX$.</p>
<p>The maximum value of $H[\rY \mid \rX]$ is $H[\rY]$, which is achieved when $\rX$ and $\rY$ are independent random variables. This should make sense, as receiving a message from $\rX$ does not tell you anything about $\rY$, so your state of uncertainty does not decrease.</p>
<p>The minimum value of $H[\rY \mid \rX]$ is 0, which is achieved when $p_{\rY \mid \rX}(\cdot \mid \rX = x)$ is deterministic for all $x$, i.e. puts all its mass on a single outcome. In other words, you can define a function $g : X \rightarrow Y$ that maps each $x$ to its certain $y$. This would not be possible if $\rY \mid \rX$ were stochastic.</p>
<p>$H[\rY \mid \rX]$ is useful because it takes all $x \in X$ into consideration. You might have, for example, $H[\rY \mid \rX = x_1] = 0$ for $x_1$, but $H[\rY \mid \rX] > 0$, which means $y$ cannot always be deterministically decided from $x$. In the section on mutual information we will see how to think of $H[\rY \mid \rX]$ as a property of a stochastic function from $X$ to $Y$.</p>
<p>Because of linearity of expectation, all identities that hold for self-information hold for their entropy counterparts. For example,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(y \mid x) &= h(x \and y) - h(x) \\
\Longrightarrow H[\rY \mid \rX] &= H[(\rX, \rY)] - H[\rX]\,.
\end{align} %]]></script>
<p>This is a nice result. This equation says that the <span class="marginnote-outer"><span class="marginnote-ref">average uncertainty about $\rY$ given $\rX$</span><label for="40b324ca64f32166d198598020cd16fd3a369058" class="margin-toggle"> ⊕</label><input type="checkbox" id="40b324ca64f32166d198598020cd16fd3a369058" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Amount of information left to observe in $\rY$ on average.</span></span></span> equals the total expected information in their joint distribution, $(\rX, \rY)$, minus the average information in $\rX$. In other words, conditional entropy is the total information in $x \and y$ minus information in what you have, $x$, all averaged over all the possible $(x, y)$ you can have.</p>
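<p>The identity $H[\rY \mid \rX] = H[(\rX, \rY)] - H[\rX]$ is also easy to verify numerically. A minimal sketch using a small hypothetical joint table:</p>

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution over binary X and Y.
p_xy = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}

# H[Y | X] directly from its definition, E_{x,y}[h(y | x)].
H_y_given_x = sum(p * -math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

# Chain-rule identity: H[Y | X] = H[(X, Y)] - H[X].
assert math.isclose(H_y_given_x, entropy(p_xy) - entropy(p_x))
```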
<h1 id="mutual-information"><a class="header-anchor" href="#mutual-information">Mutual Information</a></h1>
<p>In my view, mutual information is what holds promise as a definition of information. This is the most important topic to understand for tackling the <a href="#problems-with-shannon-information">problems with Shannon information</a> section below.</p>
<h2 id="pointwise-mutual-information"><a class="header-anchor" href="#pointwise-mutual-information">Pointwise Mutual Information</a></h2>
<!-- Intuitively, if two events are causally connected, i.e. dependent, they contain redundant information combined. meaning that their combined information would be less than the sum of their information. It may also be the case that their combined information could be greater than the sum of their information! This is called *synergy*. We will see examples of this later. -->
<p>When two events $x$ and $y$ are dependent, how do we compute their total information? Previously we said that $h(x \and y) = h(x) + h(y)$ iff $p(x \and y) = p(x)p(y)$. However, the general case is,</p>
<script type="math/tex; mode=display">h(x \and y) = h(x) + h(y) - i(x, y)\,,</script>
<p>where I am defining $i(x, y)$ such that this equation holds. Rearranging we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
i(x, y) &= h(x) + h(y) - h(x \and y) \\
&= -\log_2(p(x)) - \log_2(p(y)) + \log_2(p(x \and y)) \\
&= \log_2\left(\frac{p(x \and y)}{p(x)p(y)}\right)\,.
\end{align} %]]></script>
<p>$i(x, y)$ is called <em>pointwise mutual information</em> (PMI). Informally, PMI measures the number of bits shared by two events. To say it another way, it measures how much information I have about one event given that I only observe the other. Notice that PMI is symmetric, $i(x, y) = i(y, x)$, so any two events contain the same information about each other.</p>
<p>$i(x, y)$ is a difference in information. Positive $i(x, y)$ indicates <em>redundancy</em>, i.e. total information is <span class="marginnote-outer"><span class="marginnote-ref">less than the sum of the parts</span><label for="03c34736803c8e2efd1383baf17104684eeab01a" class="margin-toggle"> ⊕</label><input type="checkbox" id="03c34736803c8e2efd1383baf17104684eeab01a" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">If you object that it doesn’t make sense to lose information by observing $x$ and $y$ together over observing them separately, it is important to note that $h(x) + h(y)$ is not a physically meaningful quantity, unless they are independent. Technically, you would have $h(x) + h(y \mid x)$ in total. $h(x)$ and $h(y)$ are both the amounts of information to gain by observing either $x$ or $y$ <strong>first</strong>.</span></span></span>: $h(x \and y) < h(x) + h(y)$. However, it may also be the case that $i(x, y)$ is negative so that $h(x \and y) > h(x) + h(y)$. <span class="marginnote-outer"><span class="marginnote-ref">This is called <a href="https://en.wikipedia.org/wiki/Synergy#Information_theory"><em>synergy</em></a>.</span><label for="4ce5fcaeebe9d13d3fd8d49c2de4ba62a623e205" class="margin-toggle"> ⊕</label><input type="checkbox" id="4ce5fcaeebe9d13d3fd8d49c2de4ba62a623e205" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The word <em>synergy</em> is conventionally used in the context of expected mutual information, and I am running the risk of conflating two distinct phenomena under the same word. There is no synergy between two random variables under expected mutual information, and this type of synergy only appears among 3 or more random variables. See <a href="https://en.wikipedia.org/wiki/Multivariate_mutual_information#Synergy_and_redundancy">https://en.wikipedia.org/wiki/Multivariate_mutual_information#Synergy_and_redundancy</a>.</span></span></span></p>
<p>This is highly speculative, but synergy (either the pointwise-MI or expected-MI kind) may be a fundamental insight that could explain <span class="marginnote-outer"><span class="marginnote-ref">emergence</span><label for="c55207622e4799bd3610c39427659c03f63135f5" class="margin-toggle"> ⊕</label><input type="checkbox" id="c55207622e4799bd3610c39427659c03f63135f5" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Emergence is a concept in philosophy. See <a href="https://en.wikipedia.org/wiki/Emergence">https://en.wikipedia.org/wiki/Emergence</a> and <a href="https://plato.stanford.edu/entries/properties-emergent/">https://plato.stanford.edu/entries/properties-emergent/</a></span></span></span> and possible limitations of reductionism in illuminating reality. See <a href="https://www.scottaaronson.com/blog/?p=3294">Higher-level causation exists (but I wish it didn’t)</a>.</p>
<!--
Let's look at (an admittedly contrived) example of synergy. Suppose our [sample space](https://en.wikipedia.org/wiki/Sample_space) is $\\{a, b, c\\}$ (composed of [coutcomes](https://en.wikipedia.org/wiki/Sample_space#Conditions_of_a_sample_space) or [elementary events](https://en.wikipedia.org/wiki/Elementary_event)), and we have two events $x = \\{a, b\\}$ and $y = \\{b, c\\}$. $x, y$ co-occur if we draw outcome $b$. If $p(a) = 7/16, p(b) = 1/8, p(c) = 7/16$, then $p(x) = 9/16$, $p(y) = 9/16$, $p(x \and y) = 1/8$. $h(x) = h(y) \approx 0.83$ and $h(x \and y) = 3$, so $i(x, y) = 2\cdot0.83 - 3 \approx -1.34$ bits.
That may seem like a contrived example, because I was working with composite events instead of elementary events. The same phenomenon can happen for joint distributions of sample spaces. <font color="red">TODO: explain the difference between the example above and a joint distribution.</font>
-->
<p><strong>Example:</strong><br />
Let $X = \{0, 1\}$ and $Y = \{0, 1\}$, then the joint sample space is the cartesian product $X \times Y$. $p_X(x), p_Y(y)$ denote marginal probabilities, and $p_{X,Y}(x, y)$ is their joint probability. The joint probability table:</p>
<table>
<thead>
<tr>
<th style="text-align: center">$x$</th>
<th style="text-align: center">$y$</th>
<th style="text-align: center">$p_{X,Y}(x, y)$</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1/16</td>
</tr>
<tr>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">7/16</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">7/16</td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1/16</td>
</tr>
</tbody>
</table>
<p>We have</p>
<ul>
<li>$h_X(0) = -\log_2 p_X(0) = -\log_2 1/2 = 1$</li>
<li>$h_Y(0) = -\log_2 p_Y(0) = -\log_2 1/2 = 1$</li>
<li>$h_{X,Y}(0 \and 0) = -\log_2 p_{X,Y}(0, 0) = -\log_2 1/16 = 4$</li>
</ul>
<p>$i(0,0) = h_X(0) + h_Y(0) - h_{X,Y}(0 \and 0) = -2$, and so $(0,0)$ is synergistic. On the other hand, $i(0,1) \approx 0.80735$, indicating $(0,1)$ is redundant.</p>
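<p>The arithmetic above is easy to reproduce. A minimal sketch, using the joint table just given:</p>

```python
import math

# The joint probability table from the example.
p_xy = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}  # marginals: 1/16 + 7/16 = 1/2
p_y = {0: 1/2, 1: 1/2}

def pmi(x, y):
    """Pointwise mutual information in bits."""
    return math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

print(pmi(0, 0))  # -2.0: synergistic
print(pmi(0, 1))  # ~0.807: redundant
```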
<h2 id="properties-of-pmi"><a class="header-anchor" href="#properties-of-pmi">Properties of PMI</a></h2>
<p>Let’s explore some of the properties of PMI. From here on out, I will consider sampling elementary events from a joint distribution, $(x, y) \sim (\rX, \rY)$, where $\rX, \rY$ are unspecified discrete (possibly infinite) random variables. For notational simplicity I’ll drop the subscripts from distributions, so $p(x), p(y)$ denote the marginals, $\rX$ and $\rY$ respectively, and $p(x, y)$ denotes the joint $(\rX,\rY)$.</p>
<p>To recap, PMI measures the difference in bits between the product of marginals $p(x)p(y)$ and the joint $p(x, y)$, as evidenced by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
i(x, y) &= \log_2\left(\frac{p(x, y)}{p(x)p(y)}\right) \\
&= h(x) + h(y) - h(x \and y)\,.
\end{align} %]]></script>
<p>Negative PMI implies synergy, while positive PMI implies redundancy.</p>
<p>Another way to think about PMI is as a measure of how much $p(y \mid x)$ differs from $p(y)$ (and vice versa). Suppose an oracle sampled $(x, y) \sim (\rX,\rY)$, but the outcome $(x, y)$ remains hidden from you. $h(y)$ is the information you stand to gain by having $y$ revealed to you. However, $h(y \mid x)$ is what you stand to gain from seeing $y$ if $x$ is already revealed. You do not know how much information $x$ contains about $y$ without seeing $y$. Only the oracle knows this. However, if you know $p(y \mid x)$, then you can compute your expected information gain (conditional uncertainty), $H[\rY \mid \rX=x]$.</p>
<p>PMI measures the change in information you will gain about $y$ (from the oracle’s perspective) before and after $x$ is revealed (and vice versa). In this view, it makes sense to rewrite PMI as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
i(x, y) &= \log_2\left(\frac{p(y \mid x)}{p(y)}\right) \\
&= -\log_2\,p(y) + \log_2\,p(y \mid x) \\
&= h(y) - h(y \mid x)\,.
\end{align} %]]></script>
<h3 id="special-values"><a class="header-anchor" href="#special-values">Special Values</a></h3>
<p>$i(x, y) = 0$ iff the events $x$ and $y$ are independent, i.e. $p(x, y) = p(x)p(y)$, which holds for every pair when $\rX, \rY$ are independent random variables. Verifying, we see that,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
i(x, y) &= \log_2\left(\frac{p(x)p(y)}{p(x)p(y)}\right) \\
&= 0\,.
\end{align} %]]></script>
<p>The maximum possible PMI occurs when $x$ and $y$ are perfectly associated, i.e. $p(y \mid x) = 1$ or $p(x \mid y) = 1$. Then $h(y \mid x) = 0$ (or vice versa), meaning you know everything about $y$ if you have $x$, and $i(x, y) = h(y) - h(y \mid x) = h(y)$. In general, the maximum possible PMI is $\min\{h(x), h(y)\}$.</p>
<p>PMI has no minimum, and goes to $-\infty$ if $x$ and $y$ can never occur together but can occur separately, i.e. $p(x, y) = 0$ while $p(x), p(y) > 0$. We can see that $p(y \mid x) = p(x, y)/p(x) = 0$ so long as $p(x) > 0$. So $h(y \mid x) \to \infty$, and we have $i(x, y) = h(y) - h(y \mid x) \to -\infty$ if $h(y) > 0$.</p>
<p>While redundancy is bounded, synergy is unbounded. This should make sense: $h(x), h(y)$ are bounded, so there is a maximum amount of information to redundantly share. On the other hand, synergy measures how rare the co-occurrence of $(x, y)$ is relative to their marginal probabilities, where lower $p(x, y)$ means their co-occurrence is more special. So if $(x, y)$ can never occur together, then their co-occurrence is infinitely special.</p>
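<p>A small sketch of both extremes, using a hypothetical pair where $y$ is a deterministic copy of $x$ (maximum PMI) and a joint probability driven toward zero (PMI diverging to $-\infty$):</p>

```python
import math

def h(p):
    """Self-information in bits."""
    return -math.log2(p)

# y is a deterministic copy of x, so p(x, y) = p(x) on the diagonal.
p_x = {0: 1/4, 1: 3/4}
pmi_00 = math.log2(p_x[0] / (p_x[0] * p_x[0]))  # i(0, 0)
assert math.isclose(pmi_00, h(p_x[0]))          # maximum PMI = h(x) = min{h(x), h(y)}

# As p(x, y) -> 0 with the marginals fixed, PMI -> -infinity.
for p_joint in (1e-3, 1e-6, 1e-9):
    print(math.log2(p_joint / (p_x[0] * p_x[1])))  # more and more negative
```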
<h2 id="expected-mutual-information"><a class="header-anchor" href="#expected-mutual-information">Expected Mutual Information</a></h2>
<p>Expected mutual information, also just called mutual information (MI), is given as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
I[\rX, \rY] &= \E_{x,y \sim \rX,\rY}[i(x, y)] \\
&= \E_{x,y \sim \rX,\rY}\left[\log_2\left(\frac{p(x, y)}{p(x)p(y)}\right)\right]\,.
\end{align} %]]></script>
<p>$I$ is to correlation as $H$ is to variance. While correlation measures to what extent $\rX$ and $\rY$ have a <a href="https://en.wikipedia.org/wiki/Correlation_and_dependence">linear relationship</a>, $I$ measures the strength of their statistical dependency. While variance measures average squared distance from the mean, $H$ is distance agnostic, i.e. it measures unordered dispersion. Similarly, while statistical correlation measures deviation of the mapping between $\rX$ and $\rY$ from perfectly linear, $I$ is shape agnostic, i.e. it measures unordered statistical dependence.</p>
<p>First off, it is important to point out that $I$ is always non-negative, unlike its pointwise counterpart (proof <a href="https://math.stackexchange.com/a/159544">here</a>). You can see this intuitively by trying to construct an anti-dependent relationship between $\rX$ and $\rY$. On average, $p(x, y)$ would have to be less than the product of the marginals. You can construct individual cases where this is true for a particular $(x, y)$, but to do that, you will have to fill most of the probability table (for a 2D joint) with probability mass to compensate. This is reflected in Jensen’s inequality. A direct consequence is $H[\rY] \geq H[\rY \mid \rX]$.</p>
<p>$I$ being non-negative means you can safely think about it as a measure of information content. In this sense, information is stored in the relationship between $\rX$ and $\rY$.</p>
<p>Note that by remembering that expectation is linear, some useful identities pop out of the definition above,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
I[\rX, \rY] &= H[\rX] + H[\rY] - H[(\rX,\rY)] \\
&= H[\rX] - H[\rX \mid \rY] \\
&= H[\rY] - H[\rY \mid \rX]\,.
\end{align} %]]></script>
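<p>These identities, and the non-negativity of $I$, can be checked numerically. A minimal sketch with a hypothetical joint table:</p>

```python
import math

def H(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

p_xy = {(0, 0): 1/16, (0, 1): 7/16, (1, 0): 7/16, (1, 1): 1/16}
p_x = {0: 1/2, 1: 1/2}
p_y = {0: 1/2, 1: 1/2}

# Definition: I = E[i(x, y)]. Some pointwise terms are negative...
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
# ...but the expectation is non-negative,
assert I >= 0
# and matches the identity I = H[X] + H[Y] - H[(X, Y)].
assert math.isclose(I, H(p_x) + H(p_y) - H(p_xy))
```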
<p>An intuitive way to think about $I$ is as a continuous measure of <em>bijectivity</em> of the stochastic function, $g(x) \sim p(\rY \mid \rX = x)$, where $g : X \rightarrow Y$. This is easier to see if we write</p>
<script type="math/tex; mode=display">I[\rX, \rY] = H[\rY] - H[\rY \mid \rX]\,.</script>
<p>$H[\rY]$ measures <em>surjectivity</em>, i.e. how much $g$ spreads out over $Y$ (marginalized over $\rX$). <em>surjective</em> (a.k.a. onto) in the set-theory sense means that $g$ maps to every element in $Y$. In the statistical sense, $g$ may map to every element in $Y$ with some probability, but to some elements much more frequently than others. We would say $p(y)$ is <em>peaky</em>, the opposite of spread out. Recall that $H$ measures statistical dispersion. Larger $H[\rY]$ means more even spread of probability mass across all the elements in $Y$. In that sense, it measures how surjective $g$ is.</p>
<p>$H[\rY \mid \rX]$ measures <em>anti-injectivity</em>. <em>injective</em> (a.k.a. one-to-one) in the set-theory sense means that $g$ maps every element in $X$ to a unique element in $Y$. There is no sharing, and you know which $x \in X$ was the input for any $y \in Y$ in the image of $g(X)$. In the statistical sense, $g$ may map a given $x$ to many elements in $Y$, each with some probability, i.e. fan-out. Anti-injectivity is like a reversal of injectivity: where injectivity concerns fan-in, anti-injectivity concerns fan-out. The more $g$ fans out, the more anti-injective it is. Recall that $H[\rY \mid \rX]$ measures average uncertainty about $\rY$ given an observation from $\rX$. This is, in a sense, the average statistical fan-out of $g$. Lower $H[\rY \mid \rX]$ means $g$’s output is more concentrated (peaky) on average for a given $x$, and higher means its output is more uniformly spread on average.</p>
<p>For a function to be a bijection, it needs to be both injective and surjective. $H[\rY]$ may seem like a good continuous proxy for surjectivity, but $H[\rY \mid \rX]$ seems to measure something different from injectivity. Notice that $H[\rY \mid \rX]$ is affected by the injectivity of $g^{-1}$. If $g^{-1}$ maps many $y$s to the same $x$, then we are uncertain about what $g(x)$ should be.</p>
<p>In general, I claim that $I[\rX, \rY]$ measures how bijective $g$ is. $I[\rX, \rY]$ is maximized when $H[\rY]$ is maximized and $H[\rY \mid \rX]$ is minimized (i.e. 0). That is, when $g$ is maximally surjective and minimally anti-injective, implying it is maximally injective. Higher $I[\rX, \rY]$ actually does indicate that $g$ is more invertible because $I$ is symmetric. It measures how much information can flow through $g$ in either direction.</p>
<figure><img src="https://upload.wikimedia.org/wikipedia/commons/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg" alt="Useful diagram for keeping track of the relationships between these concepts." width="100%" /><figcaption>Useful diagram for keeping track of the relationships between these concepts.<br />Credit: <a href="https://en.wikipedia.org/wiki/Mutual_information">https://en.wikipedia.org/wiki/Mutual_information</a></figcaption></figure>
<h3 id="channel-capacity"><a class="header-anchor" href="#channel-capacity">Channel capacity</a></h3>
<p>$I$ is determined by $p(x)$ just as much as $p(y \mid x)$, but $g$ has ostensibly nothing to do with $p(x)$. If we want $I$ to measure properties of $g$ in isolation, it should not care about the distribution over its inputs. One solution to this issue is to use the <a href="https://en.wikipedia.org/wiki/Channel_capacity#Formal_definition"><strong>capacity</strong></a> of $g$, defined as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
C[g] &= \sup_{p_X(x)} I[\rX, \rY] \\
&= \sup_{p_X(x)} \E_{y\sim p_{g(x)}, x \sim p_X}[i(x, y)] \\
&= \sup_{p_X(x)} \E_{y\sim p_{Y \mid X=x}, x \sim p_X}[h(y) - h(y \mid x)]\,.
\end{align} %]]></script>
<p>In other words, if you don’t have a preference for $p(x)$, choose $p(x)$ which maximizes $I[\rX, \rY]$.</p>
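<p>For a concrete special case: the capacity of a binary symmetric channel with flip probability $\epsilon$ is the standard result $1 - H_b(\epsilon)$, achieved by the uniform input distribution. A grid search over $p(x)$ recovers it (a sketch, not part of the derivation above):</p>

```python
import math

def hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mi_bsc(px0, eps):
    """I[X, Y] for a binary symmetric channel with flip probability eps.
    For a BSC, H[Y | X] = hb(eps) regardless of the input distribution."""
    py0 = px0 * (1 - eps) + (1 - px0) * eps
    return hb(py0) - hb(eps)  # I = H[Y] - H[Y | X]

eps = 0.1
# Approximate the supremum over input distributions by a grid search.
capacity = max(mi_bsc(k / 1000, eps) for k in range(1001))
assert math.isclose(capacity, 1 - hb(eps))  # uniform input (px0 = 0.5) is optimal
```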
<h1 id="shannon-information-for-continuous-distributions"><a class="header-anchor" href="#shannon-information-for-continuous-distributions">Shannon Information For Continuous Distributions</a></h1>
<p>Up to now we’ve only considered discrete distributions. Describing the information content in continuous distributions and their events is tricky business, and a bit more nuanced than usually portrayed. Let’s explore this.</p>
<p>For this discussion, let’s consider a random variable $\rX$ with <a href="https://en.wikipedia.org/wiki/Support_(mathematics)#Support_of_a_distribution">support</a> over $\R$. Let $f(x)$ be the probability density function (pdf) of $\rX$.</p>
<p>Elementary events $x \in \rX$ <span class="marginnote-outer"><span class="marginnote-ref">do not have probabilities per se</span><label for="81ebfb9082ef06c6091bdf79ef229fe6555037fd" class="margin-toggle"> ⊕</label><input type="checkbox" id="81ebfb9082ef06c6091bdf79ef229fe6555037fd" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">you could say their probability mass is 0 in the limit</span></span></span>. Self-information is a function of probability mass, so we should instead compute self-info of events that are intervals (or measurable sets) over $\R$. For example,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(a < x < b) &= -\log_2\,p(a < \rX < b)\\
&= -\log_2\left(\int_a^b f(x) \diff x\right)
\end{align} %]]></script>
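<p>For instance, taking $\rX$ to be standard normal, we can compute interval self-information with the closed-form CDF via the error function (a minimal sketch):</p>

```python
import math

def std_normal_cdf(x):
    """CDF of a standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def interval_self_info(a, b):
    """h(a < X < b) in bits for X ~ N(0, 1)."""
    mass = std_normal_cdf(b) - std_normal_cdf(a)
    return -math.log2(mass)

print(interval_self_info(-1, 1))  # ~0.55 bits: a likely interval is unsurprising
print(interval_self_info(3, 4))   # ~9.6 bits: a rare interval carries more information
```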
<p>Conjecture: The entropy of any distribution with uncountable support is infinite. This should make sense, as we now have uncountably many possible outcomes. One observation rules out infinitely many alternatives, so it should contain infinite information. We can see this clearly because the entropy of a uniform distribution over $N$ possibilities is $\log_2 N$ which grows to infinity as $N$ does. On the other hand, a one-hot distribution over $N$ possibilities has 0 entropy, because you will <span class="marginnote-outer"><span class="marginnote-ref">always observe</span><label for="2b312968d2ec3765f14ce161dd47e1212e3f03cc" class="margin-toggle"> ⊕</label><input type="checkbox" id="2b312968d2ec3765f14ce161dd47e1212e3f03cc" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Unless you observe an impossible outcome, in which case you gain infinite information!</span></span></span> the probability-1 outcome and gain 0 information. So we expect the Dirac-delta distribution to have 0 entropy.</p>
<p>But wait, the Gaussian distribution is a <a href="https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution">maximum-entropy</a> distribution. That people can say “a continuous distribution has maximum entropy” implies their entropies can be numerically compared! And frankly, people talk about entropy of continuous distributions all the time, and they are very much finite! It turns out, what people normally call entropy for continuous distributions is actually <a href="https://en.wikipedia.org/wiki/Differential_entropy">differential entropy</a>, which is not the same thing as the $H$ we’ve been working with.</p>
<p>I’ll show that $H[\rX]$ is infinite when the distribution has continuous support, <span class="marginnote-outer"><span class="marginnote-ref">following a similar proof in <a href="https://www.crmarsh.com/static/pdf/Charles_Marsh_Continuous_Entropy.pdf">Introduction to Continuous Entropy</a></span><label for="6a33d3d476c15c73c5fec235f81ff2c69eec7113" class="margin-toggle"> ⊕</label><input type="checkbox" id="6a33d3d476c15c73c5fec235f81ff2c69eec7113" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">And also in Elements of Information Theory section 8.3.</span></span></span>. To do that, let’s take a <a href="https://en.wikipedia.org/wiki/Riemann_sum">Riemann sum</a> of $f(x)$. Let $\{x_i\}_{i=-\infty}^\infty$ be a set of points equally spaced by intervals of $\Delta$.</p>
<script type="math/tex; mode=display">% <![CDATA[
% \def\u{\Delta x}
\def\u{\Delta}
\begin{align}
H[\rX] &= -\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(f(x_i) \u\right) \\
&= -\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(f(x_i)\right) - \lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(\u\right)\,.
\end{align} %]]></script>
<p>The left term is just the (negated) Riemann integral of $f(x)\log_2(f(x))$, which I will define as <span class="marginnote-outer"><span class="marginnote-ref"><strong>differential entropy</strong></span><label for="642b79d7e276fc9865eef223339f4eb4b1a72b14" class="margin-toggle"> ⊕</label><input type="checkbox" id="642b79d7e276fc9865eef223339f4eb4b1a72b14" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Typically $h$ is used to denote differential entropy, but I’ve already used it for self-information, so I’m using $\eta$ instead.</span></span></span>:</p>
<script type="math/tex; mode=display">\eta[f] := -\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(f(x_i)\right) = -\int_{-\infty}^\infty f(x) \log_2\left(f(x)\right) \diff{x}\,.</script>
<p>The right term can be simplified using the <a href="https://tutorial.math.lamar.edu/Classes/CalcI/LimitsProperties.aspx">limit product rule</a>:</p>
<script type="math/tex; mode=display">-\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u \log_2\left(\u\right) = -\left(\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u\right)\cdot\left(\lim\limits_{\u \to 0}\log_2\left(\u\right)\right)\,.</script>
<p>Note that</p>
<p><script type="math/tex">\lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty f(x_i) \u = \int_{-\infty}^\infty f(x) \diff{x} = 1\,,</script><br />
because $f(x)$ is a p.d.f.</p>
<p>Putting it all together we have</p>
<script type="math/tex; mode=display">H[\rX] = \eta[f] - \lim\limits_{\u \to 0}\log_2\left(\u\right)\,.</script>
<p>$\log_2(\u) \to -\infty$ as $\u \to 0$, so $H[\rX]$ explodes to infinity when $\eta[f]$ is finite, which it is for most well-behaved functions.</p>
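<p>This relation can be observed numerically. For a standard normal, $\eta[f] = \frac{1}{2}\log_2(2\pi e) \approx 2.05$ bits, and the discrete entropy of the binned distribution tracks $\eta[f] - \log_2 \Delta$ as $\Delta$ shrinks (a sketch):</p>

```python
import math

def gauss_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

eta = 0.5 * math.log2(2 * math.pi * math.e)  # differential entropy of N(0, 1)

for delta in (0.1, 0.01):
    # Discrete entropy of the binned pdf: bins of width delta covering [-10, 10].
    H = 0.0
    n = int(20 / delta)
    for i in range(n):
        mass = gauss_pdf(-10 + i * delta) * delta
        if mass > 0:
            H -= mass * math.log2(mass)
    print(delta, H, eta - math.log2(delta))  # last two columns agree; both blow up as delta -> 0
```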
<p>A simple proof that $H$ is finite for distributions with support over a finite set: the Riemann sum above has at most finitely many non-zero terms as $\Delta \to 0$.</p>
<p>Differential entropy is very different from entropy. It can be unboundedly negative. For example, the differential entropy of a Gaussian distribution with variance $\sigma^2$ is $\frac{1}{2}\ln(2\pi e \sigma^2)$. Taking the limit as $\sigma \to 0$, we see the differential entropy of the <span class="marginnote-outer"><span class="marginnote-ref">Dirac-delta distribution is $-\infty$</span><label for="db0f66f8ebc12d35b9801d084e4ee4afbd893dc5" class="margin-toggle"> ⊕</label><input type="checkbox" id="db0f66f8ebc12d35b9801d084e4ee4afbd893dc5" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">Plugging $\eta[f] = -\infty$ into our relation $H[\rX] = \eta[f] - \lim\limits_{\u \to 0}\log_2\left(\u\right)$, we see why the entropy of $\delta(x)$ would be 0.</span></span></span>. A notable problem with differential entropy is that it is not invariant to change of coordinates, and there is a proposed fix for that: <a href="https://en.wikipedia.org/wiki/Limiting_density_of_discrete_points">https://en.wikipedia.org/wiki/Limiting_density_of_discrete_points</a>.</p>
<h2 id="proof-that-mi-is-fininte-for-continuous-distributions"><a class="header-anchor" href="#proof-that-mi-is-fininte-for-continuous-distributions">Proof that MI is finite for continuous distributions</a></h2>
<p>A very nice result is that expected mutual information is finite where entropy would be infinite, so long as there is some amount of noise between the two random variables. This implies that even if physical processes are continuous and contain infinite information, we can only get finite information out of them, because measurement requires establishing a statistical relation between the measurement device and that system which is always noisy in reality. MI is agnostic to discrete or continuous universes! As long as there is some amount of noise in between a system and your measurement, your measurement will contain finite information about the system.</p>
<p>The proof follows the same Riemann sum approach from the previous section. I will show that mutual information and differential mutual information are equivalent. Since differential mutual information is finite for well-behaved densities, so is mutual information!</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
I[\rX, \rY] &= \lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty \sum\limits_{j=-\infty}^\infty f_{XY}(x_i, y_j) \u^2 \log_2\left(\frac{f_{XY}(x_i, y_j)\u^2}{f_X(x_i)\u f_Y(y_j)\u} \right) \\
&= \lim\limits_{\u \to 0} \sum\limits_{i=-\infty}^\infty \sum\limits_{j=-\infty}^\infty f_{XY}(x_i, y_j) \u^2 \log_2\left(\frac{f_{XY}(x_i, y_j)}{f_X(x_i)f_Y(y_j)} \right) \\
&= \int_{-\infty}^\infty \int_{-\infty}^\infty f_{XY}(x, y) \log_2\left(\frac{f_{XY}(x, y)}{f_X(x)f_Y(y)} \right) \diff{y}\diff{x}\,
\end{align} %]]></script>
<p>because the $\Delta$s cancel inside the log.</p>
<p>If $p(\rY \mid \rX = x)$ is a Dirac-delta for all $x$, and $p(\rY)$ has continuous support, then $I[\rX, \rY]= H[\rY] - H[\rY \mid \rX] = \infty$ because $H[\rY]=\infty$ and $H[\rY \mid \rX]=0$. Thus some noise between $\rX$ and $\rY$ is required to make the MI finite. It follows that $I[\rX, \rX] = H[\rX] = \infty$ when $\rX$ has continuous support.</p>
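<p>As a numerical sanity check (a sketch, using a standard closed form rather than anything derived above): for a bivariate Gaussian with correlation $\rho$, $I = -\frac{1}{2}\log_2(1 - \rho^2)$, and the discretized double Riemann sum converges to this finite value even though $H[\rX]$ and $H[\rY]$ are both infinite:</p>

```python
import math

rho = 0.5

def f_joint(x, y):
    """Standard bivariate normal density with correlation rho."""
    z = (x * x - 2 * rho * x * y + y * y) / (1 - rho ** 2)
    return math.exp(-z / 2) / (2 * math.pi * math.sqrt(1 - rho ** 2))

def f_marg(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Double Riemann sum for I[X, Y] on a grid covering [-6, 6]^2.
d = 0.05
pts = [-6 + i * d for i in range(int(12 / d))]
I_num = sum(
    f_joint(x, y) * d * d * math.log2(f_joint(x, y) / (f_marg(x) * f_marg(y)))
    for x in pts for y in pts
)
I_closed = -0.5 * math.log2(1 - rho ** 2)
assert abs(I_num - I_closed) < 1e-3  # finite, despite continuous support
```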
<h1 id="problems-with-shannon-information"><a class="header-anchor" href="#problems-with-shannon-information">Problems With Shannon Information</a></h1>
<p><strong>Question:</strong> Do the concepts just outlined capture our colloquial understanding of information? Are there situations where they behave differently from how we expect information to behave? I’ll go through some fairly immediate objections to Shannon’s definition of information, and some remedies.</p>
<h2 id="1-tv-static-problem"><a class="header-anchor" href="#1-tv-static-problem">1. TV Static Problem</a></h2>
<p>Imagine a TV displaying static noise. If we assume a fairly uniform distribution over all “static noise” images, we know that the entropy of the TV visuals will be high, because probability mass is spread fairly evenly across all possible images. Each image on average has a very low probability of occurring. According to Shannon, each image then contains a large amount of information.</p>
<p>That may sound absurd. <a href="https://en.wikipedia.org/wiki/Noise_(signal_processing)">Noise</a>, by some definitions, carries no useful information. Noise is uninformative. To a human looking at TV static, the information gained is that the TV is not displaying anything. This is a very high level piece of information, but much less than the supposedly high information content of the static itself.</p>
<figure><img src="/assets/posts/primer-shannon-information/tv-static.png" alt="" width="100%" /><figcaption></figcaption></figure>
<p>The resolution here is to define what it means for a human to obtain information. I propose looking at the mutual information between the TV and the viewer’s brain. Let $\rX$ be a random variable over TV images, and $\rZ$ be a random variable over the viewer’s brain states. The support of $\rX$ is the space of all possible TV screens, so static and SpongeBob are just different distributions over the same space. Now, the state of the viewer’s brain is causally connected to what is on the TV screen, but the nature of their visual encoder (visual cortex) determines $p(\rZ \mid \rX)$, and thus $p(\rZ, \rX)$. I would guess that any person who says TV static is uninformative does not retain much detail about the patterns in the static. Basically, that person would just remember that they saw static. What we have here is a region of large fan-in. Many static images are collapsed to a single output for their visual encoder, namely the label “TV noise”. So the information contained in TV static is low to a human, because $I[\rX, \rZ]$ is low when $\rX$ is the distribution of TV static.</p>
<p>Note that the signal, “TV noise”, is still rather informative, if you consider the space of all possible labels you could assign to the TV screen, e.g. “SpongeBob” or “sitcom”. Further, that you are looking at a TV and not anything else is information.</p>
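<p>A toy model of this fan-in (my own construction; the encoder and the eight-image space are hypothetical) shows $I[\rX, \rZ]$ collapsing to zero when many images map to one label:</p>

```python
# Sketch: a "visual encoder" Z = enc(X) with large fan-in maps every static
# image to the single label "TV noise", so I[X, Z] is low even though H[X]
# is high. An encoder that distinguishes images recovers the full entropy.
import numpy as np

def mutual_info(joint):
    """I[X, Z] in bits from a joint probability table p(x, z)."""
    px = joint.sum(axis=1, keepdims=True)
    pz = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px * pz)[mask])))

n = 8                            # eight equally likely screen images
p_x = np.full(n, 1.0 / n)        # H[X] = 3 bits

# fan-in: every static image maps to the one brain state "TV noise"
collapse = np.zeros((n, n))
collapse[:, 0] = p_x

# contrast: an encoder that distinguishes every image
identity = np.diag(p_x)

print(mutual_info(collapse))     # 0 bits: the static is uninformative
print(mutual_info(identity))     # 3 bits: the full entropy of X
```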
<h2 id="2-shannon-information-is-blind-to-scrambling"><a class="header-anchor" href="#2-shannon-information-is-blind-to-scrambling">2. Shannon Information is Blind to Scrambling</a></h2>
<p>Encryption scrambles information to make it inaccessible to prying eyes. Encryption is usually lossless, meaning the original message is fully recoverable. If $\rX$ is a distribution over messages, then an encryption function Enc is a bijection on message space, so it preserves the entropy of that distribution. To Shannon information, $\rX$ and $\text{Enc}(\rX)$ contain the same information. Shannon information is therefore blind to operations like scrambling, which do something interesting to the information present, e.g. making it accessible or inaccessible.</p>
<p>The resolution is again mutual information. While permuting message space (or applying any bijective transformation) does not change information content under Shannon, it changes the useful information content. A human looking at (or otherwise perceiving) a message is creating a causal link between the message and a representation in the brain. This link has mutual information. Likewise, any measurement apparatus establishes a link between physical state and a representation of that state (the measurement result), again establishing mutual information.</p>
<p>Information in a message becomes inaccessible or useless when the representation of the message cannot distinguish between two messages. Encryption maps the part of message space that human brains can discriminate, i.e. meaningful English sentences (or other such meaningful content) to a part of message space that humans cannot discriminate, i.e. apparently arbitrary character strings. These arbitrary strings appear to be meaningless because they are all mapped to the same or similar representation in our heads, namely the “junk text” label. In short, mutual information between plain text and brain states is much higher than mutual information between encrypted text and brain states.</p>
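<p>A toy model (my own construction, with a hypothetical four-message space) makes both halves of this concrete: a permutation of message space preserves entropy exactly, but a reader who lumps all scrambled text under one “junk” representation extracts no mutual information from it:</p>

```python
# Sketch: messages 0 and 1 are "meaningful", 2 and 3 are "junk". Encryption is
# a permutation, so H[X] = H[Enc(X)], yet the reader's representation fans
# junk messages into one label, killing the mutual information.
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def reader_mi(p):
    """I between message M ~ p and reader state r(M)."""
    r = [0, 1, 2, 2]                     # junk messages collapse to label 2
    joint = np.zeros((4, 3))
    for m, pm in enumerate(p):
        joint[m, r[m]] += pm
    px = joint.sum(axis=1, keepdims=True)
    pz = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px * pz)[mask])))

plain = np.array([0.5, 0.5, 0.0, 0.0])   # meaningful English sentences
enc = plain[[2, 3, 0, 1]]                # permutation moves mass onto junk

print(H(plain), H(enc))                  # equal: encryption preserves entropy
print(reader_mi(plain))                  # 1 bit: reader tells them apart
print(reader_mi(enc))                    # 0 bits: all ciphertext looks alike
```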
<h2 id="3-deterministic-information"><a class="header-anchor" href="#3-deterministic-information">3. Deterministic information</a></h2>
<p>How does data on disk contain information if it is fixed and known? How does the output of a deterministic computer program contain information? How do math proofs contain information? None of these things has an inherent probability distribution. If there is uncertainty, we might call it <span class="marginnote-outer"><span class="marginnote-ref">logical uncertainty</span><label for="0e18784002a64f3be42ab79f74a4282e0d11c58f" class="margin-toggle"> ⊕</label><input type="checkbox" id="0e18784002a64f3be42ab79f74a4282e0d11c58f" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">See <a href="https://intelligence.org/2016/04/21/two-new-papers-uniform/">New papers dividing logical uncertainty into two subproblems</a><br />and <a href="https://golem.ph.utexas.edu/category/2016/09/logical_uncertainty_and_logica.html">Logical Uncertainty and Logical Induction</a></span></span></span>. It is an open question whether logical uncertainty and empirical uncertainty should be conflated, and both brought under the umbrella of probability theory.</p>
<p>This is similar to asking: how does Shannon information account for what I already know? When I observe a message I didn’t already know, it is informative; but what about the information contained in messages I already have? It is also an open question whether probability should be considered objective or subjective, and whether quantities of information are objective or subjective. Perhaps you regard a message you already have as informative because you are implicitly modeling its information w.r.t. some other receiver who has not yet received it.</p>
<h2 id="4-if-the-universe-is-continuous-everything-contains-infinite-information"><a class="header-anchor" href="#4-if-the-universe-is-continuous-everything-contains-infinite-information">4. If the universe is continuous everything contains infinite information</a></h2>
<p>This one is resolved by the discussion above about the mutual information of continuous distributions being finite, so long as there is noise between the two random variables. Thus, in a universe where all measurements are noisy, mutual information is always finite regardless of the underlying metaphysics (whether objects contain finite or infinite information in an absolute sense).</p>
<h2 id="5-shannon-information-ignores-the-meaning-of-messages"><a class="header-anchor" href="#5-shannon-information-ignores-the-meaning-of-messages">5. Shannon information ignores the meaning of messages</a></h2>
<p>There is a competing information theory, <a href="https://en.wikipedia.org/wiki/Algorithmic_information_theory">algorithmic information theory</a>, which uses the length of the shortest program that can output a message $x$ as the information measure of $x$, called its <a href="https://en.wikipedia.org/wiki/Kolmogorov_complexity">Kolmogorov complexity</a>. If $x$ is less compressible, it contains more information. This is analogous to low $p_X(x)$ leading to its optimal <a href="https://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding">Shannon-Fano</a> code being longer, and thus containing more information.</p>
<p>Algorithmic information theory addresses the criticism that $h(x)$ depends only on the probability of $x$, rather than the meaning of $x$. If $x$ is a word, sentence, or even a book, the information content of $x$ supposedly does not depend on what the text is! Algorithmic information theory defines information as a property of the content of $x$ as a string, and drops the dependency on probability.</p>
<p>I think this criticism does not consider what <em>meaning</em> is. A steel-manned Shannon information at least seems self-consistent to me. Again, the right approach is to use mutual information. I propose that the meaning of a piece of text ultimately comes from the brain state it invokes in you when you read it. Your <a href="https://www.deeplearningbook.org/contents/representation.html">representation</a> of the text shares information with the text. So while, yes, the probability of $x$ in a void may be meaningless, the joint probability of $(x, z)$, where $z$ is your brain state, is what gives $x$ meaning. Shannon information being blind to what we are calling the contents of a message can be seen as a virtue: it is blind to <em>preconceived</em> meaning. While statistical variance cares about the Euclidean distance between points in $\mathbb{R}^n$, entropy does not, and should not if the mathematical representation of these points as vectors is unimportant. Shannon does not care what you label your points! Their meaning comes solely from their co-occurrence with other random variables.</p>
<p>I think condensing a string of text, like a book, into one random variable $\rX$ is very misleading, because this distribution factors! A book is a single outcome from a distribution over all strings of characters, and that distribution factors into the conditionals $p(\rC_i \mid \rC_{i-1}, \ldots, \rC_2, \rC_1)$, where $\rC_i$ is the random variable for the $i$-th character in the book. In this way, each character position carries semantic information in its probability distribution conditioned on the previous characters. The premise of <a href="https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf">language modeling</a> in machine learning is that the statistical relationships between words (their frequencies of co-occurrence) in a corpus of text <span class="marginnote-outer"><span class="marginnote-ref">determine their meaning</span><label for="420e4613ec913eb3ebc7f105b4ba7df0378bbd7b" class="margin-toggle"> ⊕</label><input type="checkbox" id="420e4613ec913eb3ebc7f105b4ba7df0378bbd7b" class="margin-toggle" /><span class="marginnote"><span class="marginnote-inner">The theory goes that a computer which can estimate frequencies of words very precisely would implicitly have to create internal representations of those words which encode their meaning, and so beefed up language modeling is all that is needed for intelligence.</span></span></span>.</p>
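<p>A minimal numerical illustration of this factoring (the two-character “alphabet” and probabilities are my own toy numbers) checks the chain rule $H[\rC_1, \rC_2] = H[\rC_1] + H[\rC_2 \mid \rC_1]$:</p>

```python
# Sketch: the entropy of a two-character "book" over the alphabet {a, b}
# decomposes into the entropy of the first character plus the average
# entropy of the conditional p(c2 | c1).
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# joint distribution over pairs (c1, c2)
p = np.array([[0.4, 0.1],      # p('a','a'), p('a','b')
              [0.1, 0.4]])     # p('b','a'), p('b','b')
p1 = p.sum(axis=1)             # marginal over the first character

# H[C2 | C1] computed directly from the conditionals p(c2 | c1)
H_cond = sum(p1[i] * H(p[i] / p1[i]) for i in range(2))

print(H(p1), H_cond, H(p))     # the first two sum to the third
```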
<h2 id="6-probability-distributions-are-not-objective"><a class="header-anchor" href="#6-probability-distributions-are-not-objective">6. Probability distributions are not objective</a></h2>
<p>I touched on this already. Probability has two interpretations: frequentist (objective) and Bayesian (subjective). It is unclear if frequentist probability is an objective property of matter. For repeatable controlled experiments, a frequentist description is reasonable, like in games of chance, and in statistical mechanics and quantum mechanics. When probability is extended to systems that don’t repeat in any meaningful sense, like the stock market or historical events, the objectiveness is dubious. There is a camp that argues probability should reflect the state of belief of an observer, and is more a measurement of the brain doing the observing than the thing being observed.</p>
<p>So then this leads to an interesting question: is Shannon information a property of a system being observed, or a property of the observer in relation to it (or both together)? Is information objective in the sense that multiple independent parties can do measurements to verify a quantity of information, or is it subjective in the sense that it depends on the beliefs of the person doing the calculating? I am not aware of any answer or consensus on this question for information in general.</p>
<h1 id="appendix"><a class="header-anchor" href="#appendix">Appendix</a></h1>
<h2 id="properties-of-conditional-entropy"><a class="header-anchor" href="#properties-of-conditional-entropy">Properties of Conditional Entropy</a></h2>
<p>Source: <a href="https://en.wikipedia.org/wiki/Conditional_entropy#Properties">https://en.wikipedia.org/wiki/Conditional_entropy#Properties</a></p>
<p>$H[\rY \mid \rX] = H[(\rX, \rY)] - H[\rX]$</p>
<p>Bayes’ rule of conditional entropy:<br />
$H[\rY \mid \rX] = H[\rX \mid \rY] - H[\rX] + H[\rY]$</p>
<p>Minimum value:<br />
$H[\rY \mid \rX] = 0$ when $p(y \mid x)$ is always deterministic, i.e. one-hot, i.e. $p(y \mid x) \in \{0, 1\}$ for all $(x, y) \in X \times Y$.</p>
<p>Maximum value:<br />
$H[\rY \mid \rX] = H[\rY]$ when $\rX, \rY$ are independent.</p>
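<p>These identities are easy to verify numerically on a small joint table (the numbers below are my own example):</p>

```python
# Sketch: checking the chain rule, Bayes' rule of conditional entropy,
# and the H[Y] upper bound on an arbitrary joint distribution p(x, y).
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

pxy = np.array([[0.25, 0.15],
                [0.05, 0.55]])               # joint p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)    # marginals

H_y_given_x = H(pxy) - H(px)                 # H[Y|X] = H[(X,Y)] - H[X]
H_x_given_y = H(pxy) - H(py)

# Bayes' rule of conditional entropy
assert np.isclose(H_y_given_x, H_x_given_y - H(px) + H(py))
# maximum value: H[Y|X] <= H[Y], with equality iff X, Y independent
assert H_y_given_x <= H(py)
print(H_y_given_x, H_x_given_y)
```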
<h2 id="bayes-rule"><a class="header-anchor" href="#bayes-rule">Bayes’ Rule</a></h2>
<script type="math/tex; mode=display">p(y \mid x) = p(x \mid y)p(y)/p(x)</script>
<p>can be rewritten in terms of self-information:</p>
<script type="math/tex; mode=display">h(y \mid x) = h(x \mid y) + h(y) - h(x)\,.</script>
<p>The information contained in $y$ given $x$ equals the information contained in $x$ given $y$, plus the information contained in $y$, minus the information contained in $x$. This is just Bayes’ rule in log-space, but it makes it a bit easier to reason about what Bayes’ rule is doing. Whether $y$ is likely in its own right and whether $x$ is likely given $y$ both contribute to the total information.</p>
<h2 id="cross-entropy-and-kl-divergence"><a class="header-anchor" href="#cross-entropy-and-kl-divergence">Cross Entropy and KL-Divergence</a></h2>
<p>Unlike everything we’ve seen so far, these are necessarily functions of probability functions rather than of random variables. Further, both are comparisons of probability functions over the same support.</p>
<script type="math/tex; mode=display">H[P,Q] = -\sum_x P(x)\log Q(x)</script>
<script type="math/tex; mode=display">\kl{P}{Q} = \sum_{x} P(x)\log
{\frac{P(x)}{Q(x)}}</script>
<script type="math/tex; mode=display">\kl{P}{Q} = H[P,Q] - H[P]</script>
<p>Sources:</p>
<ul>
<li><a href="https://stats.stackexchange.com/questions/111445/analysis-of-kullback-leibler-divergence">https://stats.stackexchange.com/questions/111445/analysis-of-kullback-leibler-divergence</a></li>
<li><a href="https://stats.stackexchange.com/questions/357963/what-is-the-difference-cross-entropy-and-kl-divergence">https://stats.stackexchange.com/questions/357963/what-is-the-difference-cross-entropy-and-kl-divergence</a></li>
</ul>
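<p>The identity $\kl{P}{Q} = H[P,Q] - H[P]$ can be checked numerically (the two distributions below are my own picks):</p>

```python
# Sketch: KL(P || Q) equals cross entropy minus entropy, and is non-negative.
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.25, 0.25])

cross = float(-np.sum(P * np.log2(Q)))   # cross entropy H[P, Q]
ent = float(-np.sum(P * np.log2(P)))     # entropy H[P]
kl = float(np.sum(P * np.log2(P / Q)))   # KL divergence

assert np.isclose(kl, cross - ent)
assert kl >= 0                           # Gibbs' inequality
print(cross, ent, kl)
```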
<p>Mutual information can be <a href="https://en.wikipedia.org/wiki/Mutual_information#Relation_to_Kullback%E2%80%93Leibler_divergence">written in terms of KL-divergence</a>:</p>
<script type="math/tex; mode=display">I[\rX, \rY] = \kl{p_{X,Y}}{p_X \cdot p_Y} = \E_{x \sim \rX}\left[\kl{p_{Y\mid X}}{p_Y}\right]\,,</script>
<p>where $(p_X \cdot p_Y)(x, y) = p_X(x) \cdot p_Y(y)$ and $p_{Y\mid X}(y \mid x) = p_{X,Y}(x,y)/p_X(x)$.</p>
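<p>Both KL forms can be verified on a small joint table (my own example numbers):</p>

```python
# Sketch: mutual information as KL(p_XY || p_X p_Y) and as the expectation
# over x of KL(p_{Y|X=x} || p_Y); the two forms agree.
import numpy as np

def kl(p, q):
    """KL divergence in bits between two aligned probability arrays."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])               # joint p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_joint = kl(pxy, np.outer(px, py))          # KL(p_XY || p_X . p_Y)
I_expect = sum(px[i] * kl(pxy[i] / px[i], py)   # E_x[ KL(p_{Y|X} || p_Y) ]
               for i in range(2))

assert np.isclose(I_joint, I_expect)
print(I_joint)
```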
<h1 id="acknowledgments"><a class="header-anchor" href="#acknowledgments">Acknowledgments</a></h1>
<p>I would like to thank John Chung for extensive and Aneesh Mulye for excruciating feedback on the structure and language of this post.</p>
Tue, 09 Jun 2020 00:00:00 -0700
pragmanym.github.io/zhat/articles/primer-shannon-information