In MacKay (2003) on page 2, the following straightforward approximation for a binomial coefficient is introduced: \[\begin{equation}
\log \binom{N}{r} \simeq(N-r) \log \frac{N}{N-r}+r \log \frac{N}{r}.
\end{equation}\] The derivation in the book is short but not very intuitive, although it feels like it should be. Information theory seems like the natural candidate for providing intuition. But information-theoretic quantities like entropies only apply to random variables, not to fixed observations. Or do they?
We will derive Stirling’s approximation for binomial coefficients using a neat notation for information-theoretic quantities that extends them beyond random variables. Along the way, we will see that this perspective might also let us use other tools from probability theory to estimate the approximation error, which could be interesting beyond this toy example.
\(
\require{newcommand}
\require{mathtools}
\newcommand{ifblank}[3]{#3}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\defeq}{\vcentcolon=}
\DeclareMathOperator{\opExpectation}{\mathbb{E}}
% Expectation over #1 of #2
\DeclarePairedDelimitersXPP{\E}[2]{\opExpectation_{#1}}{\lbrack}{\rbrack}{}{ %
\ifblank{#2}{\:\cdot\:}{#2}
}
%\newcommand{\E}[2]{\opExpectation_{#1} \left [ \ifblank{#2}{\:\cdot\:}{#2} \right ]}
% Expectation over #1 without brackets
\newcommand{\simpleE}[1]{\opExpectation_{#1}} % {\ifblank{#2}{\left [ \:\cdot\: \right ]}{#2}}}
% can be useful to refer to this outside \Set
\newcommand\MidSymbol[1][]{ %
\nonscript\:#1
%\allowbreak
\nonscript\:
\mathopen{}
}
% Helper commands for specifying conditionals etc
% (from mathtools documentation p 28)
% Make sure it exists
\newcommand\given{\MidSymbol[\vert]}
%% Shannon's Information Content
\DeclareMathOperator{\opInformationContent}{h}
\DeclarePairedDelimitersXPP{\ICof}[1]{\opInformationContent}{(}{)}{}{ %
\ifblank{#1}{\:\cdot\:}{#1}
}
%% Entropy
\DeclareMathOperator{\opEntropy}{H}
\DeclarePairedDelimitersXPP{\Hof}[1]{\opEntropy}{[}{]}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#1}{\:\cdot\:}{#1}
}
\DeclarePairedDelimitersXPP{\xHof}[1]{\opEntropy}{(}{)}{}{ %
\ifblank{#1}{\:\cdot\:}{#1}
}
\DeclareMathOperator{\opMI}{I}
\DeclarePairedDelimitersXPP{\MIof}[1]{\opMI}{[}{]}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#1}{\:\cdot\:}{#1}
}
\DeclareMathOperator{\opSuprise}{S}
\DeclarePairedDelimitersXPP{\Sof}[1]{\opSuprise}{[}{]}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#1}{\:\cdot\:}{#1}
}
%% Cross-Entropy
\DeclarePairedDelimitersXPP{\CrossEntropy}[2]{\opEntropy}{(}{)}{}{ %
\ifblank{#1#2}{\:\cdot\: \MidSymbol[\delimsize\Vert] \:\cdot\:}{#1 \MidSymbol[\delimsize\Vert] #2}
}
%\newcommand{\CrossEntropy}[2]{\E{#1}{-\log(#2)}}
%% Kullback-Leibler Divergence
\DeclareMathOperator{\opKale}{D_\mathrm{KL}}
\DeclarePairedDelimitersXPP{\Kale}[2]{\opKale}{(}{)}{}{ %
\ifblank{#1#2}{\:\cdot\: \MidSymbol[\delimsize\Vert] \:\cdot\:}{#1 \MidSymbol[\delimsize\Vert] #2}
}
%% Probabilities
\DeclareMathOperator{\opp}{p}
\DeclarePairedDelimitersXPP{\pof}[1]{\opp}{(}{)}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#1}{\:\cdot\:}{#1}
}
\DeclarePairedDelimitersXPP{\pcof}[2]{\opp_{#1}}{(}{)}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#2}{\:\cdot\:}{#2}
}
\DeclarePairedDelimitersXPP{\hpcof}[2]{\hat{\opp}_{#1}}{(}{)}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#2}{\:\cdot\:}{#2}
}
\DeclareMathOperator{\opq}{q}
\DeclarePairedDelimitersXPP{\qof}[1]{\opq}{(}{)}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#1}{\:\cdot\:}{#1}
}
%% Variational Entropy (Cross-Entropy)
\DeclarePairedDelimitersXPP{\varHof}[2]{\opEntropy_{\ifblank{#1}{\:\cdot\:}{#1}}}{[}{]}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#2}{\:\cdot\:}{#2}
}
\DeclarePairedDelimitersXPP{\xvarHof}[2]{\opEntropy_{\ifblank{#1}{\:\cdot\:}{#1}}}{(}{)}{}{ %
\renewcommand\given{\MidSymbol[\delimsize\vert]}
\ifblank{#2}{\:\cdot\:}{#2}
}
\)
Notation
A quick recap of the notation from “A Practical & Unified Notation for Information-Theoretic Quantities in ML”. It allows for information quantities between (untied) random variables and tied random variables with observed outcomes.
All quantities, which we will use below, can be consistently defined as follows:
(For random variables \(R\) and \(B\) with outcomes \(r\) and \(b\), respectively, we shorten “\(R=r\)” and “\(B=b\)” to “\(r\)” and “\(b\)” in expressions.)
- \(H[r] := h(p(r)) := -\log p(r),\)
- \(H[b,r] := h(p(b,r)) = -\log p(b, r),\)
- \(H[b \mid r ] := h(p(b \mid r)) = -\log p(b \mid r),\)
- \(H[B,r] := \mathbb{E}_{p(b|r)} H[b, r] = -\mathbb{E}_{p(b|r)} \log p(b, r),\) and
- \(H[B \mid r] := \mathbb{E}_{p(b|r)} H[b \mid r] = -\mathbb{E}_{p(b|r)} \log p(b \mid r).\)
\(\Hof{r}\) denotes the number of nats needed to encode the event \(R=r\) under the probability distribution \(\pof{r}\). Similarly, \(\Hof{B, r}\) denotes the expected number of nats needed to encode both \(B\) and \(R\) when \(R=r\), unbeknownst to the encoding scheme.
The main idea behind this notation is that it is consistent with taking expectations over the quantities to obtain regular entropies and so on. Most rules for information-theoretic quantities also apply directly to quantities that mix random variables and outcomes. More information and intuitions can be found in the paper.
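To make this concrete, here is a minimal sketch (a hypothetical toy example, not from the paper): two fair coin flips \(B = (B_1, B_2)\) with \(R = B_1 + B_2\) and observed outcome \(r = 1\). It computes the quantities above numerically and checks that the chain rule \(\Hof{B, r} = \Hof{B \given r} + \Hof{r}\) still holds when mixing random variables and outcomes.

```python
import numpy as np

# Hypothetical toy example: two fair coin flips B = (B1, B2), R = B1 + B2, observed r = 1.
p_b = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
r = 1

# R is fully determined by B, so p(b, r) = p(b) for every b consistent with r.
p_r = sum(p for b, p in p_b.items() if sum(b) == r)                    # p(r) = 1/2
p_b_given_r = {b: p / p_r for b, p in p_b.items() if sum(b) == r}      # p(b | r) = 1/2 each

H_r = -np.log(p_r)                                                     # H[r]
H_b_given_r = {b: -np.log(q) for b, q in p_b_given_r.items()}          # H[b | r]
H_B_r = sum(q * -np.log(p_b[b]) for b, q in p_b_given_r.items())       # H[B, r] = E_{p(b|r)} H[b, r]
H_B_given_r = sum(q * H_b_given_r[b] for b, q in p_b_given_r.items())  # H[B | r] = E_{p(b|r)} H[b | r]

assert np.isclose(H_B_r, H_B_given_r + H_r)  # the chain rule also holds for mixed quantities
```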
Figure 1. The relationship between the information quantities. \(B\) is the joint of the Bernoulli random variables, and \(R\) is the number of successes in \(B\) with observed outcome \(r\) (and thus follows a Binomial distribution). The arrow below \(\Hof{r}\) symbolizes that we minimize \(\Hof{r}\) by optimizing the success probability \(\rho\) to close the gap between \(\simpleE{\pof{b \given r}}{\Hof{b}}\) and \(\Hof{B \given r}\). This follows the diagrams in MacKay (2003), e.g. Figure 8.1.
We will start by introducing a simple probabilistic model that allows us to formulate the approximation as an upper bound. We will then show how the upper bound follows from dropping one term, and pick the success probability to tighten the bound as much as possible. This sounds more complicated than it is, but it is a fun way to showcase some tools that are often employed in mathematics.
We will relate the quantities we examine to information theory (as a means of quantifying optimal communication) by asking: what message does the quantity represent? This will hopefully provide additional intuitions to the reader.
Setup. Let \(B_1, \ldots, B_N\) be \(N\) Bernoulli random variables with success probability \(\rho\), and let \(B\) be the joint of these random variables.
Further, let \(R\) be the random variable that counts the number of successes in \(B\). \(R\) follows a Binomial distribution with success probability \(\rho\) and \(N\) trials.
Main Idea. For a given outcome \(r\) of \(R\), we have: \[\begin{equation}
\Hof{B, r} = \Hof{B \given r} + \Hof{r} \ge \Hof{B \given r},
\end{equation}\] as \(\Hof{\cdot}\) is non-negative for discrete random variables. We will examine this inequality to obtain the approximation.
Note that \(\Hof{B \given r}\) is the additional number of nats needed to encode \(B\) when the number of successes is already known. Similarly, \(\Hof{B, r}\) is the number of nats needed to encode both \(B\) and \(R\) under the circumstance that \(R=r\).
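As a sanity check of the main idea, here is a minimal sketch (with hypothetical values for \(N\), \(\rho\), and \(r\)) that computes \(\Hof{B, r}\), \(\Hof{B \given r}\), and \(\Hof{r}\) by brute-force enumeration and verifies the identity and the inequality:

```python
import itertools
import numpy as np
from scipy.stats import binom

# Hypothetical small example: N Bernoulli trials with success probability rho, observed r successes.
N, rho, r = 10, 0.3, 4

p_r = binom.pmf(r, N, rho)                                        # p(r), Binomial
bs = [b for b in itertools.product([0, 1], repeat=N) if sum(b) == r]
p_b = lambda b: rho ** sum(b) * (1 - rho) ** (N - sum(b))         # p(b) = p(b, r) for consistent b
p_b_given_r = [p_b(b) / p_r for b in bs]                          # p(b | r)

H_r = -np.log(p_r)                                                # H[r]
H_B_r = sum(q * -np.log(p_b(b)) for b, q in zip(bs, p_b_given_r)) # H[B, r]
H_B_given_r = sum(q * -np.log(q) for q in p_b_given_r)            # H[B | r]

assert np.isclose(H_B_r, H_B_given_r + H_r)  # H[B, r] = H[B | r] + H[r]
assert H_B_r >= H_B_given_r                  # the inequality we will exploit
```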
Determining \(\Hof{B, r}\). \(R\) is fully determined by \(B\), and thus we have \(\Hof{B, R}=\Hof{B}\) and hence also (this also follows immediately from \(\Hof{R \given B} = 0 \implies \forall r: \Hof{r \given B} = 0\)): \[\begin{equation}
\Hof{B, r} = \simpleE{\pof{b \given r}}{\Hof{b}}.
\end{equation}\] \(\simpleE{\pof{b \given r}}{\Hof{b}}\) is the expected number of nats needed to transmit the outcome \(b\) of \(B\) when \(r\) is given. When we encode \(b\), we do not know \(r\) upfront, so we need to transmit all \(N\) Bernoulli outcomes; that is, \(r\) successes and \(N-r\) failures. Given the success probability \(\rho\), the optimal message length for this is: \[\begin{align}
&\simpleE{\pof{b \given r}}{\Hof{b}} = r \, \ICof{\rho} + (N-r)\, \ICof{1-\rho} \\
& \quad = - r \log \rho - (N - r) \log (1-\rho).
\end{align}\] All this is visualized in Figure 1.
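The per-symbol view can also be checked directly: under the factorized Bernoulli code, every outcome \(b\) with \(r\) successes has the same codelength \(r \, \ICof{\rho} + (N-r)\, \ICof{1-\rho}\), regardless of the arrangement of successes. A minimal sketch (hypothetical \(N\) and \(\rho\), one sampled outcome):

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho = 30, 0.3                    # hypothetical values

b = rng.random(N) < rho             # one outcome of B
r = int(b.sum())                    # its number of successes

# Codelength of b under the factorized Bernoulli code: sum of per-symbol information contents.
codelength = -np.log(np.where(b, rho, 1 - rho)).sum()
assert np.isclose(codelength, -r * np.log(rho) - (N - r) * np.log(1 - rho))
```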
Alternative Argument. We can also look at the terms \(\Hof{B \given r} + \Hof{r}\) separately. We have \[\begin{equation}
\Hof{r} = -\log \pof{r} = - \log \left ( \binom{N}{r} \, \rho^r \,(1-\rho)^{N-r}\right ),
\end{equation}\] and \[\begin{equation}
\Hof{B \given r} = -\simpleE{\pof{b \given r}} \log \pof{b \given r} =
\log \binom{N}{r}.
\end{equation}\] The former follows from \(R\) being binomially distributed.
For the latter, we observe that we need to encode \(B\) while already knowing \(r\). Given \(r\), \(\pof{b \given r} = \text{const}\) for all valid \(b\), and there are \(\binom{N}{r}\) possible \(b\) for fixed \(r\). Hence, we can simply create a table of all \(\binom{N}{r}\) configurations with \(r\) successes and encode the index into this table. Each configuration is equally probable, so we have a uniform discrete distribution with entropy \(\log \binom{N}{r}\) and obtain the same result.
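Both routes give the same total, which is easy to cross-check numerically against SciPy's Binomial implementation (again with hypothetical values):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom

N, rho, r = 100, 0.3, 25                                                # hypothetical values

log_binom_coeff = gammaln(N + 1) - gammaln(r + 1) - gammaln(N - r + 1)  # log C(N, r)

H_B_r = -r * np.log(rho) - (N - r) * np.log(1 - rho)                    # H[B, r] from above
H_r = -binom.logpmf(r, N, rho)                                          # H[r] = -log p(r)
H_B_given_r = log_binom_coeff                                           # H[B | r] = log C(N, r)

assert np.isclose(H_B_r, H_B_given_r + H_r)
```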
Determining \(\rho\). With this, we are almost done. We already have \[\begin{align}
\Hof{B \given r} + \Hof{r} &= -r \log \rho - (N - r) \log (1-\rho) \notag \\
&\ge
\log \binom{N}{r} = \Hof{B \given r}.
\end{align}\] How do we make this inequality as tight as possible?
We need to minimize the gap \(\Hof{r}\), which creates the inequality in the first place, and \(\Hof{r}=-\log \pof{r}\) is minimized exactly when \(\pof{r}\) becomes maximal.
Hence, we choose the success probability \(\rho\) accordingly: the maximum likelihood solution \(\argmax_\rho \pof{r \given \rho}\) is \(\rho = \frac{r}{N}\). The Binomial distribution of \(R\) then has its mode, mean, and median at \(r\).
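A quick numerical sanity check of the maximum likelihood choice, using a dense grid over \(\rho\) (hypothetical \(N\) and \(r\)):

```python
import numpy as np
from scipy.stats import binom

N, r = 50, 13                                       # hypothetical values
rhos = np.linspace(1e-6, 1 - 1e-6, 100_001)         # dense grid over the success probability
rho_ml = rhos[np.argmax(binom.logpmf(r, N, rhos))]  # maximize p(r | rho)
assert np.isclose(rho_ml, r / N, atol=1e-4)
```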
Altogether, after substituting \(\rho = \frac{r}{N}\) and rearranging, we obtain the desired approximation: \[\begin{align}
\log \binom{N}{r} &\le -r \log \rho - (N - r) \log (1-\rho) \\
& = r \log \frac{N}{r} + (N - r) \log \frac{N}{N-r}.
\end{align}\]
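A minimal sketch comparing the resulting bound with the exact value for a few hypothetical \((N, r)\) pairs:

```python
import numpy as np
from scipy.special import gammaln

def log_binom(N, r):
    """Exact log C(N, r) via log-gamma."""
    return gammaln(N + 1) - gammaln(r + 1) - gammaln(N - r + 1)

def approx(N, r):
    """r log(N/r) + (N - r) log(N/(N - r)); assumes 0 < r < N."""
    return r * np.log(N / r) + (N - r) * np.log(N / (N - r))

for N, r in [(10, 3), (100, 50), (10_000, 1_234)]:  # hypothetical examples
    exact, upper = log_binom(N, r), approx(N, r)
    assert exact <= upper
    print(f"N={N}, r={r}: exact={exact:.2f}, bound={upper:.2f}, gap={upper - exact:.2f}")
```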
Approximation Error \(\Hof{r}\). The approximation error is just \(\Hof{r}\). We can easily upper-bound it with \(\Hof{r} \le \log (N+1) \approx \log N\):
First, \(\Hof{R} \le \log (N+1)\), as the uniform distribution over the \(N+1\) possible outcomes \(\{0, \ldots, N\}\), with entropy \(\log (N+1)\), is the maximum entropy distribution in this case (discrete random variable with finite support).
Second, \(\Hof{R}\) is the expectation of \(\Hof{r'}\) over the different outcomes \(r'\). Given the \(\rho\) we have chosen, \(\pof{r}\) is maximal, and thus \(\Hof{r}\) is minimal among all \(\Hof{r'}\). A minimum is never larger than the expectation, so \(\Hof{r} \le \Hof{R} \le \log (N+1)\).
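We can also verify numerically that the gap of the bound equals \(\Hof{r}\) at \(\rho = \frac{r}{N}\) and stays below \(\log(N+1)\) (hypothetical values):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom

for N, r in [(10, 3), (1_000, 400), (100_000, 50_000)]:           # hypothetical examples
    exact = gammaln(N + 1) - gammaln(r + 1) - gammaln(N - r + 1)  # log C(N, r)
    bound = r * np.log(N / r) + (N - r) * np.log(N / (N - r))     # our approximation
    H_r = -binom.logpmf(r, N, r / N)                              # H[r] at rho = r / N
    assert np.isclose(bound - exact, H_r) and H_r <= np.log(N + 1)
```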
Empirical Evaluation
In Figure 2 below, we show an empirical evaluation of our approximation compared to the true \(\log\) binomial coefficients for \(N\) up to \(10^5\). We chose \(r \approx \tfrac{N}{2}\).
(a) exact \(\log\) binomials vs our approximation
(b) approximation error vs estimated error
(c) relative approximation error vs estimated error
Figure 2. Comparison between the exact \(\log\) binomial coefficient and our approximation \(r \log \frac{N}{r} + (N - r) \log \frac{N}{N-r}\). Qualitatively, the approximation error seems to vanish quickly in (a). While the absolute error increases slowly in (b), the relative error vanishes quickly in (c). The estimated approximation error of roughly \(\log N\) is not tight but still a good predictor. More details can be found in this Google Colab.
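For readers who want to reproduce the comparison without the Colab, here is a minimal sketch in the same spirit (not the linked Colab itself): it sweeps \(N\) up to \(10^5\) with \(r = \lfloor N/2 \rfloor\) and plots the relative error against the estimated error bound.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gammaln

# Sweep N up to 1e5 with r = N // 2, mirroring the setup described above.
Ns = np.unique(np.logspace(1, 5, 200).astype(int))
rs = Ns // 2

exact = gammaln(Ns + 1) - gammaln(rs + 1) - gammaln(Ns - rs + 1)    # log C(N, r)
approx = rs * np.log(Ns / rs) + (Ns - rs) * np.log(Ns / (Ns - rs))  # our approximation

plt.plot(Ns, (approx - exact) / exact, label="relative approximation error")
plt.plot(Ns, np.log(Ns + 1) / exact, label="estimated error bound log(N+1) / exact")
plt.xscale("log"); plt.yscale("log"); plt.xlabel("N"); plt.legend(); plt.show()
```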
Conclusion
We have derived Stirling’s approximation for binomial coefficients and upper-bounded the approximation error, gaining new intuitions along the way. We have done so using simple probability theory and information theory.
Acknowledgements
We would like to thank Yarin Gal, Tim Rudner, Ravid Shwartz-Ziv, Clare Lyle, Joost van Amersfoort, as well as the members of OATML in general for their feedback.
More from OATML
For more blog posts by OATML in Oxford, check out our group’s blog
https://oatml.cs.ox.ac.uk/blog.html.