Our paper “Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities”
It connects several contemporary “SotA” data subset selection methods in deep learning (i.e., active learning & active sampling) to information-theoretic quantities from Bayesian deep learning, such as BALD (the expected information gain).
Our paper also provides a gentle tutorial on Fisher information and its approximation via Hessians in deep neural networks, focusing on how to compute various information-theoretic quantities with it. We also show how similarity matrices, which recent methods such as SIMILAR and PRISM rely on, fit into this picture.
This paper is the first (theoretical) part of a hopefully two-part work: in a follow-up project, we will look at empirical validation and answering some research questions which we pose throughout the paper.
Outline. We first explain what active learning and active sampling are about. Then, we introduce a unified notation for Fisher information, which we think is quite nice and hopefully useful, before looking at approximations of common information quantities and how they relate to the recent and existing literature.
Active learning and active sampling (together under the umbrella of data subset selection) are important topics in deep learning because they promise to do more with less: active learning reduces the amount of data we need to annotate by focusing on labeling the most “informative” samples, while active sampling reduces the amount of data we need to train on by sub-selecting from already-labeled samples. Thus, active learning operates without knowing the labels, while active sampling has access to them.
Many papers have been published on this over the years, including in deep learning and Bayesian deep learning. This is not a new problem, of course: we can trace the origins of determining the most informative samples back to Lindley (1956).
Our work explicitly connects “non-Bayesian” methods to these original Bayesian works and, we hope, shows that some intuitions and fuzzy definitions of “informativeness” are approximations of the same quantities.
Our paper presents a unified notation for information quantities, loss gradients, and Hessians, which directly connects Fisher information to information quantities and allows for expressive symbolic manipulation.
Probabilistic Model. We use a Bayesian perspective for this exposition. What makes it Bayesian is that we also place a distribution over the parameters \(\W\). Concretely, we assume the following factorization of the joint: \[\pof{\y, \w \mid \x} = \pof{\y \mid \x, \w} \, \pof{\w}.\] Finally, to make a prediction, we marginalize over the parameters (Bayesian model average): \[\pof{\y \mid \x} = \E{\pof{\w}}{\pof{\y \mid \x, \w}}.\]
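As a concrete (toy) illustration of the Bayesian model average, here is a minimal Monte-Carlo sketch. The logistic likelihood and standard normal prior are stand-ins of our choosing, not specifics from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive(x, w_samples):
    """Monte-Carlo Bayesian model average: p(y=1 | x) ~= mean_k p(y=1 | x, w_k)."""
    logits = w_samples @ x                  # one logit per parameter sample
    probs = 1.0 / (1.0 + np.exp(-logits))   # p(y=1 | x, w_k) under the toy likelihood
    return probs.mean()                     # marginalize over the parameter samples

# Stand-in prior p(w) = N(0, I); samples replace the exact expectation.
w_samples = rng.normal(size=(10_000, 2))
x = np.array([1.0, -0.5])
p_y1 = predictive(x, w_samples)
```

With a prior that is symmetric around zero, the averaged prediction lands near 1/2, even though individual \(\pof{\y \mid \x, \w}\) can be confident.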
Notation for Information Quantities. A central quantity in information theory is the information content \(-\log p\) of an event with probability \(p\): \[\xICof{p} := -\log p.\] We commonly use the information content as a minimization objective in deep learning (as the negative log-likelihood or via the cross-entropy).
The entropy is just the expectation over the information content: \[\Hof{X} = \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{\xICof{\pof{x}}} = \E{\pof{x}}{-\log \pof{x}},\] where we use the convention (introduced in “A Practical & Unified Notation for Information-Theoretic Quantities in ML” (Kirsch et al., 2021)) that lower-case outcomes denote information contents, \(\Hof{x} = \xICof{\pof{x}}\), while upper-case random variables denote entropies.
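In code, the entropy of a discrete distribution is literally the expectation of the information content. A small sketch in plain NumPy (the function names are ours):

```python
import numpy as np

def information_content(p):
    """Information content -log p of an event with probability p (in nats)."""
    return -np.log(p)

def entropy(probs):
    """Entropy H[X] = E_{p(x)}[-log p(x)]: the expected information content."""
    probs = np.asarray(probs)
    return float(np.sum(probs * information_content(probs)))

h_fair_coin = entropy([0.5, 0.5])  # a fair coin carries log 2 nats
```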
If we fix a \(\wstar\), then we have the observed information as the Hessian of the information content at \(\wstar\): \[\HofHessian{\y \given \x, \wstar} := \nabla_\w^2 \Hof{\y \given \x, \w} \big\vert_{\w=\wstar} = -\nabla_\w^2 \log \pof{\y \given \x, \w} \big\vert_{\w=\wstar},\] and the Fisher information as its expectation over the predictive distribution: \[\HofHessian{\Y \given \x, \wstar} := \E{\pof{\y \given \x, \wstar}}{\HofHessian{\y \given \x, \wstar}}.\]
Note: the observed information is sometimes defined with opposite sign in prior literature, but our definition here makes the connection between the two clearer: Fisher information is just observed information without knowledge of the label \(\y\).
Generally, we can take the properties and rules of the entropy and use the linearity of differentiation to transfer them to the Fisher information and observed information. Thus, both the Fisher information and the observed information are additive. For example: \[\begin{aligned} &\Hof{\wstar \given \y, \x} = \Hof{\y \given \x, \wstar} + \Hof{\wstar} - \Hof{\y \given \x}\\ \implies &\HofHessian{\wstar \given \y, \x} = \HofHessian{\y \given \x, \wstar} + \HofHessian{\wstar} - \HofHessian{\y \given \x} \\ \iff &\HofHessian{\wstar \given \y, \x} = \HofHessian{\y \given \x, \wstar} + \HofHessian{\wstar}, \end{aligned}\] where we can drop \(\HofHessian{\y \given \x}\) because it does not depend on \(\w\), so its Hessian (and Jacobian) are both zero. This shows that the Fisher information and observed information only depend on the likelihood and prior.
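Because differentiation is linear, this additivity carries over mechanically. A one-dimensional sketch with a conjugate Gaussian prior and likelihood (all constants are arbitrary choices of ours) checks that the Hessian of the negative log-joint is the sum of the likelihood and prior Hessians:

```python
import numpy as np

tau, sigma, x, y = 2.0, 0.5, 1.5, 0.7  # arbitrary toy constants

def neg_log_prior(w):
    # -log p(w) up to a constant, for the prior w ~ N(0, tau^2)
    return 0.5 * w**2 / tau**2

def neg_log_likelihood(w):
    # -log p(y | x, w) up to a constant, for y ~ N(w * x, sigma^2)
    return 0.5 * (y - w * x)**2 / sigma**2

def neg_log_joint(w):
    # -log p(y, w | x) = -log p(y | x, w) - log p(w) (+ const)
    return neg_log_likelihood(w) + neg_log_prior(w)

def second_derivative(f, w=0.3, eps=1e-4):
    """Central finite-difference estimate of f''(w)."""
    return (f(w + eps) - 2.0 * f(w) + f(w - eps)) / eps**2

h_posterior = second_derivative(neg_log_joint)
h_sum = second_derivative(neg_log_likelihood) + second_derivative(neg_log_prior)
```

Here the posterior Hessian matches the known posterior precision \(x^2/\sigma^2 + 1/\tau^2\) of this conjugate model.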
The benefit of this notation is that we can use intuitions for information theory to operate on the Fisher information and observed information, while it also makes clear how these quantities are connected to the entropy (as second-order derivatives).
We can use the Fisher information to approximate the entropy: when we use a Gaussian approximation of the posterior distribution \(\pof{\w \given \y, \x}\) around a fixed \(\wstar\), we have the following relationship between entropy and Hessian: \[\Hof{\W} \approx -\frac{1}{2} \log \det \HofHessian{\wstar} + C_k,\] where \(C_k = \frac{k}{2} \log 2 \pi e\) is a constant which depends only on the dimensionality \(k\) of \(\W\).
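This identity is easy to sanity-check numerically: for a Gaussian with precision (Hessian) \(H\), the closed-form entropy \(\frac{1}{2}\log\det(2\pi e \, H^{-1})\) equals \(-\frac{1}{2}\log\det H + \frac{k}{2}\log 2 \pi e\). A small sketch with a random positive-definite stand-in Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive-definite matrix as a stand-in Hessian/precision.
k = 4
A = rng.normal(size=(k, k))
H = A @ A.T + k * np.eye(k)

# Entropy from the Hessian: -1/2 log det H + k/2 log(2 pi e).
entropy_from_hessian = (-0.5 * np.linalg.slogdet(H)[1]
                        + 0.5 * k * np.log(2.0 * np.pi * np.e))

# Closed-form Gaussian entropy from the covariance Sigma = H^{-1}.
Sigma = np.linalg.inv(H)
entropy_closed_form = 0.5 * np.linalg.slogdet(2.0 * np.pi * np.e * Sigma)[1]
```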
Note: when \(\wstar\) is a (global) optimum, this is also known as the Laplace approximation. Computing second-order derivatives of neural networks is usually not feasible because of the memory and computational requirements. Instead, approximations are used.
We will use our notation to explain and introduce the generalized Gauss-Newton approximation and discuss last-layer approximations.
Note that to compute the Fisher information, we need to take an expectation over \(\y\). This can be expensive, particularly in the batch acquisition case, where we might need to iterate over many \(\y\)s. For some models, however, we can find simple solutions: there are models for which the Fisher information and observed information are independent of the specific \(\y\):
Generalized Linear Model. A generalized linear model (GLM) is a model \(\pof{y \given \logits=\encoder{x ; \w}}\) such that \(\log \pof{y \given \logits} = \logits^T T(y) - A(\logits) + \log h(y)\) is a distribution of the exponential family, independent of \(\w\), and \(\encoder{x ; \w} = \w^T \, x\) is linear in the parameters \(\w\).
For such a GLM, we have: \[\HofHessian{\Y \given \x, \wstar} = \HofHessian{\y \given \x, \wstar},\] where \(\y\) can be any target. If we use this equality for other models where it might not hold, we essentially apply the generalized Gauss-Newton approximation:
Generalized Gauss-Newton Approximation. In the case of an exponential family but not a GLM, the equality above is often used as an approximation for the observed information—we simply use the respective Fisher information as an approximation of the observed information: \[\HofHessian{\y \given \x, \wstar} \approx \HofHessian{\Y \given \x, \wstar}.\] This is known as the Generalized Gauss-Newton (GGN) approximation (Kunstner et al., 2019).
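For intuition, we can verify the GLM case numerically with logistic regression (a Bernoulli GLM): the observed information is the same whether the label is 0 or 1, because the label only enters the log-likelihood linearly through \(\logits^T T(y)\). A finite-difference sketch (names and constants are ours):

```python
import numpy as np

def nll(w, x, y):
    """Negative log-likelihood of logistic regression: log(1 + e^z) - y*z, z = w^T x."""
    z = w @ x
    return np.logaddexp(0.0, z) - y * z

def numeric_hessian(f, w, eps=1e-5):
    """Symmetric finite-difference Hessian of a scalar function f at w."""
    k = len(w)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.eye(k)[i] * eps
            ej = np.eye(k)[j] * eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4.0 * eps**2)
    return H

w = np.array([0.3, -0.7])
x = np.array([1.0, 2.0])
H_y0 = numeric_hessian(lambda w_: nll(w_, x, 0.0), w)  # observed information for y = 0
H_y1 = numeric_hessian(lambda w_: nll(w_, x, 1.0), w)  # observed information for y = 1
```

Both Hessians agree (up to finite-difference noise) with the analytic \(\sigma(z)(1 - \sigma(z)) \, x x^T\), so the observed and Fisher information coincide here.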
The GGN approximation and last-layer approaches feature heavily in the literature to make computing these quantities more tractable as they reduce computational requirements and memory usage.
Using this, we can approximate information quantities with the Fisher information and observed information. For the mutual information, we use the identity \(\MIof{A; B} = \Hof{A} - \Hof{A \given B}\).
Expected Information Gain (EIG). For example, the expected information gain \(\MIof{\W ; \Y \given \x} = \Hof{\W} - \Hof{\W \given \Y, \x}\) can thus be approximated around some \(\wstar\) as: \[ \begin{aligned} \MIof{\W ; \Y \given \x} &\approx -\frac{1}{2} \log \det \HofHessian{\wstar} + \frac{1}{2} \log \det \HofHessian{\wstar \given \Y, \x} \\ &= \frac{1}{2} \log \det (\HofHessian{\Y \given \x, \wstar} + \HofHessian{ \wstar} ) -\frac{1}{2} \log \det \HofHessian{\wstar} \\ &= \frac{1}{2} \log \det ((\HofHessian{\Y \given \x, \wstar} + \HofHessian{ \wstar}) \HofHessian{\wstar}^{-1}) \\ &= \frac{1}{2} \log \det (\HofHessian{\Y \given \x, \wstar} \HofHessian{\wstar}^{-1} + Id) \\ &\le \frac{1}{2} \tr (\HofHessian{\Y \given \x, \wstar} \HofHessian{\wstar}^{-1}). \end{aligned} \] The last step follows from the inequality \(\log \det (X + Id) \le \tr X\). Note that the first step is abridged, as we need to make some assumptions and approximations (which we look at in the paper), but overall this is roughly what is done in the literature.
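The final inequality is easy to check numerically, with random positive-definite stand-ins for the Fisher information and the parameter Hessian (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(k):
    """Random symmetric positive-definite matrix."""
    A = rng.normal(size=(k, k))
    return A @ A.T + np.eye(k)

k = 5
fisher = random_spd(k)   # stand-in for H[Y | x, w*]
prior_h = random_spd(k)  # stand-in for H[w*]

X = fisher @ np.linalg.inv(prior_h)
eig_logdet = 0.5 * np.linalg.slogdet(X + np.eye(k))[1]  # 1/2 log det(X + Id)
eig_trace = 0.5 * np.trace(X)                           # trace relaxation 1/2 tr X
```

The log-determinant form is always bounded above by the trace form because \(X\) has non-negative (real) eigenvalues and \(\log(1 + \lambda) \le \lambda\).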
Information Quantities. The EIG is just one of several information quantities used for active learning and active sampling; the paper presents a taxonomy of them and explains the details.
Instead of computing the Fisher information, other works often use similarity matrices. When the similarity matrices are based on inner products of loss gradients, we find that the respective log-determinant terms match the ones above via the matrix determinant lemma: \[\det (A B + M) = \det M \det (Id + B M^{-1} A).\] Similarity matrices can be written as a matrix product (similar to how we can compute a covariance matrix from the data matrix), and the Fisher information is, in essence, just the transposed product of that. Again, the details are in our paper.
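Concretely, with per-sample gradient rows \(G\) and a positive-definite Hessian \(M\), the lemma says the \(k \times k\) Fisher-side log-determinant equals the \(n \times n\) similarity-side one. A sketch with random stand-ins (our own toy setup, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 3, 6                       # n samples, k parameters
G = rng.normal(size=(n, k))       # rows: stand-in per-sample loss gradients
A = rng.normal(size=(k, k))
M = A @ A.T + np.eye(k)           # stand-in (positive-definite) Hessian
M_inv = np.linalg.inv(M)

# Fisher side: a k x k log-determinant built from G^T G.
logdet_fisher = np.linalg.slogdet(np.eye(k) + M_inv @ G.T @ G)[1]
# Similarity side: an n x n log-determinant of (whitened) gradient inner products.
logdet_similarity = np.linalg.slogdet(np.eye(n) + G @ M_inv @ G.T)[1]
```

When \(n \ll k\), as in batch acquisition, the \(n \times n\) similarity side is much cheaper to evaluate.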
So far, we have included a prior \(\pof{\w}\), but much of the related literature is not Bayesian per se: it uses the maximum likelihood estimate (MLE) or regularization, which can be interpreted as using a MAP estimate.
We can see that for an uninformative prior, we have \(\HofHessian{\w} \to 0\) (and \(\Hof{\W} \to \infty\)). We analyze this in the paper and draw connections to existing non-Bayesian methods.
In particular, we see that BADGE approximates the expected information gain and BAIT approximates the expected predictive information gain, while the acquisition functions using similarity matrices in SIMILAR and PRISM approximate different information quantities, depending on the acquisition function.
We want to highlight three insights from our paper in this blog post:
Taking a step back, we have seen that a Bayesian perspective using information quantities connects seemingly disparate literature. Although Bayesian methods are often seen as separate from (non-Bayesian) active learning and active sampling, the sometimes fuzzy notion of “informativeness” expressed through various different objectives in non-Bayesian settings collapses to the same few information quantities, which were, in principle, already known to Lindley (1956).
This was a whirlwind tour through parts of the paper. We pose several research questions based on the unified perspective and provide (hopefully) neat insights. Our paper also provides some initial empirical evaluations in the appendix, but we will keep more of that for a follow-up paper. While the proposed notation might seem like a gimmick, notation is important for structuring ideas, and we hope it proves useful for others as well.
I would like to thank Yarin Gal, as well as the members of OATML in general for their continued feedback. Special thanks also to the anonymous TMLR reviewers, who provided incredibly helpful reviews.
For more blog posts by OATML in Oxford, check out our group's blog https://oatml.cs.ox.ac.uk/blog.html.