“Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning” has been published in the Transactions on Machine Learning Research (TMLR). This paper challenges the status quo in deep active learning for medium-to-large acquisition batch sizes.
The paper examines a simple approach to batch acquisition: stochastic extensions of single-sample acquisition functions, namely Power, Softmax, and SoftRank sampling. The motivation behind this approach is twofold:
1. The common practice of top-K acquisition is inherently flawed, as it erroneously assumes that acquisition scores are independent.
2. More complex methods like BatchBALD and BADGE address this flaw but are computationally expensive.
The rest of this blog post presents the results of this paper together with additional investigations and experiments. For all the details, please refer to the paper.
Stochastic sampling methods can counter the shortcomings of top-K acquisition while being much cheaper to compute than BatchBALD or BADGE. Adding Gumbel noise to the scores or score ranks induces a stochastic sampling distribution, and we sample a batch without replacement from this distribution by selecting the top-K noised scores (the Gumbel-Top-K trick).
Specifically, we model the noise distribution of future acquisition scores as the addition of Gumbel-distributed noise \(\epsilon_i \sim \mathrm{Gumbel}(0,\beta^{-1})\), and sample a batch without replacement from the induced distribution:
Stochastic Extension | Score Update | Acquisition Distribution |
---|---|---|
Softmax | \(s_i^{\text{Softmax}} = s_i + \epsilon_i\) | \(p(i) \propto \exp(\beta s_i)\) |
Power | \(s_i^{\text{Power}} = \log s_i + \epsilon_i\) | \(p(i) \propto s_i^\beta\) |
SoftRank | \(s_i^{\text{SoftRank}} = -\log r_i + \epsilon_i\) | \(p(i) \propto r_i^{-\beta}\) |
where \(s_i\) is the acquisition score of sample \(i\), \(r_i\) is the rank of sample \(i\), and \(\beta > 0\) is an inverse temperature parameter (coldness). For \(\beta \to \infty\), the distribution becomes deterministic and recovers top-K acquisition, and for \(\beta \to 0\), it becomes uniform.
This method is based on the Gumbel-Max trick, first described by Gumbel (1954), and its generalization to top-\(K\) selection:
Proposition (Gumbel-Top-K): For scores \(s_i\), where \(i \in \{1, \ldots, n\}\), \(k \le n\), and \(\beta > 0\), if we draw \(\epsilon_i \sim \mathrm{Gumbel}(0,\beta^{-1})\) independently, then \(\arg \mathrm{top}_k \{s_i + \epsilon_i\}_i\) is an (ordered) sample without replacement from the categorical distribution \(\mathrm{Categorical}\left(\frac{\exp(\beta \, s_i)}{\sum_j \exp(\beta \, s_j)}\right)\).
Applying this proposition to the Softmax, Power, and SoftRank acquisition functions, we obtain the distributions in the table above.
SoftRank acquisition does not take the scores themselves into account, only their rank. Softmax and Power acquisition, on the other hand, do use the scores, and while Softmax acquisition does not require non-negative scores, Power acquisition does. In particular, a score of \(0\) results in a selection probability of \(0\) under Power acquisition. This aligns with the semantics of many scoring functions, in particular BALD and Entropy, whose scores are non-negative and for which a score of \(0\) marks an entirely uninformative sample.
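To make this concrete, here is a minimal NumPy sketch of the three extensions (a sketch of ours, not the paper's reference implementation; function and variable names are illustrative):

```python
import numpy as np

def stochastic_batch(scores, k, beta=1.0, mode="power", rng=None):
    """Sample an acquisition batch of size k without replacement using the
    Gumbel-Top-k trick on (transformed) acquisition scores."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)

    if mode == "softmax":        # p(i) ∝ exp(beta * s_i)
        base = scores
    elif mode == "power":        # p(i) ∝ s_i^beta, requires s_i >= 0
        base = np.log(scores)    # log(0) = -inf: zero-score points are never chosen
    elif mode == "softrank":     # p(i) ∝ r_i^(-beta), rank 1 = highest score
        ranks = np.empty(len(scores))
        ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
        base = -np.log(ranks)
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Gumbel(0, 1/beta) noise turns deterministic top-k selection into
    # sampling without replacement from the induced distribution.
    noised = base + rng.gumbel(loc=0.0, scale=1.0 / beta, size=len(scores))
    return np.argsort(-noised)[:k]

# Example usage with BALD-like non-negative scores for a pool of 10,000 points.
pool_scores = np.random.default_rng(0).exponential(scale=0.1, size=10_000)
batch_indices = stochastic_batch(pool_scores, k=100, beta=1.0, mode="power")
```

Note how \(\log 0 = -\infty\) under Power acquisition automatically excludes zero-score points, matching the semantics discussed above.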
The results are encouraging: stochastic sampling outperforms top-K and often matches, sometimes even outperforms, complex methods like BatchBALD and BADGE. Computationally, it is orders of magnitude cheaper than these methods, making it a highly efficient alternative for batch acquisition in deep active learning.
The table below reports runtimes per acquired batch in seconds (mean ± std); the percentage next to each method is the final accuracy it reaches:

Batch Size | Top-Batch (BALD), 80% | Stochastic (PowerBALD), 90% | BatchBALD, 90% | BADGE, 86% |
---|---|---|---|---|
10 | 0.2±0.0 | 0.2±0.0 | 566.0±17.4 | 9.2±0.3 |
100 | 0.2±0.0 | 0.2±0.0 | 5,363.6±95.4 | 82.1±2.5 |
500 | 0.2±0.0 | 0.2±0.0 | 29,984.1±598.7 | 409.3±3.7 |
Concretely, we find that among the three stochastic extensions, Power acquisition performs best in our experiments. We generally keep \(\beta=1\) fixed but also provide ablations for it in the paper.
But the paper goes beyond these empirical observations. It raises important questions about when and why more complex methods might actually be beneficial, and it challenges the use of top-K acquisition as the default baseline. Together with other important work on stochastic acquisition strategies, it points towards stochastic sampling as a simple, strong baseline for batch active learning.
In the paper itself, we examine the assumptions about the underlying score dynamics by tracking acquisition scores across acquisitions. We also hypothesize about the conditions under which top-K acquisition is most detrimental to active learning and about fundamental limitations of BatchBALD’s empirical estimator. That said, we also argue that later in training, top-K acquisition might be less detrimental, so the acquisition size could be increased with less loss in performance.
Why Gumbel Noise? The selection of the \(k\)-th point in the acquisition batch is based on the additional information that the remaining \(K-k\) points will provide. We thus aim to model the maximum over all possible additional candidate points yet to be selected.
Empirically, acquisition scores resemble a truncated exponential distribution (see also Figure 6(c)), and by extreme value theory (EVT), the maximum over many exponential samples is well approximated by a Gumbel distribution: the exponential distribution lies in the Gumbel domain of attraction.
However, the increases in acquisition scores are unlikely to be perfectly modeled as i.i.d. exponential random variables. Still, the Gumbel approximation seems to hold empirically even for the hypoexponential distribution, i.e., a sum of independent exponential random variables with different rates. This motivates using a Gumbel distribution as a simple model for the increase in acquisition scores, under the assumption that the increase at each acquisition step is exponentially distributed with a step-dependent rate.
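A quick simulation (our own, with arbitrary rates, not taken from the paper) illustrates this: the maximum over a pool of hypoexponential samples is closely matched by a fitted Gumbel distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_steps, n_points, n_trials = 5, 2_000, 500

# Each candidate's score increase is hypoexponential: a sum of exponentials
# with a different (arbitrarily chosen) rate per acquisition step.
rates = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
increments = rng.exponential(1.0 / rates, size=(n_trials, n_points, n_steps))
scores = increments.sum(axis=-1)   # hypoexponential score per candidate point
maxima = scores.max(axis=1)        # maximum over the candidate pool per trial

# Fit a Gumbel distribution to the empirical maxima and check the fit.
loc, scale = stats.gumbel_r.fit(maxima)
ks = stats.kstest(maxima, "gumbel_r", args=(loc, scale))
print(f"Gumbel fit: loc={loc:.2f}, scale={scale:.2f}, KS statistic={ks.statistic:.3f}")
```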
For a more detailed explanation and numerical simulations, please refer to our paper; see also Zhan et al. (2022).
Acquisition Asymptotics of Bayesian Models. For well-specified and well-defined Bayesian parametric models, the posterior distribution of the model parameters converges to the true parameters as the number of training points increases.
For such models, and assuming that the predictions are independent given the model parameters, the total correlation between the predictions decreases as the number of training points increases, because the posterior distribution of the model parameters becomes more concentrated around the true parameters: \[\begin{align} \operatorname{TC}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] \to 0 \quad \text{as} \quad |\mathcal{D}_{\text{train}}| \to \infty. \end{align}\] This can be proved by noting that in the infinite data limit, the posterior parameter distribution converges to the true model parameters, and the joint predictive distribution then factorizes. In other words, the predictions become more independent as the number of training points increases and fully independent in the infinite data limit.
The total correlation is defined as: \[\begin{align} &\operatorname{TC}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] \\ &\quad \triangleq \underbrace{\sum_i \operatorname{H}[Y_i | x_i, \mathcal{D}_{\text{train}}]}_{\text{top-K Entropy}} - \underbrace{\operatorname{H}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}]}_{\text{'Batch Entropy'}}. \end{align}\] We can also write the total correlation as the difference between top-K BALD and BatchBALD: \[\begin{align} &\operatorname{TC}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] \\ &\quad = \underbrace{\sum_i \operatorname{I}[Y_i; \Omega | x_i, \mathcal{D}_{\text{train}}]}_{\text{top-K BALD}} - \underbrace{\operatorname{I}[Y_1, \ldots, Y_K; \Omega | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}]}_{\text{BatchBALD}}. \end{align}\] As the total correlation converges to \(0\), the top-K BALD term (the first term) becomes equal to the BatchBALD term (the second term), and the same happens for top-K Entropy and 'Batch Entropy', which we could define similarly.
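The following toy Monte-Carlo sketch (ours; the Dirichlet draws stand in for posterior-predictive probabilities from real forward passes with sampled parameters) computes top-K BALD, BatchBALD, and their difference, the total correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, C = 100, 3, 4  # MC parameter samples, batch size, number of classes

# Toy posterior predictive p(y_i | x_i, ω_m): shape (M, K, C).
probs = rng.dirichlet(np.ones(C), size=(M, K))

def entropy(p, axis=-1):
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=axis)

# Top-K BALD: sum_i ( H[E_ω p_i] - E_ω H[p_i] ).
marginals = probs.mean(axis=0)  # (K, C)
top_k_bald = entropy(marginals).sum() - entropy(probs).mean(axis=0).sum()

# BatchBALD: H[Y_1..Y_K] - E_ω sum_i H[p_i]. The full joint
# p(y_1..y_K) = E_ω prod_i p(y_i | ω) is feasible for small K and C.
joint = np.ones((M,) + (C,) * K)
for i in range(K):
    shape = [1] * K
    shape[i] = C
    joint = joint * probs[:, i].reshape(M, *shape)
joint = joint.mean(axis=0)
batch_bald = entropy(joint.ravel()) - entropy(probs).sum(axis=1).mean()

tc = top_k_bald - batch_bald  # total correlation, always >= 0
print(f"top-K BALD={top_k_bald:.3f}, BatchBALD={batch_bald:.3f}, TC={tc:.3f}")
```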
Thus, for well-specified and well-defined Bayesian parametric models, the top-K acquisition functions will eventually become equivalent to the BatchBALD and ‘BatchEntropy’ acquisition functions as the number of training points increases. This tells us that top-K acquisition is the most detrimental to active learning in the earlier stages of learning, when the total correlation between the predictions is still high. This is consistent with our empirical results below (‘Increasing Top-K Analysis’).
At the same time, as the number of training points increases and the model parameters concentrate, the expected information gain (BALD) also decreases. The mutual information with a deterministic variable is always \(0\), and thus: \[\begin{align} \operatorname{I}[Y; \Omega | x, \mathcal{D}_{\text{train}}] \to 0 \quad \text{as} \quad |\mathcal{D}_{\text{train}}| \to \infty. \end{align}\] This asymptotic behavior is a trivial but important result: it tells us that the expected information gain (BALD) will eventually become uninformative and no better than random acquisition as the number of training points increases. The important question is: when? Given that we only have noisy estimators, this determines for how long active learning remains useful compared to random acquisition.
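As a sanity check (our own toy illustration, not from the paper), a Beta-Bernoulli model shows the expected information gain for the next label vanishing as the posterior concentrates:

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald(a, b, n_mc=200_000):
    """Monte-Carlo estimate of I[Y; θ] for a Beta(a, b) posterior over θ."""
    theta = rng.beta(a, b, size=n_mc)
    return bernoulli_entropy(theta.mean()) - bernoulli_entropy(theta).mean()

# Posterior after N coin flips with half heads: Beta(1 + N/2, 1 + N/2).
for n in [0, 10, 100, 1_000]:
    print(f"N={n:5d}: BALD ≈ {bald(1 + n / 2, 1 + n / 2):.4f} nats")
```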
Many active learning methods that are considered non-Bayesian nevertheless approximate the expected information gain or the expected predictive information, so these asymptotic arguments apply to them as well.
Finally, we observe that estimators such as BatchBALD, which rely on Monte-Carlo samples for parameter approximations, are inherently limited by the logarithm of the total number \(M\) of these samples, \(\log M\). This constraint implies that their informativeness can diminish rapidly. Concretely, consider an empirical estimator \(\hat{\operatorname{I}}[\cdot;\Omega]\) built using Monte-Carlo samples \(\omega_1, \ldots, \omega_M\). This is analogous to computing the exact mutual information \(\operatorname{I}[\cdot; \hat{\Omega}]\) with the 'empirical' random variable \(\hat{\Omega}\), which samples uniformly from \(\omega_1, \ldots, \omega_M\). Given that the discrete mutual information is bounded by the entropy of either of its arguments, we have: \[ \hat{\operatorname{I}}[\cdot;\Omega] = \operatorname{I}[\cdot; \hat{\Omega}] \le \operatorname{H}[\hat{\Omega}] = \log M. \]
For instance, BatchBALD employs a greedy approach to select the \(i\)-th acquisition sample in its batch by maximizing the empirical \(\hat{\operatorname{I}}[Y ;\Omega \mid x, Y_{i-1}, x_{i-1}, \ldots, Y_1, x_1, \mathcal{D}_{\text{train}}]\) over the candidate samples \(x\). Summing these conditional terms telescopes into the joint estimate, which the bound above caps: \[ \log M \ge \hat{\operatorname{I}}[Y_1, \ldots, Y_K;\Omega \mid x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] = \sum_{i=1}^K \hat{\operatorname{I}}[Y_i ;\Omega \mid x_i, Y_{i-1}, x_{i-1}, \ldots, Y_1, x_1, \mathcal{D}_{\text{train}}]. \] As \(K\) grows, the estimated \(\hat{\operatorname{I}}[Y_K ;\Omega \mid x_K, Y_{K-1}, x_{K-1}, \ldots, Y_1, x_1, \mathcal{D}_{\text{train}}]\) is thus pushed towards zero. For a scenario with \(M = 100\) parameter samples in a classification problem with 10 categories, each maximally disagreeing point can consume up to \(\log 10\) of the \(\log M = 2 \log 10\) budget, so BatchBALD can lose its informativeness after just two acquisitions. This situation arises whenever the pool set contains at least two highly diverse, that is uncorrelated, data points with maximum disagreement.
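A small numerical illustration of this bound (ours; deterministic per-sample predictions model the maximum-disagreement case):

```python
import numpy as np

rng = np.random.default_rng(0)
M, C = 100, 10  # MC parameter samples and classes; log(M) ≈ 4.605 nats

# Maximum-disagreement toy case: given ω_m, each candidate point's prediction
# is a deterministic class, drawn at random per (parameter sample, point).
classes = rng.integers(0, C, size=(M, 8))  # predictions for 8 candidate points

for K in range(1, 9):
    # With deterministic per-ω predictions, the conditional entropy vanishes,
    # so the empirical joint MI equals the entropy of the class-tuple
    # distribution across the M samples, which can never exceed log M.
    _, counts = np.unique(classes[:, :K], axis=0, return_counts=True)
    p = counts / M
    print(f"K={K}: empirical joint MI ≈ {-(p * np.log(p)).sum():.3f} nats")
```

The estimate saturates at \(\log 100 \approx 4.6\) nats after roughly two points, matching the argument above.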
Rank Correlations Across Acquisitions. We made the following (implicit) assumptions to argue why top-K acquisition is flawed:

1. The acquisition scores \(s_t\) at step \(t\) are a proxy for the scores \(s_{t'}\) at a later step \(t' > t\);
2. the larger \(t'-t\) is, the worse a proxy \(s_t\) is for \(s_{t'}\);
3. this effect is largest for the most informative points.
We demonstrate these assumptions empirically by examining the Spearman rank correlation between scores across acquisitions. Specifically, we train a model for \(n\) steps using BALD as a single-point acquisition function and compare the rank order at each step to the starting rank order at step \(t\). To denoise the rankings across \(n\), we smooth the rank correlations with a Parzen window of size 10, and to reduce the effect of noise on the rank order, we round all scores to 2 decimal places. This especially removes unimportant rank changes among points with low scores around 0.
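In code, this analysis could look roughly as follows (a sketch under our assumptions about the data layout; the Parzen window comes from `scipy.signal.windows`):

```python
import numpy as np
from scipy.signal.windows import parzen
from scipy.stats import spearmanr

def rank_correlation_curve(score_history, t=0, decimals=2, window_size=10):
    """Spearman rank correlation of each step's scores against step t.

    score_history: array of shape (n_steps, n_pool) of acquisition scores.
    Rounding suppresses meaningless rank swaps among near-zero scores; a
    normalized Parzen window smooths the curve across acquisition steps.
    """
    scores = np.round(np.asarray(score_history), decimals)
    corrs = np.array([spearmanr(scores[t], s).correlation for s in scores])
    kernel = parzen(window_size)
    return np.convolve(corrs, kernel / kernel.sum(), mode="same")
```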
Figure 1 shows that acquisition scores become less correlated as more points are acquired. Figure 4(a) shows this in more detail for the top and bottom 1%, 10% or 100% of scorers of the test set across acquisitions starting at step \(t=0\) for a model initialised with 20 points. The top-10% scoring points quickly become uncorrelated across acquisitions and even become anti-correlated. In contrast, the points overall correlate well over time (although they have a much weaker training signal on average). This result supports all three of our hypotheses.
At the same time, we see that as training progresses, and we converge towards the best model, the order of scores becomes more stable across acquisitions. In Figure 4(b) the model begins with 120 points (\(t=100\)), rather than 20 (\(t=0\)). Here, the most informative points are less likely to change their rank—even the top-1% ranks do not become anti-correlated, only decorrelated. Thus, we hypothesise that further in training, we might be able to choose larger \(K\).
We do not display the correlations for the bottom 1% of scorers as their scores are close to 0 throughout training and thus noisy and uninformative; see also Figure 6(c).
Overall, this analysis can be confounded by noisy samples and by rank-order swaps between samples with similar scores; such swaps are not meaningful in the sense that they would not influence batch acquisition.
Thus, to provide a different analysis, we consider in Figure 6 the more direct question of how likely it is that other samples have higher acquisition scores at step \(t+n\) than the top samples from step \(t\), for different quantiles (1%, 10%, 50%) of the test set. As a sanity check, we also examine the bottom quantiles. This is equivalent to computing the AUROC between the original top/bottom sets as 'ground truth' and the later scores as predictors: for acquisition step \(n\), the AUROC is \[P(S^{t, \text{top/bottom } p \%}_{n} > S^{t, \text{bottom/top } (1 - p) \%}_{n})\] with \(S^{t, \text{top/bottom } p \%}_{n} = \{s_{t+n, i} : s_{t, i} \in \text{top/bottom } p\% \text{ of } \{s_{t, j}\}_j\}\). Specifically, we set up a binary classification task with the top or bottom 1%, 10% or 50% of the test set as positives and the rest as negatives. This again quantifies how much the scores meaningfully change across acquisition steps. These results match the previous ones and provide another validation of the assumptions above.
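A minimal version of this AUROC computation (ours; assuming scikit-learn) could be:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def top_quantile_auroc(scores_t, scores_tn, p=0.10):
    """AUROC of the step-(t+n) scores at separating the step-t top-p% scorers.

    Pool points in the top p% at step t are the positives; the scores at
    step t+n act as the predictor. The AUROC then equals P(S_top > S_rest),
    up to ties.
    """
    scores_t = np.asarray(scores_t)
    labels = scores_t >= np.quantile(scores_t, 1 - p)
    return roc_auc_score(labels.astype(int), scores_tn)
```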
Increasing Top-\(K\) Analysis. Another way to investigate the effect of top-\(K\) selection is to freeze the acquisition scores during training and then continue single-point 'active learning' as if those were the correct scores. Comparing this to the performance of regular active learning with updated single-point scores allows us to examine how well earlier scores perform as proxies for later scores. We perform this toy experiment on MNIST in Figure 5, showing that freezing scores early on greatly harms performance while doing it later has only a small effect. For frozen scores at a training set size of 20 (73% accuracy, \(t=0\)), the accuracy matches single-acquisition BALD up to a training set size of roughly 40 (dashed orange lines) before diverging to a lower level. But when freezing the scores of a more accurate model, at a training set size of 120 labels (93% accuracy, \(t=100\)), selecting the next fifty points according to those frozen scores performs indistinguishably from step-by-step acquisition (dashed green lines). This result shows that top-\(K\) acquisition hurts less later in training but can negatively affect performance at the beginning of training.
These observations lead us to ask whether we could dynamically change the acquisition size: with smaller acquisition batches at the beginning and larger ones towards the end of active learning. We leave the exploration of this for future work.
In conclusion, our paper explores simple stochastic extensions of standard acquisition functions. These extensions are not only far more efficient than complex batch acquisition methods, but they often match and sometimes even outperform them across a variety of datasets and settings.
We would like to thank the members of OATML for their continued feedback, with special thanks to the anonymous TMLR reviewers, who provided incredibly helpful reviews, and to our action editor.
For more blog posts by OATML in Oxford, check out the group's blog https://oatml.cs.ox.ac.uk/blog.html.