“Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning” has been published in the Transactions on Machine Learning Research (TMLR). This paper challenges the status quo in deep active learning for medium-to-large acquisition batch sizes.
The paper examines a simple approach to batch acquisition: stochastic extensions of single-sample acquisition functions, namely Power, Softmax, and SoftRank sampling. The motivation behind this approach is twofold:
1. The common practice of top-K acquisition is inherently flawed, as it erroneously assumes that acquisition scores are independent.
2. More complex methods like BatchBALD and BADGE address this flaw but are computationally expensive.
The rest of this blog post presents the results of this paper together with additional investigations and experiments. For all the details, please refer to the paper.
Stochastic sampling methods can counter the shortcomings of top-K acquisition while being much cheaper to compute than BatchBALD or BADGE. Adding Gumbel noise to the scores or score ranks induces a stochastic sampling distribution, and we sample a batch without replacement from this distribution by selecting the top-K noised scores (the Gumbel-Top-K trick).
Specifically, we model the noise distribution of future acquisition scores as the addition of Gumbel-distributed noise \(\epsilon_i \sim \mathrm{Gumbel}(0,\beta^{-1})\), and sample a batch without replacement from the induced distribution:
Stochastic Extension | Score Update | Acquisition Distribution |
---|---|---|
Softmax | \(s_i^{\text{Softmax}} = s_i + \epsilon_i\) | \(p(i) \propto \exp(\beta s_i)\) |
Power | \(s_i^{\text{Power}} = \log s_i + \epsilon_i\) | \(p(i) \propto s_i^\beta\) |
SoftRank | \(s_i^{\text{SoftRank}} = -\log r_i + \epsilon_i\) | \(p(i) \propto r_i^{-\beta}\) |
where \(s_i\) is the acquisition score of sample \(i\), \(r_i\) is the rank of sample \(i\), and \(\beta > 0\) is an inverse temperature parameter (coldness). For \(\beta \to \infty\), the distribution becomes deterministic and recovers top-K acquisition, and for \(\beta \to 0\), it becomes uniform.
This method is based on the Gumbel-Max trick, first described by Gumbel (1954), and its generalization to top-\(K\) selection:
Proposition (Gumbel-Top-K): For scores \(s_i\), where \(i \in \{1, \ldots, n\}\), \(k \le n\), and \(\beta > 0\), if we draw \(\epsilon_i \sim \mathrm{Gumbel}(0,\beta^{-1})\) independently, then \(\arg \mathrm{top}_k \{s_i + \epsilon_i\}_i\) is an (ordered) sample without replacement from the categorical distribution \(\mathrm{Categorical}\left(\frac{\exp(\beta \, s_i)}{\sum_j \exp(\beta \, s_j)}\right)\).
Applying this proposition to the Softmax, Power, and SoftRank acquisition functions, we obtain the distributions in the table above.
SoftRank acquisition does not take the scores themselves into account, only their rank. Softmax and Power acquisition, on the other hand, do use the scores, and while Softmax acquisition does not require non-negative scores, Power acquisition does. In particular, a score of \(0\) results in a selection probability of \(0\) under Power acquisition. This aligns with the semantics of many scoring functions, in particular BALD and Entropy, whose scores are non-negative and for which a score of \(0\) marks an entirely uninformative sample.
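To make this concrete, here is a minimal NumPy sketch of the three extensions (a sketch of ours, not the paper's reference implementation; function and variable names are illustrative):

```python
import numpy as np

def stochastic_batch(scores, k, beta=1.0, mode="power", rng=None):
    """Sample an acquisition batch of size k without replacement using the
    Gumbel-Top-k trick on (transformed) acquisition scores."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)

    if mode == "softmax":        # p(i) ∝ exp(beta * s_i)
        base = scores
    elif mode == "power":        # p(i) ∝ s_i^beta, requires s_i >= 0
        base = np.log(scores)    # log(0) = -inf: zero-score points are never chosen
    elif mode == "softrank":     # p(i) ∝ r_i^(-beta), rank 1 = highest score
        ranks = np.empty(len(scores))
        ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
        base = -np.log(ranks)
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Gumbel(0, 1/beta) noise turns deterministic top-k selection into
    # sampling without replacement from the induced distribution.
    noised = base + rng.gumbel(loc=0.0, scale=1.0 / beta, size=len(scores))
    return np.argsort(-noised)[:k]

# Example usage with BALD-like non-negative scores for a pool of 10,000 points.
pool_scores = np.random.default_rng(0).exponential(scale=0.1, size=10_000)
batch_indices = stochastic_batch(pool_scores, k=100, beta=1.0, mode="power")
```

Note how \(\log 0 = -\infty\) under Power acquisition automatically excludes zero-score points, matching the semantics discussed above.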
The results are encouraging: stochastic sampling outperforms top-K and often matches, sometimes even outperforms, complex methods like BatchBALD and BADGE. Computationally, it is orders of magnitude cheaper than these methods, making it a highly efficient alternative for batch acquisition in deep active learning.
The table below reports runtimes per acquired batch in seconds (mean ± std); the percentage next to each method is the final accuracy it reaches:

Batch Size | Top-Batch (BALD), 80% | Stochastic (PowerBALD), 90% | BatchBALD, 90% | BADGE, 86% |
---|---|---|---|---|
10 | 0.2±0.0 | 0.2±0.0 | 566.0±17.4 | 9.2±0.3 |
100 | 0.2±0.0 | 0.2±0.0 | 5,363.6±95.4 | 82.1±2.5 |
500 | 0.2±0.0 | 0.2±0.0 | 29,984.1±598.7 | 409.3±3.7 |
Concretely, we find that among the three stochastic extensions, Power acquisition performs best in our experiments. We generally keep \(\beta=1\) fixed but also provide ablations for it in the paper.
But the paper goes beyond these empirical observations. It raises important questions about when and why more complex methods might actually be beneficial, and it challenges the use of top-K acquisition as the default baseline. Together with other important work on stochastic acquisition strategies, it points towards stochastic sampling as a simple, strong baseline for batch active learning.
In the paper itself, we examine the assumptions about the underlying score dynamics by tracking acquisition scores across acquisitions. We also hypothesize about the conditions under which top-K acquisition is most detrimental to active learning and about fundamental limitations of BatchBALD’s empirical estimator. That said, we also argue that later in training, top-K acquisition might be less detrimental, so the acquisition size could be increased with less loss in performance.
Why Gumbel Noise? The selection of the \(k\)-th point in the acquisition batch is based on the additional information that the remaining \(K-k\) points will provide. We thus aim to model the maximum over all possible additional candidate points yet to be selected.
Empirically, acquisition scores resemble a truncated exponential distribution (see also Figure 6(c)), and by extreme value theory (EVT), the maximum over many exponential samples is well approximated by a Gumbel distribution: the exponential distribution lies in the Gumbel domain of attraction.
However, the increases in acquisition scores are unlikely to be perfectly modeled as i.i.d. exponential random variables. Still, the Gumbel approximation seems to hold empirically even for the hypoexponential distribution, i.e., a sum of independent exponential random variables with different rates. This motivates using a Gumbel distribution as a simple model for the increase in acquisition scores, under the assumption that the increase at each acquisition step is exponentially distributed with a step-dependent rate.
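A quick simulation (our own, with arbitrary rates, not taken from the paper) illustrates this: the maximum over a pool of hypoexponential samples is closely matched by a fitted Gumbel distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_steps, n_points, n_trials = 5, 2_000, 500

# Each candidate's score increase is hypoexponential: a sum of exponentials
# with a different (arbitrarily chosen) rate per acquisition step.
rates = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
increments = rng.exponential(1.0 / rates, size=(n_trials, n_points, n_steps))
scores = increments.sum(axis=-1)   # hypoexponential score per candidate point
maxima = scores.max(axis=1)        # maximum over the candidate pool per trial

# Fit a Gumbel distribution to the empirical maxima and check the fit.
loc, scale = stats.gumbel_r.fit(maxima)
ks = stats.kstest(maxima, "gumbel_r", args=(loc, scale))
print(f"Gumbel fit: loc={loc:.2f}, scale={scale:.2f}, KS statistic={ks.statistic:.3f}")
```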
For a more detailed explanation and numerical simulations, please refer to our paper; see also Zhan et al. (2022).
Acquisition Asymptotics of Bayesian Models. For well-specified and well-defined Bayesian parametric models, the posterior distribution of the model parameters converges to the true parameters as the number of training points increases.
For such models, and assuming that the predictions are independent given the model parameters, the total correlation between the predictions decreases as the number of training points increases, because the posterior distribution of the model parameters becomes more concentrated around the true parameters: \[\begin{align} \operatorname{TC}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] \to 0 \quad \text{as} \quad |\mathcal{D}_{\text{train}}| \to \infty. \end{align}\] This can be proved by noting that in the infinite data limit, the posterior parameter distribution converges to the true model parameters, and the joint predictive distribution then factorizes. In other words, the predictions become more independent as the number of training points increases and fully independent in the infinite data limit.
The total correlation is defined as: \[\begin{align} &\operatorname{TC}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] \\ &\quad \triangleq \underbrace{\sum_i \operatorname{H}[Y_i | x_i, \mathcal{D}_{\text{train}}]}_{\text{top-K Entropy}} - \underbrace{\operatorname{H}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}]}_{\text{'Batch Entropy'}}. \end{align}\] We can also write the total correlation as the difference between top-K BALD and BatchBALD: \[\begin{align} &\operatorname{TC}[Y_1, \ldots, Y_K | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] \\ &\quad = \underbrace{\sum_i \operatorname{I}[Y_i; \Omega | x_i, \mathcal{D}_{\text{train}}]}_{\text{top-K BALD}} - \underbrace{\operatorname{I}[Y_1, \ldots, Y_K; \Omega | x_1, \ldots, x_K, \mathcal{D}_{\text{train}}]}_{\text{BatchBALD}}. \end{align}\] As the total correlation converges to \(0\), the top-K BALD term (the first term) becomes equal to the BatchBALD term (the second term), and the same happens for top-K Entropy and 'Batch Entropy', which we could define similarly.
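The following toy Monte-Carlo sketch (ours; the Dirichlet draws stand in for posterior-predictive probabilities from real forward passes with sampled parameters) computes top-K BALD, BatchBALD, and their difference, the total correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, C = 100, 3, 4  # MC parameter samples, batch size, number of classes

# Toy posterior predictive p(y_i | x_i, ω_m): shape (M, K, C).
probs = rng.dirichlet(np.ones(C), size=(M, K))

def entropy(p, axis=-1):
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=axis)

# Top-K BALD: sum_i ( H[E_ω p_i] - E_ω H[p_i] ).
marginals = probs.mean(axis=0)  # (K, C)
top_k_bald = entropy(marginals).sum() - entropy(probs).mean(axis=0).sum()

# BatchBALD: H[Y_1..Y_K] - E_ω sum_i H[p_i]. The full joint
# p(y_1..y_K) = E_ω prod_i p(y_i | ω) is feasible for small K and C.
joint = np.ones((M,) + (C,) * K)
for i in range(K):
    shape = [1] * K
    shape[i] = C
    joint = joint * probs[:, i].reshape(M, *shape)
joint = joint.mean(axis=0)
batch_bald = entropy(joint.ravel()) - entropy(probs).sum(axis=1).mean()

tc = top_k_bald - batch_bald  # total correlation, always >= 0
print(f"top-K BALD={top_k_bald:.3f}, BatchBALD={batch_bald:.3f}, TC={tc:.3f}")
```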
Thus, for well-specified and well-defined Bayesian parametric models, the top-K acquisition functions will eventually become equivalent to the BatchBALD and ‘BatchEntropy’ acquisition functions as the number of training points increases. This tells us that top-K acquisition is the most detrimental to active learning in the earlier stages of learning, when the total correlation between the predictions is still high. This is consistent with our empirical results below (‘Increasing Top-K Analysis’).
At the same time, as the number of training points increases and the model parameters concentrate, the expected information gain (BALD) also decreases. The mutual information with a deterministic variable is always \(0\), and thus: \[\begin{align} \operatorname{I}[Y; \Omega | x, \mathcal{D}_{\text{train}}] \to 0 \quad \text{as} \quad |\mathcal{D}_{\text{train}}| \to \infty. \end{align}\] This asymptotic behavior is a trivial but important result: it tells us that the expected information gain (BALD) will eventually become uninformative and no better than random acquisition as the number of training points increases. The important question is: when? Given that we only have noisy estimators, this determines for how long active learning remains useful compared to random acquisition.
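As a sanity check (our own toy illustration, not from the paper), a Beta-Bernoulli model shows the expected information gain for the next label vanishing as the posterior concentrates:

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald(a, b, n_mc=200_000):
    """Monte-Carlo estimate of I[Y; θ] for a Beta(a, b) posterior over θ."""
    theta = rng.beta(a, b, size=n_mc)
    return bernoulli_entropy(theta.mean()) - bernoulli_entropy(theta).mean()

# Posterior after N coin flips with half heads: Beta(1 + N/2, 1 + N/2).
for n in [0, 10, 100, 1_000]:
    print(f"N={n:5d}: BALD ≈ {bald(1 + n / 2, 1 + n / 2):.4f} nats")
```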
Many active learning methods that are considered non-Bayesian nevertheless approximate the expected information gain or the expected predictive information, so these asymptotic arguments apply to them as well.
Finally, we observe that estimators such as BatchBALD, which rely on Monte-Carlo samples for parameter approximations, are inherently limited by the logarithm of the total number \(M\) of these samples, \(\log M\). This constraint implies that their informativeness can diminish rapidly. Concretely, consider an empirical estimator \(\hat{\operatorname{I}}[\cdot;\Omega]\) built using Monte-Carlo samples \(\omega_1, \ldots, \omega_M\). This is analogous to computing the exact mutual information \(\operatorname{I}[\cdot; \hat{\Omega}]\) with the 'empirical' random variable \(\hat{\Omega}\), which samples uniformly from \(\omega_1, \ldots, \omega_M\). Given that the discrete mutual information is bounded by the entropy of either of its arguments, we have: \[ \hat{\operatorname{I}}[\cdot;\Omega] = \operatorname{I}[\cdot; \hat{\Omega}] \le \operatorname{H}[\hat{\Omega}] = \log M. \]
For instance, BatchBALD employs a greedy approach to select the \(i\)-th acquisition sample in its batch by maximizing the empirical \(\hat{\operatorname{I}}[Y ;\Omega \mid x, Y_{i-1}, x_{i-1}, \ldots, Y_1, x_1, \mathcal{D}_{\text{train}}]\) over the candidate samples \(x\). Summing these conditional terms telescopes into the joint estimate, which the bound above caps: \[ \log M \ge \hat{\operatorname{I}}[Y_1, \ldots, Y_K;\Omega \mid x_1, \ldots, x_K, \mathcal{D}_{\text{train}}] = \sum_{i=1}^K \hat{\operatorname{I}}[Y_i ;\Omega \mid x_i, Y_{i-1}, x_{i-1}, \ldots, Y_1, x_1, \mathcal{D}_{\text{train}}]. \] As \(K\) grows, the estimated \(\hat{\operatorname{I}}[Y_K ;\Omega \mid x_K, Y_{K-1}, x_{K-1}, \ldots, Y_1, x_1, \mathcal{D}_{\text{train}}]\) is thus pushed towards zero. For a scenario with \(M = 100\) parameter samples in a classification problem with 10 categories, each maximally disagreeing point can consume up to \(\log 10\) of the \(\log M = 2 \log 10\) budget, so BatchBALD can lose its informativeness after just two acquisitions. This situation arises whenever the pool set contains at least two highly diverse, that is uncorrelated, data points with maximum disagreement.
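A small numerical illustration of this bound (ours; deterministic per-sample predictions model the maximum-disagreement case):

```python
import numpy as np

rng = np.random.default_rng(0)
M, C = 100, 10  # MC parameter samples and classes; log(M) ≈ 4.605 nats

# Maximum-disagreement toy case: given ω_m, each candidate point's prediction
# is a deterministic class, drawn at random per (parameter sample, point).
classes = rng.integers(0, C, size=(M, 8))  # predictions for 8 candidate points

for K in range(1, 9):
    # With deterministic per-ω predictions, the conditional entropy vanishes,
    # so the empirical joint MI equals the entropy of the class-tuple
    # distribution across the M samples, which can never exceed log M.
    _, counts = np.unique(classes[:, :K], axis=0, return_counts=True)
    p = counts / M
    print(f"K={K}: empirical joint MI ≈ {-(p * np.log(p)).sum():.3f} nats")
```

The estimate saturates at \(\log 100 \approx 4.6\) nats after roughly two points, matching the argument above.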
Rank Correlations Across Acquisitions. We made the following (implicit) assumptions to argue why top-K acquisition is flawed:

1. The acquisition scores \(s_t\) at step \(t\) are a proxy for the scores \(s_{t'}\) at a later step \(t' > t\);
2. the larger \(t'-t\) is, the worse a proxy \(s_t\) is for \(s_{t'}\);
3. this effect is largest for the most informative points.
We demonstrate these assumptions empirically by examining the Spearman rank correlation between scores across acquisitions. Specifically, we train a model for \(n\) steps using BALD as a single-point acquisition function and compare the rank order at each step to the starting rank order at step \(t\). To denoise the rankings across \(n\), we smooth the rank correlations with a Parzen window of size 10, and to reduce the effect of noise on the rank order, we round all scores to 2 decimal places. This especially removes unimportant rank changes among points with low scores around 0.
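In code, this analysis could look roughly as follows (a sketch under our assumptions about the data layout; the Parzen window comes from `scipy.signal.windows`):

```python
import numpy as np
from scipy.signal.windows import parzen
from scipy.stats import spearmanr

def rank_correlation_curve(score_history, t=0, decimals=2, window_size=10):
    """Spearman rank correlation of each step's scores against step t.

    score_history: array of shape (n_steps, n_pool) of acquisition scores.
    Rounding suppresses meaningless rank swaps among near-zero scores; a
    normalized Parzen window smooths the curve across acquisition steps.
    """
    scores = np.round(np.asarray(score_history), decimals)
    corrs = np.array([spearmanr(scores[t], s).correlation for s in scores])
    kernel = parzen(window_size)
    return np.convolve(corrs, kernel / kernel.sum(), mode="same")
```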
Figure 1 shows that acquisition scores become less correlated as more points are acquired. Figure 4(a) shows this in more detail for the top and bottom 1%, 10% or 100% of scorers of the test set across acquisitions starting at step \(t=0\) for a model initialised with 20 points. The top-10% scoring points quickly become uncorrelated across acquisitions and even become anti-correlated. In contrast, the points overall correlate well over time (although they have a much weaker training signal on average). This result supports all three of our hypotheses.
At the same time, we see that as training progresses, and we converge towards the best model, the order of scores becomes more stable across acquisitions. In Figure 4(b) the model begins with 120 points (\(t=100\)), rather than 20 (\(t=0\)). Here, the most informative points are less likely to change their rank—even the top-1% ranks do not become anti-correlated, only decorrelated. Thus, we hypothesise that further in training, we might be able to choose larger \(K\).
We do not display the correlations for the bottom 1% of scorers as their scores are close to 0 throughout training and thus noisy and uninformative; see also Figure 6(c).
Overall, this analysis can be confounded by noisy samples and by rank-order swaps between samples with similar scores; such swaps are not meaningful in the sense that they would not influence batch acquisition.
Thus, to provide a different analysis, we consider in Figure 6 the more direct question of how likely it is that other samples have higher acquisition scores at step \(t+n\) than the top samples from step \(t\), for different quantiles (1%, 10%, 50%) of the test set. As a sanity check, we also examine the bottom quantiles. This is equivalent to computing the AUROC between the original top/bottom sets as 'ground truth' and the later scores as predictors: for acquisition step \(n\), the AUROC is \[P(S^{t, \text{top/bottom } p \%}_{n} > S^{t, \text{bottom/top } (1 - p) \%}_{n})\] with \(S^{t, \text{top/bottom } p \%}_{n} = \{s_{t+n, i} : s_{t, i} \in \text{top/bottom } p\% \text{ of } \{s_{t, j}\}_j\}\). Specifically, we set up a binary classification task with the top or bottom 1%, 10% or 50% of the test set as positives and the rest as negatives. This again quantifies how much the scores meaningfully change across acquisition steps. These results match the previous ones and provide another validation of the assumptions above.
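A minimal version of this AUROC computation (ours; assuming scikit-learn) could be:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def top_quantile_auroc(scores_t, scores_tn, p=0.10):
    """AUROC of the step-(t+n) scores at separating the step-t top-p% scorers.

    Pool points in the top p% at step t are the positives; the scores at
    step t+n act as the predictor. The AUROC then equals P(S_top > S_rest),
    up to ties.
    """
    scores_t = np.asarray(scores_t)
    labels = scores_t >= np.quantile(scores_t, 1 - p)
    return roc_auc_score(labels.astype(int), scores_tn)
```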
Increasing Top-\(K\) Analysis. Another way to investigate the effect of top-\(K\) selection is to freeze the acquisition scores during training and then continue single-point 'active learning' as if those were the correct scores. Comparing this to the performance of regular active learning with updated single-point scores allows us to examine how well earlier scores perform as proxies for later scores. We perform this toy experiment on MNIST in Figure 5, showing that freezing scores early on greatly harms performance while doing it later has only a small effect. For frozen scores at a training set size of 20 (73% accuracy, \(t=0\)), the accuracy matches single-acquisition BALD up to a training set size of roughly 40 (dashed orange lines) before diverging to a lower level. But when freezing the scores of a more accurate model, at a training set size of 120 labels (93% accuracy, \(t=100\)), selecting the next fifty points according to those frozen scores performs indistinguishably from step-by-step acquisition (dashed green lines). This result shows that top-\(K\) acquisition hurts less later in training but can negatively affect performance at the beginning of training.
These observations lead us to ask whether we could dynamically change the acquisition size: with smaller acquisition batches at the beginning and larger ones towards the end of active learning. We leave the exploration of this for future work.
In conclusion, our paper explores simple stochastic extensions of standard acquisition functions. These extensions are not only far more efficient than complex batch acquisition methods, but they often match and sometimes even outperform them across a variety of datasets and settings.
We would like to thank the members of OATML for their continued feedback, with special thanks to the anonymous TMLR reviewers, who provided incredibly helpful reviews, and to our action editor.
For more blog posts by OATML in Oxford, check out the group's blog https://oatml.cs.ox.ac.uk/blog.html.