-
CoLoR-Filter: Selecting Data for Language Model Pre-training
Pre-training language models requires massive amounts of data, yet not all data contributes equally to model performance. As models and datasets continue to grow in size, identifying and selecting the most valuable training examples has become increasingly critical.
-
Challenging Complexity: Stochastic Acquisition for Efficient Deep Batch Active Learning
“Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning” has been published in the Transactions on Machine Learning Research (TMLR). The paper challenges the status quo in deep active learning for medium-to-large acquisition batch sizes (“more than a few, but not that many”: 10–1000 acquisitions, depending on the dataset) and examines an efficient approach to batch acquisition through simple stochastic extensions of standard acquisition functions.
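To make “stochastic extension” concrete: the simplest version replaces the deterministic top-k over per-point acquisition scores with a noisy top-k. The sketch below illustrates one such variant under the assumption that per-point scores (e.g. entropy or BALD) are already available; the function and argument names are ours, and the paper studies several variants, so treat this as an illustration rather than the reference implementation.

```python
import numpy as np

def stochastic_batch_acquisition(scores, batch_size, beta=1.0, rng=None):
    """Sample an acquisition batch by perturbing scores instead of taking a plain top-k.

    Adding Gumbel(0, 1) noise to beta * log(score) and taking the top-k is the
    Gumbel-top-k trick: it draws a batch without replacement with probabilities
    proportional to score**beta. `beta` acts as an inverse temperature.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=np.float64)
    # Small epsilon avoids log(0) for points with zero acquisition score.
    perturbed = beta * np.log(scores + 1e-12) + rng.gumbel(size=scores.shape)
    # Indices of the sampled batch: the points with the highest perturbed scores.
    return np.argsort(-perturbed)[:batch_size]

# Example: scores from any single-point acquisition function on an unlabeled pool.
pool_scores = np.random.rand(10_000)
batch_indices = stochastic_batch_acquisition(pool_scores, batch_size=100)
```

Because the per-point scores only need to be computed once per round, this costs essentially the same as plain top-k selection.
-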
Black-Box Batch Active Learning for Regression (B3AL)
Training machine learning models can require massive labeled datasets. Active learning aims to reduce labeling costs by selecting only the most informative samples to label. But how well do prediction-focused black-box techniques compare to parameter-focused white-box methods?
-
Unifying Approaches in Active Learning and Active Sampling
Our paper “Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities” was recently published in TMLR.
-
Assessing Generalization via Disagreement
Our paper “A Note on ‘Assessing Generalization of SGD via Disagreement’” was published in TMLR this week and serves as both a short reproduction and a review note. It engages with the claims in “Assessing Generalization of SGD via Disagreement” by Jiang et al. (2022), which received an ICLR 2022 spotlight. We would like to thank the authors for constructively engaging with our note on OpenReview.
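For context, the headline claim of Jiang et al. (2022) is, roughly, that the rate at which two networks trained from different random seeds disagree on unlabeled data tracks their test error, so disagreement can serve as a label-free generalization estimate. A minimal sketch of that disagreement estimate (the function name is ours) is:

```python
import numpy as np

def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two independently trained models predict different labels.

    Under the generalization-disagreement claim studied by Jiang et al. (2022), this
    unlabeled-data quantity is argued to approximate the test error of either model.
    `preds_a` and `preds_b` are predicted class labels on the same unlabeled inputs.
    """
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Example with dummy predictions from two training runs that differ only in random seed.
print(disagreement_rate([0, 1, 2, 1], [0, 2, 2, 1]))  # -> 0.25
```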