-
CoLoR-Filter: Selecting Data for Language Model Pre-training
Pre-training language models requires massive amounts of data, yet not all data contributes equally to model performance. As models and datasets continue to grow in size, identifying and selecting the most valuable training examples has become increasingly critical.
-
Challenging Complexity: Stochastic Acquisition for Efficient Deep Batch Active Learning
“Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning” has been published in the Transactions on Machine Learning Research (TMLR). The paper challenges the status quo in deep active learning for medium-to-large acquisition batch sizes (“more than a few, but not that many”: 10–1000 acquisitions, depending on the dataset) and examines an efficient approach to batch acquisition through simple stochastic extensions of standard acquisition functions.
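To make “stochastic extension” concrete: the simplest version replaces the deterministic top-k over per-point acquisition scores with a noisy top-k. The sketch below illustrates one such variant under the assumption that per-point scores (e.g. entropy or BALD) are already available; the function and argument names are ours, and the paper studies several variants, so treat this as an illustration rather than the reference implementation.

```python
import numpy as np

def stochastic_batch_acquisition(scores, batch_size, beta=1.0, rng=None):
    """Sample an acquisition batch by perturbing scores instead of taking a plain top-k.

    Adding Gumbel(0, 1) noise to beta * log(score) and taking the top-k is the
    Gumbel-top-k trick: it draws a batch without replacement with probabilities
    proportional to score**beta. `beta` acts as an inverse temperature.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=np.float64)
    # Small epsilon avoids log(0) for points with zero acquisition score.
    perturbed = beta * np.log(scores + 1e-12) + rng.gumbel(size=scores.shape)
    # Indices of the sampled batch: the points with the highest perturbed scores.
    return np.argsort(-perturbed)[:batch_size]

# Example: scores from any single-point acquisition function on an unlabeled pool.
pool_scores = np.random.rand(10_000)
batch_indices = stochastic_batch_acquisition(pool_scores, batch_size=100)
```

Because the per-point scores only need to be computed once per round, this costs essentially the same as plain top-k selection.
-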
Black-Box Batch Active Learning for Regression (B3AL)
Training machine learning models can require massive labeled datasets. Active learning aims to reduce labeling costs by selecting only the most informative samples to label. But how well do prediction-focused black-box techniques compare to parameter-focused white-box methods?
-
Unifying Approaches in Active Learning and Active Sampling
Our paper “Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities” was recently published in TMLR.
-
Assessing Generalization via Disagreement
Our paper “A Note on ‘Assessing Generalization of SGD via Disagreement’” was published in TMLR this week and serves as both a short reproduction and a review note. It engages with the claims in “Assessing Generalization of SGD via Disagreement” by Jiang et al. (2022), which received an ICLR 2022 spotlight. We would like to thank the authors for constructively engaging with our note on OpenReview.
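For context, the headline claim of Jiang et al. (2022) is, roughly, that the rate at which two networks trained from different random seeds disagree on unlabeled data tracks their test error, so disagreement can serve as a label-free generalization estimate. A minimal sketch of that disagreement estimate (the function name is ours) is:

```python
import numpy as np

def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two independently trained models predict different labels.

    Under the generalization-disagreement claim studied by Jiang et al. (2022), this
    unlabeled-data quantity is argued to approximate the test error of either model.
    `preds_a` and `preds_b` are predicted class labels on the same unlabeled inputs.
    """
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Example with dummy predictions from two training runs that differ only in random seed.
print(disagreement_rate([0, 1, 2, 1], [0, 2, 2, 1]))  # -> 0.25
```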