Adaptive Data Collection for Robust Learning Across Multiple Distributions

Authors: Chengbo Zang, Mehmet Kerem Turkcan, Gil Zussman, Zoran Kostic, Javad Ghaderi

ICML 2025

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental
"Extensive evaluations on standard datasets and a real-world testbed for object detection in smart-city intersections validate the consistent performance improvements of our method compared to baselines such as random sampling and various active learning methods." "In this section, we present the experimental results of the three algorithms described in Section 3, compared to other state-of-the-art AL (active learning) algorithms."
Researcher Affiliation: Academia
"1 Department of Electrical Engineering, Columbia University, New York, NY, USA; 2 Department of Civil Engineering, Columbia University, New York, NY, USA. Correspondence to: Chengbo Zang <EMAIL>, Javad Ghaderi <EMAIL>."
Pseudocode: Yes
"Algorithm 1 General Framework of Online Optimization with Adaptive Data Collection
Require: total training rounds T, batch size M, randomly initialized θ_1
1: X_0 ← ∅
2: for t = 1, 2, ..., T do
3:   k_t ← SELECT(θ_t, X_{t-1})
4:   B_t ← {X_1, ..., X_M ~ D_{k_t}}
5:   X_t ← X_{t-1} ∪ B_t
6:   θ_{t+1} ← UPDATE(θ_t, X_t, k_t)
7: end for"
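Algorithm 1 can be sketched in a few lines of Python. This is an illustrative skeleton only: `select`, `sample`, and `update` are hypothetical callables standing in for the paper's SELECT and UPDATE subroutines and the sampling oracle D_{k_t}, which the paper instantiates per task.

```python
def adaptive_collection(select, sample, update, theta, T, M):
    """Sketch of Algorithm 1: each round, SELECT a data source k_t,
    draw a batch of M samples from D_{k_t}, and UPDATE the model on
    the accumulated pool."""
    pool = []                                  # X_0 <- empty pool
    for t in range(1, T + 1):
        k = select(theta, pool)                # k_t <- SELECT(theta_t, X_{t-1})
        batch = [sample(k) for _ in range(M)]  # B_t = {X_1, ..., X_M ~ D_{k_t}}
        pool = pool + batch                    # X_t <- X_{t-1} union B_t
        theta = update(theta, pool, k)         # theta_{t+1} <- UPDATE(theta_t, X_t, k_t)
    return theta, pool
```

Any concrete instantiation supplies a source-selection rule, a per-source sampler, and a model-update step; the loop itself is task-agnostic.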
Open Source Code: No
The paper does not provide a link to source code, nor does it state that code will be released in supplementary materials or upon publication.
Open Datasets: Yes
"We perform image classification on the CIFAR10 dataset (Krizhevsky et al., 2009) with a budget of 10,000 images, where every class is a data source." "We also report the results on the MNIST dataset (LeCun et al., 1998) to test different optimizer configurations and get more insight into the distribution of collected samples from different classes under different algorithms." "We perform object detection on the PASCAL VOC2012 dataset (Everingham et al.) with a budget of 3,000 images." "We perform a simple Visual Question Answering (VQA) task under a budget of 1,000 question-answer pairs from the VQAv2 dataset (Antol et al., 2015)."
Dataset Splits: Yes
"Each algorithm executes 1,000 rounds and collects a batch of 8 samples every 4 rounds under a total budget of 2,000 training images." "All AL algorithms are given an initial labeled pool of 1,000 samples (10% of the budget) and proceed to collect 3,000 samples in each episode from the remaining dataset for three episodes." "For the multi-class object detection task... The MDN algorithm is given an initial labeled pool of 600 samples (20% of the budget) and proceeds to collect 800 samples per episode for three episodes." "We fix the total budget of 3,000 samples and change the number of samples allocated during each episode."
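The schedules above can be sanity-checked against the stated budgets. A minimal arithmetic sketch (the helper name is ours; the numbers come from the excerpts above and from the dataset budgets):

```python
def total_collected(initial_pool, per_episode, episodes):
    """Total labeled samples after an episodic collection schedule."""
    return initial_pool + per_episode * episodes

# Round-based schedule: a batch of 8 samples every 4 rounds, 1,000 rounds.
rounds, batch, every = 1_000, 8, 4
assert (rounds // every) * batch == 2_000          # 2,000-image budget

# AL baselines on CIFAR10: 1,000 initial + 3 episodes of 3,000 each.
assert total_collected(1_000, 3_000, 3) == 10_000  # 10,000-image budget

# MDN on VOC2012: 600 initial + 3 episodes of 800 each.
assert total_collected(600, 800, 3) == 3_000       # 3,000-image budget
```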
Hardware Specification: No
The acknowledgements mention "compute resources from NVIDIA Academic Grant Edge AI for Equitable and Safe Intersections in Urban Metropolises", but no specific GPU or CPU models are provided for the experiments.
Software Dependencies: No
The paper mentions optimizers (Adam, SGD) and model architectures (VGG16, SSD300, YOLOv8, SmolVLM-256M-Base), but does not provide version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup: Yes
"We also observe from Figure 1(b) that the Adam optimizer with a cosine-annealing learning rate scheduler (LRS) and L2 regularization (Reg) provides the smoothest trajectory, which we adopt for the following experiments." "For the optimization step (Line 6 of Algorithm 1), we consider Online Gradient Descent (OGD) (Hazan, 2016). Recall that k_t is the data source selected for the current round t... where η_t := 1/(2L√t) is the learning rate." CIFAR10: "We execute our algorithms for 20,000 rounds and collect a batch of 32 samples every 60 rounds until reaching the budget." PASCAL VOC2012: "We execute our algorithms by pretraining for 10,000 rounds (freezing the backbone), collecting a batch of 8 samples every 50 rounds. Then we finetune for 20,000 rounds, collecting a batch of 8 samples every 100 rounds until reaching the budget."