Optimizing Data Collection for Machine Learning
Authors: Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs. We evaluate LOC over classification, segmentation, and detection tasks to show, on average, approximately a 2× reduction in the chances of failing to meet performance targets, versus estimation baselines. |
| Researcher Affiliation | Collaboration | Rafid Mahmood (1,2), James Lucas (1), Jose M. Alvarez (1), Sanja Fidler (1,3,4), Marc T. Law (1). Affiliations: 1: NVIDIA; 2: University of Ottawa, Ottawa, Canada; 3: University of Toronto, Toronto, Canada; 4: Vector Institute, Toronto, Canada |
| Pseudocode | Yes | Algorithm 1: Naïve Estimation of the Data Requirement; Algorithm 2: Estimating the Data Requirement Distribution F(q); Algorithm 3: Optimal Data Collection via LOC |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository or an explicit statement about releasing the code for the described methodology. It mentions a license for the paper itself, but not for the code. |
| Open Datasets | Yes | We explore three tasks for K = 1. First, we consider classification on CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009)... We explore semantic segmentation using Deeplabv3 (Chen et al., 2018) on BDD100K (Yu et al., 2020), as well as Bird's-Eye-View (BEV) segmentation on nuScenes (Caesar et al., 2020)... Finally, we explore 2-D object detection on PASCAL VOC (Everingham et al., 2007, 2012). |
| Dataset Splits | Yes | We initialize with q0 = 10% of the full data set (we use 20% for VOC). In data collection, we create five subsets containing 2%, 4%, ..., 10% of the training data, five subsets containing 12%, 14%, ..., 20% of the training data, and eight subsets containing 30%, 40%, ..., 100% of the data. We use the original data set split from Yu et al. (2020) with 7,000 and 1,000 data points in the train and validation sets respectively. For the labeled set D1, we create subsets with 5%, 10%, 15%, 20%, 40%, 60%, 80%, 100% of the data. For the unlabeled set D2, we create subsets with 0%, 10%, 25%, 50%, 100% of the data. |
| Hardware Specification | Yes | All models were implemented using PyTorch and trained on machines with up to eight NVIDIA V100 GPU cards. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number for this or any other software dependency. |
| Experiment Setup | Yes | We provide a summary of parameters in Table 6. Parameter settings: Optimizer: GD with momentum (β = 0.9), Adam (β1, β2 = 0.9, 0.999); Learning rate: 0.005, ..., 500; Number of bootstrap samples B: 500; KDE bandwidth: 20,000, ..., 20,000,000 for ImageNet; GMM number of clusters: 4, ..., 10. |
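The baseline the quotes describe — estimating a data requirement by extrapolating a neural scaling law, then bootstrapping (B = 500 in the paper) to get a distribution over that requirement — can be sketched as below. This is an illustrative reconstruction, not the authors' code: the power-law error model `a * n^(-b)`, the log-log least-squares fit, and all function names are assumptions.

```python
import numpy as np

def fit_power_law(ns, errors):
    """Fit error(n) ~= a * n**(-b) by linear least squares in log-log space."""
    ns = np.asarray(ns, dtype=float)
    X = np.vstack([np.ones_like(ns), np.log(ns)]).T  # columns: intercept, log n
    coef, *_ = np.linalg.lstsq(X, np.log(errors), rcond=None)
    log_a, neg_b = coef
    return np.exp(log_a), -neg_b

def required_n(a, b, target_error):
    """Solve a * n**(-b) = target_error for n (the extrapolated data requirement)."""
    return (a / target_error) ** (1.0 / b)

def bootstrap_requirements(ns, errors, target_error, B=500, seed=0):
    """Resample (n, error) pairs with replacement and refit, yielding an
    empirical distribution of data-requirement estimates."""
    rng = np.random.default_rng(seed)
    ns = np.asarray(ns, dtype=float)
    errors = np.asarray(errors, dtype=float)
    reqs = []
    for _ in range(B):
        idx = rng.integers(0, len(ns), size=len(ns))
        a, b = fit_power_law(ns[idx], errors[idx])
        reqs.append(required_n(a, b, target_error))
    return np.array(reqs)

# Usage on synthetic scaling data: error = 2 * n**-0.5.
ns = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
errors = 2.0 * ns ** -0.5
a, b = fit_power_law(ns, errors)
reqs = bootstrap_requirements(ns, errors, target_error=0.02, B=200)
# A risk-averse collector would take a high quantile of `reqs`
# rather than the point estimate required_n(a, b, 0.02).
```

Taking an upper quantile of the bootstrap distribution, instead of the single extrapolated point, is what lets this kind of procedure trade extra collection cost for a lower chance of missing the target.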