Optimizing Data Collection for Machine Learning
Authors: Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs. We evaluate LOC over classification, segmentation, and detection tasks to show, on average, approximately a 2× reduction in the chances of failing to meet performance targets, versus estimation baselines. |
| Researcher Affiliation | Collaboration | Rafid Mahmood (1,2), James Lucas (1), Jose M. Alvarez (1), Sanja Fidler (1,3,4), Marc T. Law (1). Affiliations: 1: NVIDIA; 2: University of Ottawa, Ottawa, Canada; 3: University of Toronto, Toronto, Canada; 4: Vector Institute, Toronto, Canada |
| Pseudocode | Yes | Algorithm 1: Naïve Estimation of the Data Requirement; Algorithm 2: Estimating the Data Requirement Distribution F(q); Algorithm 3: Optimal Data Collection via LOC |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository or an explicit statement about releasing the code for the described methodology. It mentions a license for the paper itself, but not for the code. |
| Open Datasets | Yes | We explore three tasks for K = 1. First, we consider classification on CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009)... We explore semantic segmentation using Deeplabv3 (Chen et al., 2018) on BDD100K (Yu et al., 2020), as well as Bird's-Eye-View (BEV) segmentation on nuScenes (Caesar et al., 2020)... Finally, we explore 2-D object detection on PASCAL VOC (Everingham et al., 2007, 2012). |
| Dataset Splits | Yes | We initialize with q0 = 10% of the full data set (we use 20% for VOC). In data collection, we create five subsets containing 2%, 4%, ..., 10% of the training data, five subsets containing 12%, 14%, ..., 20% of the training data, and eight subsets containing 30%, 40%, ..., 100% of the data. We use the original data set split from Yu et al. (2020) with 7,000 and 1,000 data points in the train and validation sets respectively. For the labeled set D1, we create subsets with 5%, 10%, 15%, 20%, 40%, 60%, 80%, 100% of the data. For the unlabeled set D2, we create subsets with 0%, 10%, 25%, 50%, 100% of the data. |
| Hardware Specification | Yes | All models were implemented using PyTorch and trained on machines with up to eight NVIDIA V100 GPU cards. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number for this or any other software dependency. |
| Experiment Setup | Yes | We provide a summary of parameters in Table 6. Parameter settings: Optimizer: GD with momentum (β = 0.9), Adam (β1, β2 = 0.9, 0.999); Learning rate: 0.005, ..., 500; Number of bootstrap samples B: 500; KDE bandwidth: 20,000, ..., 20,000,000 for ImageNet; GMM number of clusters: 4, ..., 10. |
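The baseline the quotes describe — estimating a data requirement by extrapolating a neural scaling law, then bootstrapping (B = 500 in the paper) to get a distribution over that requirement — can be sketched as below. This is an illustrative reconstruction, not the authors' code: the power-law error model `a * n^(-b)`, the log-log least-squares fit, and all function names are assumptions.

```python
import numpy as np

def fit_power_law(ns, errors):
    """Fit error(n) ~= a * n**(-b) by linear least squares in log-log space."""
    ns = np.asarray(ns, dtype=float)
    X = np.vstack([np.ones_like(ns), np.log(ns)]).T  # columns: intercept, log n
    coef, *_ = np.linalg.lstsq(X, np.log(errors), rcond=None)
    log_a, neg_b = coef
    return np.exp(log_a), -neg_b

def required_n(a, b, target_error):
    """Solve a * n**(-b) = target_error for n (the extrapolated data requirement)."""
    return (a / target_error) ** (1.0 / b)

def bootstrap_requirements(ns, errors, target_error, B=500, seed=0):
    """Resample (n, error) pairs with replacement and refit, yielding an
    empirical distribution of data-requirement estimates."""
    rng = np.random.default_rng(seed)
    ns = np.asarray(ns, dtype=float)
    errors = np.asarray(errors, dtype=float)
    reqs = []
    for _ in range(B):
        idx = rng.integers(0, len(ns), size=len(ns))
        a, b = fit_power_law(ns[idx], errors[idx])
        reqs.append(required_n(a, b, target_error))
    return np.array(reqs)

# Usage on synthetic scaling data: error = 2 * n**-0.5.
ns = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
errors = 2.0 * ns ** -0.5
a, b = fit_power_law(ns, errors)
reqs = bootstrap_requirements(ns, errors, target_error=0.02, B=200)
# A risk-averse collector would take a high quantile of `reqs`
# rather than the point estimate required_n(a, b, 0.02).
```

Taking an upper quantile of the bootstrap distribution, instead of the single extrapolated point, is what lets this kind of procedure trade extra collection cost for a lower chance of missing the target.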