Distilling the Knowledge in Data Pruning
Authors: Emanuel Ben Baruch, Adam Botach, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. |
| Researcher Affiliation | Industry | Amazon. Correspondence to: Emanuel Ben Baruch <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations and theorems but does not present any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. It mentions using and expanding the 'DeepCore framework' but does not provide access to their specific modifications. |
| Open Datasets | Yes | Datasets. We perform experiments on four classification datasets: CIFAR-10 (Krizhevsky et al., a) with 10 classes, consists of 50,000 training samples and 10,000 testing samples; SVHN (Netzer et al., 2011) with 10 classes, consists of 73,257 training samples and 26,032 testing samples; CIFAR-100 (Krizhevsky et al., b) with 100 classes, consists of 50,000 training samples and 10,000 testing samples; and ImageNet (Russakovsky et al., 2015) with 1,000 classes, consists of 1.2M training samples and 50K testing samples. |
| Dataset Splits | Yes | Datasets. We perform experiments on four classification datasets: CIFAR-10 (Krizhevsky et al., a) with 10 classes, consists of 50,000 training samples and 10,000 testing samples; SVHN (Netzer et al., 2011) with 10 classes, consists of 73,257 training samples and 26,032 testing samples; CIFAR-100 (Krizhevsky et al., b) with 100 classes, consists of 50,000 training samples and 10,000 testing samples; and ImageNet (Russakovsky et al., 2015) with 1,000 classes, consists of 1.2M training samples and 50K testing samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. It only mentions the model architectures used (ResNet-32, ResNet-50). |
| Software Dependencies | No | The paper mentions using 'SGD with Momentum' for optimization and a 'modified version of the RepDistiller framework (Tian et al., 2019)' but does not specify version numbers for these or other software libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | The models are trained for 240 epochs with a batch size of 64. For the optimization process we use SGD with learning rate 0.05, momentum value of 0.9 and weight decay of 5e-4. The learning rate is decreased by a factor of 10 on the 150th, 180th and 210th epochs. To conduct the distillation experiments on ImageNet... The models are trained for 240 epochs with a batch size of 128. We utilize SGD with learning rate 0.1, momentum value of 0.9 and weight decay of 5e-4. The learning rate is gradually decayed during training using a cosine-annealing scheduler (Loshchilov & Hutter, 2017). In all of our distillation experiments we use τ = 4 as the temperature for the KD's soft predictions computation in Equation (1). |
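The setup above specifies a temperature of τ = 4 for the KD soft-prediction term and ties the KD weight to the pruning factor. The paper's exact Equation (1) is not reproduced here, so the following is only a minimal NumPy sketch of a standard Hinton-style distillation objective under those assumptions: the function names (`kd_loss`, `total_loss`) and the weight parameter `alpha` are hypothetical placeholders, not the authors' code.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by tau**2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, tau)  # soft teacher targets
    q = softmax(student_logits, tau)  # soft student predictions
    return tau**2 * np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def total_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=4.0):
    """Convex combination of hard-label cross-entropy and the KD term.
    `alpha` stands in for the KD weight the paper links to the pruning
    factor; its exact schedule is not specified in this excerpt."""
    q = softmax(student_logits)
    ce = -np.log(q[np.arange(len(labels)), labels])
    return (1.0 - alpha) * ce + alpha * kd_loss(student_logits, teacher_logits, tau)
```

With identical student and teacher logits the KD term vanishes, so only the cross-entropy on hard labels remains; this is a quick sanity check for the implementation.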