Distilling the Knowledge in Data Pruning
Authors: Emanuel Ben Baruch, Adam Botach, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. |
| Researcher Affiliation | Industry | Amazon. Correspondence to: Emanuel Ben Baruch <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations and theorems but does not present any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. It mentions using and expanding the 'DeepCore framework' but does not provide access to their specific modifications. |
| Open Datasets | Yes | Datasets. We perform experiments on four classification datasets: CIFAR-10 (Krizhevsky et al., a) with 10 classes, consists of 50,000 training samples and 10,000 testing samples; SVHN (Netzer et al., 2011) with 10 classes, consists of 73,257 training samples and 26,032 testing samples; CIFAR-100 (Krizhevsky et al., b) with 100 classes, consists of 50,000 training samples and 10,000 testing samples; and ImageNet (Russakovsky et al., 2015) with 1,000 classes, consists of 1.2M training samples and 50K testing samples. |
| Dataset Splits | Yes | Datasets. We perform experiments on four classification datasets: CIFAR-10 (Krizhevsky et al., a) with 10 classes, consists of 50,000 training samples and 10,000 testing samples; SVHN (Netzer et al., 2011) with 10 classes, consists of 73,257 training samples and 26,032 testing samples; CIFAR-100 (Krizhevsky et al., b) with 100 classes, consists of 50,000 training samples and 10,000 testing samples; and ImageNet (Russakovsky et al., 2015) with 1,000 classes, consists of 1.2M training samples and 50K testing samples. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. It only mentions the model architectures used (ResNet-32, ResNet-50). |
| Software Dependencies | No | The paper mentions using 'SGD with Momentum' for optimization and a 'modified version of the RepDistiller framework (Tian et al., 2019)' but does not specify version numbers for these or other software libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | The models are trained for 240 epochs with a batch size of 64. For the optimization process we use SGD with learning rate 0.05, momentum value of 0.9 and weight decay of 5e-4. The learning rate is decreased by a factor of 10 on the 150th, 180th and 210th epochs. To conduct the distillation experiments on ImageNet... The models are trained for 240 epochs with a batch size of 128. We utilize SGD with learning rate 0.1, momentum value of 0.9 and weight decay of 5e-4. The learning rate is gradually decayed during training using a cosine-annealing scheduler (Loshchilov & Hutter, 2017). In all of our distillation experiments we use τ = 4 as the temperature for the KD's soft predictions computation in Equation (1). |
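The setup above specifies a temperature of τ = 4 for the KD soft-prediction term and ties the KD weight to the pruning factor. The paper's exact Equation (1) is not reproduced here, so the following is only a minimal NumPy sketch of a standard Hinton-style distillation objective under those assumptions: the function names (`kd_loss`, `total_loss`) and the weight parameter `alpha` are hypothetical placeholders, not the authors' code.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by tau**2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, tau)  # soft teacher targets
    q = softmax(student_logits, tau)  # soft student predictions
    return tau**2 * np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def total_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=4.0):
    """Convex combination of hard-label cross-entropy and the KD term.
    `alpha` stands in for the KD weight the paper links to the pruning
    factor; its exact schedule is not specified in this excerpt."""
    q = softmax(student_logits)
    ce = -np.log(q[np.arange(len(labels)), labels])
    return (1.0 - alpha) * ce + alpha * kd_loss(student_logits, teacher_logits, tau)
```

With identical student and teacher logits the KD term vanishes, so only the cross-entropy on hard labels remains; this is a quick sanity check for the implementation.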