Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty

Authors: Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on various datasets and learning scenarios such as image classification with label noise and image corruption, and model architecture generalization demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66% compared to previous methods while achieving a SOTA 60% test accuracy at a 90% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15% while maintaining SOTA performance.
Researcher Affiliation Academia Kim Jaechul Graduate School of AI, KAIST, Seoul, South Korea. Correspondence to: Chulhee Yun <EMAIL>.
Pseudocode Yes The detailed algorithm for our proposed pruning method is provided in Algorithm 1, Appendix C.
Open Source Code Yes Implementation is available at github/dual-pruning.
Open Datasets Yes Experiments conducted on CIFAR and ImageNet datasets under various learning scenarios verify the superiority of our method. Specifically, on ImageNet-1k, our method reduces the time cost to 66% compared to previous methods while achieving a SOTA performance, 60% test accuracy at the pruning ratio of 90%. On the CIFAR datasets, as illustrated in Figure 1, our method reduces the time cost to just 15% while maintaining SOTA performance. Especially, our proposed method shows notable performance when the dataset contains noise.
Dataset Splits No The paper mentions using CIFAR and ImageNet datasets and reports test accuracy, implying standard splits are used. However, it does not explicitly specify the proportions or methodology for training/validation/test splits (e.g., "80/10/10 split" or "standard train/test split from [Author et al., 2020]"). The text describes batch sizes for training the models on these datasets but not how the datasets themselves were divided for evaluation purposes beyond implicitly using a test set.
Hardware Specification Yes All experiments were conducted using an NVIDIA A6000 GPU.
Software Dependencies No The paper mentions using specific model architectures (ResNet-18, ResNet-34, VGG-16) and optimizers (SGD) along with a cosine annealing scheduler. However, it does not specify the versions of any programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries/solvers that would be necessary to replicate the experiment environment.
Experiment Setup Yes Hyperparameters: For training on CIFAR-10 and CIFAR-100, we train ResNet-18 for 200 epochs with a batch size of 128. An SGD optimizer with momentum of 0.9 and weight decay of 0.0005 is used. The learning rate is initialized at 0.1 and decays with a cosine annealing scheduler. As Zhang et al. (2024) show that a smaller batch size boosts performance at high pruning rates, we also halve the batch size for 80% pruning, and for 90% we reduce it to one-fourth. For ImageNet-1k, ResNet-34 is trained for 90 epochs with a batch size of 256 across all pruning ratios. An SGD optimizer with a momentum of 0.9, a weight decay of 0.0001, and an initial learning rate of 0.1 is used, combined with a cosine annealing scheduler. Calculating the DUAL score requires three parameters T, J, and c_D, denoting the score computation epoch, the length of the sliding window, and a hyperparameter depending on the training dataset, respectively. We fix J at 10 for all experiments. We use (T, J, c_D) = (30, 10, 5.5) for CIFAR-10, (30, 10, 4) for CIFAR-100, and (60, 10, 11) for ImageNet-1k. We first roughly assign c_D based on the size of the initial dataset; considering the relative difficulty of each, we set c_D for CIFAR-100 smaller than that of CIFAR-10. For the ImageNet-1k dataset, which contains 1,281,167 images, the initial dataset is large enough that we do not need to set c_D to a small value in order to intentionally sample easier examples. Also, note that we fix the value of C of the Beta distribution at 15 across all experiments.
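The quoted CIFAR training schedule can be sketched in plain Python. This is an illustrative sketch, not the authors' released code: the cosine rule below is the standard no-restart cosine annealing formula, and the helper names (`batch_size_for_pruning`, `cosine_annealed_lr`) are hypothetical; only the numeric settings (200 epochs, batch size 128, initial learning rate 0.1, and the batch-size halving/quartering at 80%/90% pruning) come from the paper's quoted setup.

```python
import math

# Settings quoted from the paper's CIFAR setup (ResNet-18, SGD with
# momentum 0.9 and weight decay 5e-4). Dict layout is illustrative.
CIFAR_CONFIG = {
    "epochs": 200,
    "base_batch_size": 128,
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "initial_lr": 0.1,
}

def batch_size_for_pruning(base: int, pruning_ratio: float) -> int:
    """Batch-size rule described in the text: halve at 80% pruning,
    quarter at 90% (following Zhang et al., 2024)."""
    if pruning_ratio >= 0.9:
        return base // 4
    if pruning_ratio >= 0.8:
        return base // 2
    return base

def cosine_annealed_lr(epoch: int, total_epochs: int,
                       lr_max: float, lr_min: float = 0.0) -> float:
    """Standard cosine annealing: decays lr_max -> lr_min over training."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )
```

For example, with the quoted settings the learning rate starts at 0.1, reaches 0.05 at the halfway point (epoch 100), and anneals to 0 by epoch 200, while a 90% pruning run would use a batch size of 32 instead of 128.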