Exploring Learning Complexity for Efficient Downstream Dataset Pruning

Authors: Wenyu Jiang, Zhenlong Liu, Zejian Xie, Songxin Zhang, Bingyi Jing, Hongxin Wei

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments with downstream image and instruction dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. In the image pruning benchmark, DLC significantly reduces the pruning time by 35x while establishing state-of-the-art performance with FlexRand.
Researcher Affiliation Academia (1) Department of Statistics and Data Science, Southern University of Science and Technology; (2) State Key Laboratory for Novel Software Technology, Nanjing University
Pseudocode No The paper describes methods and formulas but does not include a clearly labeled pseudocode or algorithm block with structured steps.
Open Source Code Yes The code is available in the supplementary material.
Open Datasets Yes Therefore, we choose diverse downstream datasets from 5 domains (Islam et al., 2021) to construct the large-scale benchmark, including CXRB102, DeepWeeds (Olsen et al., 2019), DTD (Cimpoi et al., 2014), FGVC-Aircraft (Maji et al., 2013), and Sketch (Eitz et al., 2012). For hyperparameter tuning, we split 20% as the validation set. ... Alpaca Cleaned (Taori et al., 2023) and Dolly & HH-RLHF.
Dataset Splits Yes For hyperparameter tuning, we split 20% as the validation set. ... We prune the downstream datasets at 9 pruning ratios, ranging from 10% to 90%, for a thorough verification and comparison. For example, we remove 10% of each category in the original dataset when the pruning ratio is 10%.
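The quoted split protocol prunes each category independently, so class balance is preserved at every pruning ratio. A minimal stratified-subsampling sketch of that idea, assuming a pruning ratio of p means removing fraction p of every class (function and parameter names are illustrative, not from the paper's code):

```python
import random
from collections import defaultdict

def stratified_prune(labels, prune_ratio, seed=0):
    """Return the kept sample indices after removing `prune_ratio`
    of each class, so class proportions are preserved.

    Illustrative sketch only; the paper's released code may differ.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    kept = []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # random subset within each class
        n_keep = int(round(len(idxs) * (1.0 - prune_ratio)))
        kept.extend(idxs[:n_keep])
    return sorted(kept)
```

For example, with 10 samples per class and a 10% pruning ratio, 9 samples per class are kept. The paper's DLC method would replace the random within-class selection with a ranking by learning complexity.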
Hardware Specification Yes The code is based on PyTorch and all the experiments run on NVIDIA L40 GPUs.
Software Dependencies No To ensure reliable reproduction, we have run the compared baselines using the DeepCore (Guo et al., 2022) library. The code is based on PyTorch (Paszke et al., 2019). ... Regarding the pre-trained models and instruction datasets, we use the Hugging Face library. Fine-tuning is based on the PEFT library, and evaluation is based on the LM-Eval library.
Experiment Setup Yes Fine-tuning. We sequentially attach a linear layer on top of the pre-trained encoder for the downstream image classification. Then, the above classifier is fully trained on the pruned dataset for 50 epochs using SGD with a momentum of 0.9, a weight decay of 1e-5, and a batch size of 128. The initial learning rate is 1e-3 and decays by a factor of 10 at the 25th and 37th epochs. ... We fine-tune the base model for 3 epochs using SGD with a batch size of 32, a momentum of 0.9, a learning rate of 7e-6 scheduled by a cosine function, and a weight decay of 0.01. Note that the learning rate increases linearly at the warmup stage (the first 100 steps).
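The quoted setup describes two learning-rate schedules: a step decay for the image classifier and a warmup-plus-cosine schedule for instruction tuning. A minimal sketch of both, using the hyperparameters quoted above (the `total_steps` argument and function names are assumptions for illustration; in PyTorch these would typically be `MultiStepLR` and a warmup-wrapped cosine scheduler):

```python
import math

def image_lr(epoch, base_lr=1e-3, milestones=(25, 37), gamma=0.1):
    """Step schedule for the image experiments: start at 1e-3 and
    decay by a factor of 10 at the 25th and 37th epochs."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

def llm_lr(step, total_steps, base_lr=7e-6, warmup_steps=100):
    """Instruction-tuning schedule: linear warmup over the first
    100 steps, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under this sketch, the image learning rate is 1e-3 for epochs 0-24, 1e-4 for epochs 25-36, and 1e-5 afterward, while the LLM rate ramps to 7e-6 by step 100 and then follows a cosine curve.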