STAFF: Speculative Coreset Selection for Task-Specific Fine-tuning

Authors: Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Tianlin Li, Weipeng Jiang, Yang Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates."
Researcher Affiliation | Academia | 1 Xi'an Jiaotong University, 2 University of Massachusetts, Amherst, 3 Nanyang Technological University. {EMAIL, chaoshen@xjtu, EMAIL}.edu.cn; EMAIL; {EMAIL, yangliu@ntu}.edu.sg
Pseudocode | Yes | "Algorithm 1: STAFF for Coreset Selection"
Open Source Code | Yes | "Our code is publicly available at https://github.com/shiningrain/STAFF." "To follow the Open Science Policy and support reproducibility, we have released code about our implementations and evaluations. All resources are available at https://github.com/shiningrain/STAFF."
Open Datasets | Yes | "We evaluate STAFF on three datasets on different downstream tasks, namely, the BioInstruct dataset (Tran et al., 2024) (biology question-answering), the DialogSum dataset (Chen et al., 2021) (dialogue summarization), and the Kazakh-English subset of the WMT-19 dataset (Barrault et al., 2019) (translation of minority languages)."
Dataset Splits | Yes | "In the experiment, we divided each dataset into the training set and the test set according to a ratio of 9:1."
Hardware Specification | Yes | "All fine-tuning experiments are conducted on one NVIDIA RTX A6000 GPU."
Software Dependencies | No | While the paper mentions software such as LoRA for fine-tuning and a fine-tuning framework, it does not provide specific version numbers for these components or for any other libraries.
Experiment Setup | Yes | "We set fine-tuning budget T in selection to 3 and K to 50. The number of samples used in verification for each bin (bv) is 10. For fine-tuning pre-trained models on three datasets of downstream tasks, we perform a grid search over learning rates {1e-5, 2e-5, 1e-4, 2e-4} and batch sizes {2, 4, 8}. We opt for a fixed number of epochs (e.g., 4 epochs) in all experiments. Table 5 provides specific learning rates for each model on different datasets."
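The 9:1 train/test split reported under Dataset Splits can be reproduced with a simple shuffle-and-cut. The helper below is an illustrative sketch only; the function name, seed, and list-based data representation are our assumptions, not the authors' code.

```python
import random

def split_dataset(samples, train_ratio=0.9, seed=0):
    """Shuffle a list of samples and split it into train/test sets (9:1 by default)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_dataset(range(1000))
print(len(train_set), len(test_set))  # 900 100
```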
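The hyperparameter search in the Experiment Setup row (learning rates {1e-5, 2e-5, 1e-4, 2e-4}, batch sizes {2, 4, 8}, 4 fixed epochs) amounts to a 12-configuration grid. A minimal sketch of that search loop follows, with `evaluate` standing in for an actual fine-tune-then-validate run; the callback and its signature are assumptions for illustration.

```python
import itertools

LEARNING_RATES = [1e-5, 2e-5, 1e-4, 2e-4]
BATCH_SIZES = [2, 4, 8]
EPOCHS = 4  # fixed across all runs, per the paper

def grid_search(evaluate):
    """Try every (lr, batch_size) pair and return the best-scoring configuration."""
    best_score, best_cfg = float("-inf"), None
    for lr, bs in itertools.product(LEARNING_RATES, BATCH_SIZES):
        score = evaluate(lr=lr, batch_size=bs, epochs=EPOCHS)
        if score > best_score:
            best_score, best_cfg = score, (lr, bs)
    return best_cfg, best_score

# Placeholder objective for illustration; a real run would fine-tune the model
# and return validation performance here.
cfg, score = grid_search(lambda lr, batch_size, epochs: -abs(lr - 2e-5) - abs(batch_size - 4))
```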