DELIFT: Data Efficient Language model Instruction Fine-Tuning
Authors: Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevsky
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across multiple datasets and model scales show DELIFT reduces fine-tuning data requirements by up to 70% without compromising performance, consistently outperforming existing methods by up to 26% in effectiveness and efficiency. |
| Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign, 2IBM Research |
| Pseudocode | Yes | Algorithm 1 Greedy Maximization for Submodular Function |
| Open Source Code | Yes | Our complete code base is publicly available at https://github.com/agarwalishika/delift, enabling further exploration and replication. |
| Open Datasets | Yes | Datasets. We group the datasets by the primary goal of fine-tuning, ensuring a clear mapping from the data to the corresponding submodular objective. In particular: 1. Instruction Tuning: Mix-Instruct (Jiang et al., 2023), P3 (Sanh et al., 2021). Both aim to enhance general instruction-following behavior, featuring a variety of task prompts and user requests. 2. Task-Specific Fine-Tuning: HotpotQA (Yang et al., 2018) aligned with MMLU (Hendrycks et al., 2021), Mix-Instruct aligned with MT-Bench (Zheng et al., 2023), and Mix-Instruct aligned with GSM-8k (Cobbe et al., 2021). These pairings allow us to extract only the most relevant samples from a large corpus to improve performance on a specific target benchmark. 3. Continual Fine-Tuning: (a) SQuAD (Rajpurkar et al., 2016) paired with HotpotQA to inject more complex, multi-hop reasoning data after simpler QA, and (b) a proprietary IBM/Government domain query-rewriting dataset. |
| Dataset Splits | Yes | In all cases, we fixed an approximate budget of 30% for subset selection unless otherwise noted, striking a balance between data efficiency and coverage. Beyond consistently using 30% of the data in our main experiments, we investigated how varying the subset size influences performance. We tested budgets ranging from as little as 5% up to 50% of the original training set (in increments of 10%). |
| Hardware Specification | No | A part of this work used the Delta system at the National Center for Supercomputing Applications through allocation CIS240550 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. |
| Software Dependencies | No | The paper mentions LLMs like Llama-3.2-3B, Mistral-7B-v0.1, Qwen2-72B-Instruct, Phi-3-mini-128k-instruct, and fine-tuning methods like ICL and QLoRA, but does not provide specific version numbers for any software libraries or packages used for implementation. |
| Experiment Setup | Yes | Consistent hyperparameter settings were maintained across all experiments to ensure reproducibility. Submodular function: Facility Location (FL), Facility Location Mutual Information (FLMI), or Facility Location Conditional Gain (FLCG), depending on the use case; utility metric scaling factors: η = 1 for FLMI and ν = 1 for FLCG; budget: fixed at 30% of the data for all subset selection experiments; optimization algorithm: greedy maximization with a stopping criterion based on the budget; distance metric: length-normalized L2 norm; teacher forcing: applied during utility metric computation to ensure reliable prediction accuracy measurement. |
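The "Pseudocode" and "Experiment Setup" rows above refer to greedy maximization of a facility-location objective under a fixed budget. The following is a minimal NumPy sketch of that selection loop, not the paper's implementation: the function names, the conversion of length-normalized L2 distances into similarities, and the toy embeddings are our own assumptions.

```python
import numpy as np

def pairwise_similarity(X):
    """Similarity matrix built from length-normalized L2 distances
    (one plausible reading of the paper's distance metric; assumption)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize rows
    d = np.linalg.norm(Xn[:, None, :] - Xn[None, :, :], axis=-1)
    return d.max() - d  # flip distance into a nonnegative similarity

def greedy_facility_location(sim, budget):
    """Greedy maximization of f(S) = sum_i max_{j in S} sim[i, j],
    stopping once `budget` items are selected."""
    n = sim.shape[0]
    selected = []
    cover = np.zeros(n)  # best similarity of each point to the current subset
    for _ in range(budget):
        # marginal gain of each candidate: improvement in total coverage
        gains = np.maximum(sim, cover[:, None]).sum(axis=0) - cover.sum()
        gains[selected] = -np.inf  # never re-pick a selected item
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return selected
```

With two tight clusters of embeddings and a budget of 2, the greedy loop picks one representative per cluster, which is the diversity behavior facility location is chosen for.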
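The setup row also pairs the length-normalized L2 norm with teacher forcing when computing the utility metric. As a rough illustration only, the sketch below scores how much an in-context example improves teacher-forced predictions of a target sequence; the function signature, array shapes, and sign convention are our assumptions, not details taken from the paper.

```python
import numpy as np

def utility(gt, p_plain, p_cond):
    """Pairwise utility sketch (assumed formulation): the drop in
    length-normalized L2 distance to the ground truth when the model is
    conditioned on an in-context example.

    gt      : (T, V) one-hot ground-truth token distributions
    p_plain : (T, V) teacher-forced token probabilities, no in-context example
    p_cond  : (T, V) teacher-forced token probabilities, with the example
    Returns a positive value when the in-context example helps.
    """
    # per-token L2 distance to ground truth, averaged over sequence length
    d = lambda p: np.linalg.norm(gt - p, axis=-1).mean()
    return d(p_plain) - d(p_cond)
```

Utilities like this would populate the kernel that the FL/FLMI/FLCG objectives in the table consume, so a positive score marks an example worth keeping in the 30% budget.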