DELIFT: Data Efficient Language model Instruction Fine-Tuning
Authors: Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevsky
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across multiple datasets and model scales show DELIFT reduces fine-tuning data requirements by up to 70% without compromising performance, consistently outperforming existing methods by up to 26% in effectiveness and efficiency. |
| Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign, 2IBM Research |
| Pseudocode | Yes | Algorithm 1 Greedy Maximization for Submodular Function |
| Open Source Code | Yes | Our complete code base is publicly available at https://github.com/agarwalishika/delift, enabling further exploration and replication. |
| Open Datasets | Yes | Datasets. We group the datasets by the primary goal of fine-tuning, ensuring a clear mapping from the data to the corresponding submodular objective. In particular: 1. Instruction Tuning: Mix-Instruct (Jiang et al., 2023), P3 (Sanh et al., 2021). Both aim to enhance general instruction-following behavior, featuring a variety of task prompts and user requests. 2. Task-Specific Fine-Tuning: HotpotQA (Yang et al., 2018) aligned with MMLU (Hendrycks et al., 2021), Mix-Instruct aligned with MT-Bench (Zheng et al., 2023), and Mix-Instruct aligned with GSM-8k (Cobbe et al., 2021). These pairings allow us to extract only the most relevant samples from a large corpus to improve performance on a specific target benchmark. 3. Continual Fine-Tuning: (a) SQuAD (Rajpurkar et al., 2016) paired with HotpotQA to inject more complex, multi-hop reasoning data after simpler QA, and (b) a proprietary IBM/Government domain query-rewriting dataset. |
| Dataset Splits | Yes | In all cases, we fixed an approximate budget of 30% for subset selection unless otherwise noted, striking a balance between data efficiency and coverage. Beyond consistently using 30% of the data in our main experiments, we investigated how varying the subset size influences performance. We tested budgets ranging from as little as 5% up to 50% of the original training set (in increments of 10%). |
| Hardware Specification | No | A part of this work used the Delta system at the National Center for Supercomputing Applications through allocation CIS240550 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. |
| Software Dependencies | No | The paper mentions LLMs like Llama-3.2-3B, Mistral-7B-v0.1, Qwen2-72B-Instruct, Phi-3-mini-128k-instruct, and fine-tuning methods like ICL and QLoRA, but does not provide specific version numbers for any software libraries or packages used for implementation. |
| Experiment Setup | Yes | Consistent hyperparameter settings were maintained across all experiments to ensure reproducibility. Submodular function: Facility Location (FL), Facility Location Mutual Information (FLMI), or Facility Location Conditional Gain (FLCG), depending on the use case; utility metric scaling factors: η = 1 for FLMI and ν = 1 for FLCG; budget: fixed at 30% of the data for all subset selection experiments; optimization algorithm: greedy maximization with a stopping criterion based on the budget; distance metric: length-normalized L2 norm; teacher forcing: applied during utility metric computation to ensure reliable prediction accuracy measurement. |
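The "Pseudocode" and "Experiment Setup" rows above refer to greedy maximization of a facility-location objective under a fixed budget. The following is a minimal NumPy sketch of that selection loop, not the paper's implementation: the function names, the conversion of length-normalized L2 distances into similarities, and the toy embeddings are our own assumptions.

```python
import numpy as np

def pairwise_similarity(X):
    """Similarity matrix built from length-normalized L2 distances
    (one plausible reading of the paper's distance metric; assumption)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize rows
    d = np.linalg.norm(Xn[:, None, :] - Xn[None, :, :], axis=-1)
    return d.max() - d  # flip distance into a nonnegative similarity

def greedy_facility_location(sim, budget):
    """Greedy maximization of f(S) = sum_i max_{j in S} sim[i, j],
    stopping once `budget` items are selected."""
    n = sim.shape[0]
    selected = []
    cover = np.zeros(n)  # best similarity of each point to the current subset
    for _ in range(budget):
        # marginal gain of each candidate: improvement in total coverage
        gains = np.maximum(sim, cover[:, None]).sum(axis=0) - cover.sum()
        gains[selected] = -np.inf  # never re-pick a selected item
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return selected
```

With two tight clusters of embeddings and a budget of 2, the greedy loop picks one representative per cluster, which is the diversity behavior facility location is chosen for.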
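The setup row also pairs the length-normalized L2 norm with teacher forcing when computing the utility metric. As a rough illustration only, the sketch below scores how much an in-context example improves teacher-forced predictions of a target sequence; the function signature, array shapes, and sign convention are our assumptions, not details taken from the paper.

```python
import numpy as np

def utility(gt, p_plain, p_cond):
    """Pairwise utility sketch (assumed formulation): the drop in
    length-normalized L2 distance to the ground truth when the model is
    conditioned on an in-context example.

    gt      : (T, V) one-hot ground-truth token distributions
    p_plain : (T, V) teacher-forced token probabilities, no in-context example
    p_cond  : (T, V) teacher-forced token probabilities, with the example
    Returns a positive value when the in-context example helps.
    """
    # per-token L2 distance to ground truth, averaged over sequence length
    d = lambda p: np.linalg.norm(gt - p, axis=-1).mean()
    return d(p_plain) - d(p_cond)
```

Utilities like this would populate the kernel that the FL/FLMI/FLCG objectives in the table consume, so a positive score marks an example worth keeping in the 30% budget.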