Improving Data Efficiency via Curating LLM-Driven Rating Systems
Authors: Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, Wei Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that more can be less. The code is available at: https://github.com/UCSC-REAL/DS2. |
| Researcher Affiliation | Collaboration | 1University of California, Santa Cruz; 2Center for Advanced AI, Accenture; 3BIAI, ZJUT & D5Data.ai; 4The Hong Kong University of Science and Technology (Guangzhou) |
| Pseudocode | Yes | The complete pseudo-code is available in Algorithm 1. ... Algorithm 1 Proposed Data Selection Pipeline DS2 |
| Open Source Code | Yes | The code is available at: https://github.com/UCSC-REAL/DS2. |
| Open Datasets | Yes | The data pool consists of five instruct-finetuning datasets: Flan_v2 (Longpre et al., 2023), Open Assistant 1 (Köpf et al., 2024), Wizard LM (Xu et al., 2023a), Dolly (Databricks, 2023), and Stanford Alpaca (Taori et al., 2023). |
| Dataset Splits | Yes | We adopt five Open LLM Leaderboard tasks as our benchmark for evaluation, including MMLU (Hendrycks et al., 2020), Truthful QA (Lin et al., 2021), GSM (Cobbe et al., 2021), BBH (Suzgun et al., 2022), Tydi QA (Clark et al., 2020). For the MMLU, Truthful QA, GSM, and BBH datasets, we use Exact Match (EM) as the criterion. For Tydi QA, we use the 1-shot F1 score. ... We evaluate fine-tuned models on a randomly selected subset of 200 samples from the original test set (1319 samples). ... we select 40 examples from each BBH sub-task. ... We prompt the fine-tuned models to generate answers for 818 Truthful QA questions ... For each language, we select 100 examples. |
| Hardware Specification | Yes | In our experiments, we fine-tune 7B and 8B models using four or eight NVIDIA Tesla A100 GPUs. ... The wall-clock running time is measured on a Microsoft Azure 8×A100 (80GB) GPU cluster. |
| Software Dependencies | No | The paper mentions applying Lora (Hu et al., 2021) as a method, but does not provide specific software names with version numbers for libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | For all experiments based on 7B/8B models, we consistently apply Lora (Hu et al., 2021) with a rank size of 64 and a scaling factor of 16. We set the overall batch size to 128, the learning rate to 1e-4, the training epochs to 5, the dropout rate to 0.1, and the warmup ratio to 0.03. The default maximum input length is 2048 tokens for all models. |
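The reported hyperparameters can be collected into a single configuration object. This is a minimal sketch: the values (LoRA rank 64, scaling factor 16, overall batch size 128, learning rate 1e-4, 5 epochs, dropout 0.1, warmup ratio 0.03, max length 2048) come from the paper, but the `FinetuneConfig` dataclass and the gradient-accumulation helper are illustrative assumptions, not the authors' released code.

```python
# Sketch of the paper's reported fine-tuning hyperparameters.
# The dataclass layout and grad_accum_steps helper are assumptions
# for illustration; only the numeric values are from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    lora_rank: int = 64
    lora_alpha: int = 16           # LoRA scaling factor
    lora_dropout: float = 0.1
    batch_size: int = 128          # overall (effective) batch size
    learning_rate: float = 1e-4
    num_epochs: int = 5
    warmup_ratio: float = 0.03
    max_input_length: int = 2048   # tokens

    def grad_accum_steps(self, per_device_batch: int, num_gpus: int) -> int:
        """Accumulation steps needed so that per-device batch * GPUs *
        accumulation equals the overall batch size of 128."""
        micro_batch = per_device_batch * num_gpus
        if self.batch_size % micro_batch != 0:
            raise ValueError("overall batch size must divide evenly")
        return self.batch_size // micro_batch


cfg = FinetuneConfig()
# e.g. on the 8-GPU Azure cluster with a per-device batch of 2:
print(cfg.grad_accum_steps(per_device_batch=2, num_gpus=8))  # → 8
```

The helper just makes the batch-size bookkeeping explicit; how the authors actually split the overall batch of 128 across the four or eight A100s is not stated in the paper.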