Improving Data Efficiency via Curating LLM-Driven Rating Systems
Authors: Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, Wei Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that more can be less. The code is available at: https://github.com/UCSC-REAL/DS2. |
| Researcher Affiliation | Collaboration | 1University of California, Santa Cruz; 2Center for Advanced AI, Accenture; 3BIAI, ZJUT & D5Data.ai; 4The Hong Kong University of Science and Technology (Guangzhou) |
| Pseudocode | Yes | The complete pseudo-code is available in Algorithm 1. ... Algorithm 1 Proposed Data Selection Pipeline DS2 |
| Open Source Code | Yes | The code is available at: https://github.com/UCSC-REAL/DS2. |
| Open Datasets | Yes | The data pool consists of five instruct-finetuning datasets: Flan_v2 (Longpre et al., 2023), Open Assistant 1 (Köpf et al., 2024), Wizard LM (Xu et al., 2023a), Dolly (Databricks, 2023), and Stanford Alpaca (Taori et al., 2023). |
| Dataset Splits | Yes | We adopt five Open LLM Leaderboard tasks as our benchmark for evaluation, including MMLU (Hendrycks et al., 2020), Truthful QA (Lin et al., 2021), GSM (Cobbe et al., 2021), BBH (Suzgun et al., 2022), Tydi QA (Clark et al., 2020). For the MMLU, Truthful QA, GSM, and BBH datasets, we use Exact Match (EM) as the criterion. For Tydi QA, we use the 1-shot F1 score. ... We evaluate fine-tuned models on a randomly selected subset of 200 samples from the original test set (1319 samples). ... we select 40 examples from each BBH sub-task. ... We prompt the fine-tuned models to generate answers for 818 Truthful QA questions ... For each language, we select 100 examples. |
| Hardware Specification | Yes | In our experiments, we fine-tune 7B and 8B models using four or eight NVIDIA Tesla A100 GPUs. ... The wall-clock running time is measured on a Microsoft Azure 8×A100 (80GB) GPU cluster. |
| Software Dependencies | No | The paper mentions applying Lora (Hu et al., 2021) as a method, but does not provide specific software names with version numbers for libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | For all experiments based on 7B/8B models, we consistently apply Lora (Hu et al., 2021) with a rank size of 64 and a scaling factor of 16. We set the overall batch size to 128, the learning rate to 1e-4, the training epochs to 5, the dropout rate to 0.1, and the warmup ratio to 0.03. The default maximum input length is 2048 tokens for all models. |
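The reported hyperparameters can be collected into a single configuration object. This is a minimal sketch: the values (LoRA rank 64, scaling factor 16, overall batch size 128, learning rate 1e-4, 5 epochs, dropout 0.1, warmup ratio 0.03, max length 2048) come from the paper, but the `FinetuneConfig` dataclass and the gradient-accumulation helper are illustrative assumptions, not the authors' released code.

```python
# Sketch of the paper's reported fine-tuning hyperparameters.
# The dataclass layout and grad_accum_steps helper are assumptions
# for illustration; only the numeric values are from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    lora_rank: int = 64
    lora_alpha: int = 16           # LoRA scaling factor
    lora_dropout: float = 0.1
    batch_size: int = 128          # overall (effective) batch size
    learning_rate: float = 1e-4
    num_epochs: int = 5
    warmup_ratio: float = 0.03
    max_input_length: int = 2048   # tokens

    def grad_accum_steps(self, per_device_batch: int, num_gpus: int) -> int:
        """Accumulation steps needed so that per-device batch * GPUs *
        accumulation equals the overall batch size of 128."""
        micro_batch = per_device_batch * num_gpus
        if self.batch_size % micro_batch != 0:
            raise ValueError("overall batch size must divide evenly")
        return self.batch_size // micro_batch


cfg = FinetuneConfig()
# e.g. on the 8-GPU Azure cluster with a per-device batch of 2:
print(cfg.grad_accum_steps(per_device_batch=2, num_gpus=8))  # → 8
```

The helper just makes the batch-size bookkeeping explicit; how the authors actually split the overall batch of 128 across the four or eight A100s is not stated in the paper.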