EDGE: Efficient Data Selection for LLM Agents via Guideline Effectiveness

Authors: Yunxiao Zhang, Guanming Xiong, Haochen Li, Wen Zhao

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "Extensive experiments validate the performance of our method. Our method achieves competitive results on the Hotpot QA and Web Shop datasets, requiring 75% and 50% less data, respectively, while outperforming existing methods."
Researcher Affiliation: Collaboration — Yunxiao Zhang¹, Guanming Xiong¹, Haochen Li², and Wen Zhao¹; ¹Peking University, ²01.AI (email addresses redacted).
Pseudocode: No — The paper describes its steps in regular paragraph text, without structured formatting such as pseudocode or algorithm blocks.
Open Source Code: No — The paper contains no explicit statement about releasing source code and no link to a code repository for the described methodology.
Open Datasets: Yes — "Hotpot QA [Yang et al., 2018] is a multi-hop question-answering benchmark... Web Shop [Yao et al., 2022] is a simulated online shopping environment..."
Dataset Splits: Yes — "For Hotpot QA, we use the first 10,000 training questions as the data pool and randomly select 500 dev questions. For Web Shop, we use 8,500 instructions as the data pool and another 500 instructions for evaluation. For each dataset, we selected 30 samples with the lowest GE score for guideline updating, and then annotated 800 samples for fine-tuning."
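The selection step quoted above (take the 30 pool samples with the lowest Guideline Effectiveness score) can be sketched in a few lines. This is an illustrative assumption, not the authors' code: the `ge_score` field name and list-of-dicts layout are ours.

```python
# Hypothetical sketch of EDGE's data-selection step: keep the k samples
# with the LOWEST Guideline Effectiveness (GE) score for guideline updating.
# Field names ("id", "ge_score") are illustrative assumptions.

def select_lowest_ge(samples, k=30):
    """Return the k samples with the lowest GE score, ascending."""
    return sorted(samples, key=lambda s: s["ge_score"])[:k]

# Toy data pool standing in for the paper's 10,000/8,500-sample pools.
pool = [{"id": i, "ge_score": score}
        for i, score in enumerate([0.9, 0.1, 0.5, 0.3])]

lowest = select_lowest_ge(pool, k=2)  # the two lowest-GE samples
```

In the paper's setting, `k=30` and the pool is the full training set for each benchmark.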
Hardware Specification: Yes — "For fine-tuning, we choose LLAMA-3.1-8B-Instruct (L-8B) and Mistral-7B-Instruct-v0.3 (M-7B), training for 4 epochs with a learning rate of 5e-6 using 8 NVIDIA 80GB A100 GPUs."
Software Dependencies: No — The paper mentions the OpenAI GPT-4o API (gpt-4o-2024-08-06) and specific pre-trained models (LLAMA-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3) but does not give versions for general software dependencies such as the programming language (e.g., Python), deep-learning frameworks (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup: Yes — "For all inference, we set temperature=0.7, top_p=0.95, max length=512. For fine-tuning, we choose LLAMA-3.1-8B-Instruct (L-8B) and Mistral-7B-Instruct-v0.3 (M-7B), training for 4 epochs with a learning rate of 5e-6 using 8 NVIDIA 80GB A100 GPUs."
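The reported inference and fine-tuning hyperparameters can be collected into a small configuration sketch. The variable names and dict structure below are our own illustration; only the numeric values come from the paper's quoted setup.

```python
# Hyperparameters as reported in the paper's experiment setup.
# The config names and layout are illustrative assumptions, not the
# authors' actual training code.

INFERENCE_CONFIG = {
    "temperature": 0.7,   # sampling temperature, used for all inference
    "top_p": 0.95,        # nucleus-sampling cutoff
    "max_length": 512,    # maximum generation length
}

FINETUNE_CONFIG = {
    "models": ["LLAMA-3.1-8B-Instruct",      # L-8B
               "Mistral-7B-Instruct-v0.3"],  # M-7B
    "epochs": 4,
    "learning_rate": 5e-6,
    "hardware": "8x NVIDIA A100 80GB",
}
```

A reproduction attempt would still need to fill in the unreported pieces (framework, optimizer, batch size), which is exactly the gap the "Software Dependencies: No" row flags.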