AgentRefine: Enhancing Agent Generalization through Refinement Tuning

Authors: Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma GongQue, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, Weiran Xu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on five agent evaluation tasks demonstrate that AgentRefine significantly outperforms state-of-the-art agent-tuning work. The key findings are summarized as follows: while existing agent-tuning works improve held-in agent performance, they hardly generalize to new agent tasks. In contrast, AgentRefine does not depend on memorizing training trajectories but learns to self-refine its mistakes and to explore more actions and reasonable paths. Our experiments demonstrate that agent tuning on normal trajectories is brittle to small perturbations of the agent environment, such as changes to the action descriptions, whereas refinement tuning exhibits greater robustness to environmental changes. Further analysis indicates that the diversity of agent environments and thoughts contributes to refinement tuning.
Researcher Affiliation Collaboration Dayuan Fu1, Keqing He2, Yejie Wang1, Wentao Hong1, Zhuoma Gongque1, Weihao Zeng1, Wei Wang2, Jingang Wang2, Xunliang Cai2, Weiran Xu1. 1Beijing University of Posts and Telecommunications, Beijing, China; 2Meituan, Beijing, China.
Pseudocode Yes Algorithm 1 presents the Trajectory Verification pipeline.
Algorithm 1 Trajectory Verification
1: Input: Available Actions, Trajectory, Verified Trajectory
2: # The Verified Trajectory is set to an empty list if this is the first verification of the persona or the last generation's fault is error_num 1
3: Initialize: error_num = 0
4: if JSON format verification does not pass then
5:     return Verified Trajectory and the signal
6: end if
7: for turn in Trajectory do
8:     if JSON keys in turn do not match the requirement then
9:         return Verified Trajectory and the signal
10:     end if
11:     if Player's turn then
12:         # We only check the action when the DM considers it correct.
13:         if not next DM turn shows error signal then
14:             if Player's action does not match any action_i (and its parameters) in Available Actions then
15:                 return Verified Trajectory and the signal
16:             end if
17:         end if
18:     end if
19:     if DM's turn then
20:         if Error signal then
21:             error_num += 1
22:         end if
23:         if This is the last turn then
24:             # The last turn should not have any error
25:             if Error signal then
26:                 return Verified Trajectory and the signal
27:             end if
28:             # The last turn should finish the task
29:             if no "Task Succeed" in Observation then
30:                 return Verified Trajectory and the signal
31:             end if
32:             # We need at least one error-refine turn.
33:             if error_num < 1 then
34:                 return Verified Trajectory and the signal
35:             end if
36:         end if
37:     end if
38:     Verified Trajectory ← Verified Trajectory + turn
39: end for
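The verification loop above can be sketched in Python. This is a minimal illustration, not the authors' released implementation: the turn schema (`role`, `thought`, `action`, `observation` keys), the substring checks for the error signal and the "Task Succeed" string, and the function name are all assumptions.

```python
# Hypothetical sketch of Algorithm 1 (Trajectory Verification).
# Assumed turn schema: player turns carry "thought"/"action", DM turns
# carry "observation"; an error is signaled by "error" in the observation.
REQUIRED_KEYS = {"player": {"thought", "action"}, "dm": {"observation"}}

def verify_trajectory(available_actions, trajectory, verified=None):
    """Return (verified_prefix, ok). `verified` carries over previously
    verified turns; it starts empty on the first verification pass."""
    verified = list(verified or [])
    error_num = 0
    for i, turn in enumerate(trajectory):
        role = turn.get("role")
        # JSON keys in the turn must match the requirement exactly.
        if role not in REQUIRED_KEYS or set(turn) - {"role"} != REQUIRED_KEYS[role]:
            return verified, False
        if role == "player":
            # Only check the action when the next DM turn considers it correct.
            next_dm = trajectory[i + 1] if i + 1 < len(trajectory) else None
            dm_ok = next_dm is None or "error" not in next_dm["observation"].lower()
            if dm_ok and turn["action"] not in available_actions:
                return verified, False
        else:  # DM turn
            has_error = "error" in turn["observation"].lower()
            if has_error:
                error_num += 1
            if i == len(trajectory) - 1:
                # Last turn must be error-free, finish the task, and the
                # trajectory must contain at least one error-refine turn.
                if has_error or "Task Succeed" not in turn["observation"] or error_num < 1:
                    return verified, False
        verified.append(turn)
    return verified, True
```

A trajectory that never triggers an error is rejected even if it succeeds, which is what forces the training data to contain explicit mistake-then-refinement turns.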
Open Source Code Yes Code: https://github.com/Fu-Dayuan/AgentRefine
Open Datasets Yes Tasks We select 5 tasks: SciWorld (Wang et al., 2022), Alfworld (Shridhar et al., 2020), BabyAI (Chevalier-Boisvert et al., 2018), PDDL (Vallati et al., 2015), and Jericho (Hausknecht et al., 2020), all of which test a model's decision-making ability. ... We choose a reasoning task, HotpotQA (Yang et al., 2018), in the ablation experiment.
Dataset Splits Yes We choose a reasoning task, HotpotQA (Yang et al., 2018), in the ablation experiment. We use the Wikipedia search in LATS (Zhou et al., 2023) as the environment, randomly sample 300 questions from HotpotQA, and test the exact match (EM) and F1 score of those methods. ... Table 6 presents the number of test data and domains in the 5 tasks. These counts weight the Held-out Task score. Specifically, Held-out Task score = (BabyAI score × 112 + SciWorld score × 90 + PDDL score × 60 + Jericho score × 20) / 282.

Task     | #num | Domain
Alfworld | 134  | Household Tasks
BabyAI   | 112  | Robot Exploration
SciWorld | 90   | Science Experiment
PDDL     | 60   | Strategy Games
Jericho  | 20   | Long Text Games

Table 6: task statistics in AgentBoard. #num refers to the number of test examples.
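The held-out aggregate above is simply a size-weighted average over the four held-out tasks (Alfworld is held-in). A minimal sketch, with the function name chosen here for illustration:

```python
# Test-set sizes of the held-out tasks (from Table 6); 112+90+60+20 = 282.
HELD_OUT_SIZES = {"BabyAI": 112, "SciWorld": 90, "PDDL": 60, "Jericho": 20}

def held_out_task_score(scores: dict) -> float:
    """Size-weighted average of per-task scores over the held-out tasks."""
    total = sum(HELD_OUT_SIZES.values())  # 282
    return sum(scores[task] * n for task, n in HELD_OUT_SIZES.items()) / total
```

Weighting by test-set size makes the aggregate equivalent to pooling all 282 held-out examples and averaging once.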
Hardware Specification No For all models, the learning rate is 5e-6 with a cosine learning rate scheduler and no warm-up steps. The batch size is 64. The max length is 8192 for 7/8B models and 4096 for 70B models due to limited storage for DeepSpeed (Rasley et al., 2020) usage. The paper does not specify the GPU or CPU models used, nor other hardware details beyond the memory implications of the model sizes.
Software Dependencies No We use gpt-4o-2024-05-13 to generate the script and trajectory. We use LLaMA-Factory (Zheng et al., 2024) to train our models. The paper names software (LLaMA-Factory, DeepSpeed) and model versions (gpt-4o-2024-05-13, LLaMA3, Mistral-v0.3) but does not provide version numbers for libraries or frameworks such as Python or PyTorch, which are critical for reproduction.
Experiment Setup Yes For all models, the learning rate is 5e-6 with a cosine learning rate scheduler and no warm-up steps. The batch size is 64. The max length is 8192 for 7/8B models and 4096 for 70B models due to limited storage for DeepSpeed (Rasley et al., 2020) usage. Aligned with Agent-FLAN, we choose AgentRefine with 32000 data for the default training setting. Aligned with AgentGen (Hu et al., 2024), we train our model for 10 epochs and select the checkpoint with the best average results to report. We also modified LLaMA-Factory's SFT loss to Equation 1. Other settings are aligned with LLaMA-Factory's default settings.
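The stated optimization schedule (cosine decay from 5e-6 with no warm-up) can be sketched as follows. The function name is illustrative; the step count assumes the default setting of 32000 examples, batch size 64, and 10 epochs (32000/64 × 10 = 5000 optimizer steps), and a decay target of zero, which the paper does not state explicitly.

```python
import math

BASE_LR = 5e-6  # learning rate reported in the paper

def cosine_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Cosine annealing from base_lr toward 0, with no warm-up phase."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With no warm-up, training starts at the full 5e-6 on step 0 and decays smoothly to (approximately) zero by the final step.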