Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models
Authors: Yuanzhao Zhai, Tingkai Yang, Kele Xu, Dawei Feng, Cheng Yang, Bo Ding, Huaimin Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments across diverse domains, including web navigation and interactive question answering. The results demonstrate that Q-value models can clearly distinguish actions that lead to success or failure, enhancing decision-making for LLM agents by selecting effective actions at each step. Additionally, task-dependent Q-value models generalize across different LLM agents, allowing us to use inexpensive LLM agents to collect training data while enhancing the decision-making of more advanced LLM agents in a plug-and-play manner. Furthermore, our method complements the design of effective prompting strategies, and integrating it with these strategies can further improve performance. |
| Researcher Affiliation | Academia | 1 National University of Defense Technology, Changsha, China; 2 State Key Laboratory of Complex & Critical Software Environment; 3 Hunan Institute of Advanced Technology, Changsha, China |
| Pseudocode | No | The paper describes the MCTS process and DPO algorithm using natural language and mathematical formulas (e.g., Equation 4 for UCT, Equation 5 for V(st) update, Equation 7 for Ltrajectory, Equation 8 for Lstep). However, it does not present a distinct, structured block explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release or a direct link to a source-code repository for the methodology described. It mentions "Phi-1.5" (huggingface.co/microsoft/phi-1_5), but this refers to a base model used, not the authors' implementation code. |
| Open Datasets | Yes | We evaluate our method on two tasks across different domains: WebShop (Yao et al. 2022) and HotPotQA (Yang et al. 2018). |
| Dataset Splits | Yes | For HotPotQA, we randomly select 1000 questions for training, 100 for validation, and 100 for testing. For WebShop, we follow the data split described in Song et al. (2024), which consists of 1824 instructions for training, 100 questions for validation, and 100 questions for testing. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A40 48G GPU, except when implementing fine-tuning-based methods, which require two NVIDIA A100 80G GPUs. |
| Software Dependencies | No | The paper mentions several LLM models used (Phi-3-mini-4k-instruct, Llama-3.1-8B-Instruct, GPT-4o-mini, GPT-4-turbo, Phi-1.5) but does not provide specific version numbers for any underlying software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The maximum context length is set to 4096. We include 3-shot in-context examples in the instruction prompt for both tasks. The maximum number of decision-making steps is set to 10 for WebShop and 7 for HotPotQA. The number of candidate actions is set to n = 5, unless otherwise specified, for both our method and BoN. In our previous experiments, we set the MCTS iteration to m = 30 by default. |
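The step-level selection described in the setup (score n = 5 candidate actions with the trained Q-value model, act greedily on the scores) can be sketched as below. This is a minimal illustration, not the authors' released code; `q_model` and the toy scores are hypothetical stand-ins for the trained step-level Q-value model.

```python
def select_action(candidates, q_model):
    """Greedy step-level selection: return the candidate action
    with the highest predicted Q-value under q_model."""
    return max(candidates, key=q_model)

# Toy Q-value table standing in for the trained scorer (hypothetical values).
toy_scores = {"search[red shoes]": 0.8, "click[buy now]": 0.3, "back": 0.1}

# n = 5 in the paper; three candidates suffice for illustration.
candidates = ["search[red shoes]", "click[buy now]", "back"]
best = select_action(candidates, lambda a: toy_scores.get(a, 0.0))
```

The same loop would run once per decision step, up to the per-task step limits quoted above (10 for WebShop, 7 for HotPotQA).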