Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models
Authors: Yuanzhao Zhai, Tingkai Yang, Kele Xu, Dawei Feng, Cheng Yang, Bo Ding, Huaimin Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments across diverse domains, including web navigation and interactive question answering. The results demonstrate that Q-value models can clearly distinguish actions that lead to success or failure, enhancing decision-making for LLM agents by selecting effective actions at each step. Additionally, task-dependent Q-value models generalize across different LLM agents, allowing us to use inexpensive LLM agents to collect training data while enhancing the decision-making of more advanced LLM agents in a plug-and-play manner. Furthermore, our method complements the design of effective prompting strategies, and integrating it with these strategies can further improve performance. |
| Researcher Affiliation | Academia | 1 National University of Defense Technology, Changsha, China; 2 State Key Laboratory of Complex & Critical Software Environment; 3 Hunan Institute of Advanced Technology, Changsha, China |
| Pseudocode | No | The paper describes the MCTS process and DPO algorithm using natural language and mathematical formulas (e.g., Equation 4 for UCT, Equation 5 for V(st) update, Equation 7 for Ltrajectory, Equation 8 for Lstep). However, it does not present a distinct, structured block explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release or a direct link to a source-code repository for the methodology described. It mentions "Phi-1.5" (huggingface.co/microsoft/phi-1_5), but this refers to a base model used, not the authors' implementation code. |
| Open Datasets | Yes | We evaluate our method on two tasks across different domains: WebShop (Yao et al. 2022) and HotPotQA (Yang et al. 2018). |
| Dataset Splits | Yes | For HotPotQA, we randomly select 1000 questions for training, 100 for validation, and 100 for testing. For WebShop, we follow the data split described in Song et al. (2024), which consists of 1824 instructions for training, 100 questions for validation, and 100 questions for testing. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A40 48G GPU, except when implementing fine-tuning-based methods, which require two NVIDIA A100 80G GPUs. |
| Software Dependencies | No | The paper mentions several LLM models used (Phi-3-mini-4k-instruct, Llama-3.1-8B-Instruct, GPT-4o-mini, GPT-4-turbo, Phi-1.5) but does not provide specific version numbers for any underlying software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The maximum context length is set to 4096. We include 3-shot in-context examples in the instruction prompt for both tasks. The maximum number of decision-making steps is set to 10 for WebShop and 7 for HotPotQA. The number of candidate actions is set to n = 5, unless otherwise specified, for both our method and BoN. In our previous experiments, we set the MCTS iteration to m = 30 by default. |
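The step-level selection described in the setup (score n = 5 candidate actions with the trained Q-value model, act greedily on the scores) can be sketched as below. This is a minimal illustration, not the authors' released code; `q_model` and the toy scores are hypothetical stand-ins for the trained step-level Q-value model.

```python
def select_action(candidates, q_model):
    """Greedy step-level selection: return the candidate action
    with the highest predicted Q-value under q_model."""
    return max(candidates, key=q_model)

# Toy Q-value table standing in for the trained scorer (hypothetical values).
toy_scores = {"search[red shoes]": 0.8, "click[buy now]": 0.3, "back": 0.1}

# n = 5 in the paper; three candidates suffice for illustration.
candidates = ["search[red shoes]", "click[buy now]", "back"]
best = select_action(candidates, lambda a: toy_scores.get(a, 0.0))
```

The same loop would run once per decision step, up to the per-task step limits quoted above (10 for WebShop, 7 for HotPotQA).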