QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
Authors: Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we compare the performance of our QLASS with all the baselines on WebShop, SciWorld, and ALFWorld. We evaluate all algorithms using one-shot evaluation. The decoding temperatures are set to 0.7 for QLASS and Best-of-N and 0 for other baselines. Overall Baseline Comparison. Results are summarized in Table 2. |
| Researcher Affiliation | Academia | 1University of California, Los Angeles, USA 2Shanghai Jiaotong University, Shanghai, China. Correspondence to: Yizhou Sun <EMAIL>, Kai-Wei Chang <EMAIL>. |
| Pseudocode | Yes | The overall pipeline is shown in Figure 1 and Algorithm 1. Algorithm 1 General QLASS Pipeline Algorithm 2 Constructing a Reasoning Tree Algorithm 3 Q-value Estimation Algorithm 4 Q-guided Generation |
| Open Source Code | Yes | We will release our code and data in https://github.com/Rafa-zy/QLASS |
| Open Datasets | Yes | We assess the ability of QLASS on WebShop (Yao et al., 2022), ALFWorld (Shridhar et al., 2021) and SciWorld (Wang et al., 2022a). |
| Dataset Splits | Yes | Table 1. The statistics of datasets (we follow the same setup as ETO (Song et al., 2024)). Test-Seen and Test-Unseen are test sets with seen and unseen cases respectively; #Turns is the average number of interaction turns for the SFT trajectories. WebShop: 1,938 train, 200 test, 4.9 turns. SciWorld: 1,483 train, 194 test-seen, 241 test-unseen, 14.4 turns. ALFWorld: 3,321 train, 140 test-seen, 134 test-unseen, 10.1 turns. |
| Hardware Specification | Yes | We train our models mainly using 4 or 8 A6000 GPUs. |
| Software Dependencies | No | The paper mentions "Llama-2-7B-Chat as base policy model and QNet backbone" and "bfloat16 precision" but does not provide specific software names with version numbers for libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | The detailed hyper-parameters for training and model architectures can be found in Appendix A.2. Table 5. Hyperparameters used in QLASS: batch size 64; number of training epochs 3; weight decay 0.0; warmup ratio 0.03; learning rate 1e-5; cosine LR scheduler; logging steps 5; model max length 4096; discount factor γ = 0.9; maximum expansion depth D = 3 (WebShop), 6 (SciWorld), 8 (ALFWorld); action candidate set size M = 2 for inference; sampled trajectory number N = 1 for self-training; exploration temperature 0.7. |
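The reported hyperparameters (action candidate set size M = 2, exploration temperature 0.7) correspond to the paper's Q-guided generation step (Algorithm 4): at each turn, sample M candidate actions from the policy and keep the one the Q-model scores highest. A minimal sketch, assuming hypothetical `policy_sample` and `q_value` callables standing in for the policy LLM and the trained QNet:

```python
def q_guided_step(state, policy_sample, q_value, M=2, temperature=0.7):
    """One Q-guided generation step (illustrative sketch, not the paper's code).

    policy_sample(state, temperature) -> action string  (hypothetical stub)
    q_value(state, action) -> float                     (hypothetical stub)
    """
    # Sample M candidate actions from the policy at the exploration temperature.
    candidates = [policy_sample(state, temperature) for _ in range(M)]
    # Score each candidate with the Q-value model and act greedily over them.
    return max(candidates, key=lambda a: q_value(state, a))
```

With M = 2 this reduces each step to a pairwise comparison under the Q-model, which keeps inference cost close to greedy decoding while still using the learned value signal.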
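The discount factor γ = 0.9 in Table 5 is used when estimating Q-values over the exploration tree (Algorithms 2 and 3). A minimal sketch of one plausible bottom-up backup, under the assumed semantics that a node's Q-value is its immediate reward plus the discounted best child value; the `Node` structure and recursion are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    reward: float = 0.0                       # immediate reward at this step
    children: list = field(default_factory=list)


def backup_q(node, gamma=0.9):
    """Estimate a node's Q-value by recursive backup over the reasoning tree
    (hedged sketch: leaves return their reward; internal nodes add the
    discounted maximum over child Q-values)."""
    if not node.children:
        return node.reward
    return node.reward + gamma * max(backup_q(c, gamma) for c in node.children)
```

These backed-up values would then serve as regression targets for training the QNet on (state, action) pairs from the collected trees.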