QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Authors: Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang

ICML 2025

Reproducibility assessment. Each row below gives the variable, the assessed result, and the supporting excerpt from the paper.
Research Type: Experimental. Excerpt: "In this section, we compare the performance of our QLASS with all the baselines on WebShop, SciWorld, and ALFWorld. We evaluate all algorithms using one-shot evaluation. The decoding temperatures are set to 0.7 for QLASS and Best-of-N, and to 0 for the other baselines. Overall Baseline Comparison: results are summarized in Table 2."
Researcher Affiliation: Academia. Excerpt: "University of California, Los Angeles, USA; Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Yizhou Sun <EMAIL>, Kai-Wei Chang <EMAIL>."
Pseudocode: Yes. Excerpt: "The overall pipeline is shown in Figure 1 and Algorithm 1." The paper provides Algorithm 1 (General QLASS Pipeline), Algorithm 2 (Constructing a Reasoning Tree), Algorithm 3 (Q-value Estimation), and Algorithm 4 (Q-guided Generation).
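The core idea of Q-guided generation (Algorithm 4) is: at each step, sample a small set of candidate actions from the policy and commit to the one the learned Q-network scores highest. The following is a minimal sketch, not the authors' implementation; `propose_actions`, `q_value`, and `step_fn` are hypothetical stand-ins for the policy LLM's sampler, the trained QNet, and the environment.

```python
def q_guided_generation(propose_actions, q_value, initial_state, step_fn,
                        max_steps=10, num_candidates=2):
    """Greedy Q-guided stepwise search (illustrative sketch).

    propose_actions(state, n) -> list of n candidate actions (e.g. sampled
        from the policy model at temperature 0.7).
    q_value(state, action)    -> scalar score from the learned Q-network.
    step_fn(state, action)    -> (next_state, done) environment transition.
    """
    state, trajectory = initial_state, []
    for _ in range(max_steps):
        candidates = propose_actions(state, num_candidates)
        # Commit to the candidate action the Q-network scores highest.
        best = max(candidates, key=lambda a: q_value(state, a))
        trajectory.append(best)
        state, done = step_fn(state, best)
        if done:
            break
    return trajectory, state
```

With the paper's setting of M = 2 candidates per step, this reduces each decision to a single pairwise comparison under the Q-network, keeping inference cost close to greedy decoding.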
Open Source Code: Yes. Excerpt: "We will release our code and data in https://github.com/Rafa-zy/QLASS"
Open Datasets: Yes. Excerpt: "We assess the ability of QLASS on WebShop (Yao et al., 2022), ALFWorld (Shridhar et al., 2021) and SciWorld (Wang et al., 2022a)."
Dataset Splits: Yes. Excerpt: "Table 1. The statistics of datasets (We follow the same setup as ETO (Song et al., 2024)). Test-Seen and Test-Unseen are test sets with seen and unseen cases respectively. #Turns means the average number of interaction turns for the SFT trajectories."

Dataset     #Train   #Test-Seen   #Test-Unseen   #Turns
WebShop      1,938      200           -            4.9
SciWorld     1,483      194          241          14.4
ALFWorld     3,321      140          134          10.1

(WebShop reports no Test-Unseen count in the source.)
Hardware Specification: Yes. Excerpt: "We train our models mainly using 4 or 8 A6000 GPUs."
Software Dependencies: No. The paper mentions "Llama-2-7B-Chat as base policy model and QNet backbone" and "bfloat16 precision", but it does not name specific libraries, frameworks, or language versions.
Experiment Setup: Yes. Excerpt: "The detailed hyper-parameters for training and model architectures can be found in Appendix A.2."

Table 5. Hyperparameters used in QLASS.
Hyperparameter                                  Value
Batch size                                      64
Number of training epochs                       3
Weight decay                                    0.0
Warmup ratio                                    0.03
Learning rate                                   1e-5
LR scheduler type                               Cosine
Logging steps                                   5
Model max length                                4096
Discount factor γ                               0.9
Maximum expansion depth D on WebShop            3
Maximum expansion depth D on SciWorld           6
Maximum expansion depth D on ALFWorld           8
Action candidate set size M for inference       2
Sampled trajectory number N for self-training   1
Exploration temperature                         0.7
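The discount factor γ = 0.9 listed above enters Q-value estimation (Algorithm 3) as a Bellman-style backup over the explored reasoning tree: a node's value is its reward plus the discounted value of its best child. The sketch below is illustrative only; the `reward`/`children` node fields are an assumed tree representation, not the paper's data format.

```python
def estimate_q(node, gamma=0.9):
    """Backward Q-value estimation over an explored reasoning tree
    (illustrative sketch of a Bellman-style backup).

    node: dict with keys "reward" (reward collected at this node) and
          "children" (list of child nodes explored beneath it).
    Annotates each node with its estimate under key "q" and returns
    the root's Q-value.
    """
    if not node["children"]:
        # Leaf: no future steps, so the Q-value is just the reward.
        node["q"] = node["reward"]
    else:
        # Bellman backup: reward plus discounted best child value.
        node["q"] = node["reward"] + gamma * max(
            estimate_q(child, gamma) for child in node["children"])
    return node["q"]
```

For example, a root with reward 0 whose best leaf child has reward 1 receives the estimate 0 + 0.9 × 1 = 0.9, so values discovered deeper in the tree are discounted toward the root.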