QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Authors: Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang

ICML 2025

Reproducibility assessment. Each row below gives the variable, the assessed result, and the supporting excerpt from the paper.
Research Type: Experimental. Excerpt: "In this section, we compare the performance of our QLASS with all the baselines on WebShop, SciWorld, and ALFWorld. We evaluate all algorithms using one-shot evaluation. The decoding temperatures are set to 0.7 for QLASS and Best-of-N, and to 0 for the other baselines. Overall Baseline Comparison: results are summarized in Table 2."
Researcher Affiliation: Academia. Excerpt: "University of California, Los Angeles, USA; Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Yizhou Sun <EMAIL>, Kai-Wei Chang <EMAIL>."
Pseudocode: Yes. Excerpt: "The overall pipeline is shown in Figure 1 and Algorithm 1." The paper provides Algorithm 1 (General QLASS Pipeline), Algorithm 2 (Constructing a Reasoning Tree), Algorithm 3 (Q-value Estimation), and Algorithm 4 (Q-guided Generation).
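The core idea of Q-guided generation (Algorithm 4) is: at each step, sample a small set of candidate actions from the policy and commit to the one the learned Q-network scores highest. The following is a minimal sketch, not the authors' implementation; `propose_actions`, `q_value`, and `step_fn` are hypothetical stand-ins for the policy LLM's sampler, the trained QNet, and the environment.

```python
def q_guided_generation(propose_actions, q_value, initial_state, step_fn,
                        max_steps=10, num_candidates=2):
    """Greedy Q-guided stepwise search (illustrative sketch).

    propose_actions(state, n) -> list of n candidate actions (e.g. sampled
        from the policy model at temperature 0.7).
    q_value(state, action)    -> scalar score from the learned Q-network.
    step_fn(state, action)    -> (next_state, done) environment transition.
    """
    state, trajectory = initial_state, []
    for _ in range(max_steps):
        candidates = propose_actions(state, num_candidates)
        # Commit to the candidate action the Q-network scores highest.
        best = max(candidates, key=lambda a: q_value(state, a))
        trajectory.append(best)
        state, done = step_fn(state, best)
        if done:
            break
    return trajectory, state
```

With the paper's setting of M = 2 candidates per step, this reduces each decision to a single pairwise comparison under the Q-network, keeping inference cost close to greedy decoding.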
Open Source Code: Yes. Excerpt: "We will release our code and data in https://github.com/Rafa-zy/QLASS"
Open Datasets: Yes. Excerpt: "We assess the ability of QLASS on WebShop (Yao et al., 2022), ALFWorld (Shridhar et al., 2021) and SciWorld (Wang et al., 2022a)."
Dataset Splits: Yes. Excerpt: "Table 1. The statistics of datasets (We follow the same setup as ETO (Song et al., 2024)). Test-Seen and Test-Unseen are test sets with seen and unseen cases respectively. #Turns means the average number of interaction turns for the SFT trajectories."

Dataset     #Train   #Test-Seen   #Test-Unseen   #Turns
WebShop      1,938      200           -            4.9
SciWorld     1,483      194          241          14.4
ALFWorld     3,321      140          134          10.1

(WebShop reports no Test-Unseen count in the source.)
Hardware Specification: Yes. Excerpt: "We train our models mainly using 4 or 8 A6000 GPUs."
Software Dependencies: No. The paper mentions "Llama-2-7B-Chat as base policy model and QNet backbone" and "bfloat16 precision", but it does not name specific libraries, frameworks, or language versions.
Experiment Setup: Yes. Excerpt: "The detailed hyper-parameters for training and model architectures can be found in Appendix A.2."

Table 5. Hyperparameters used in QLASS.
Hyperparameter                                  Value
Batch size                                      64
Number of training epochs                       3
Weight decay                                    0.0
Warmup ratio                                    0.03
Learning rate                                   1e-5
LR scheduler type                               Cosine
Logging steps                                   5
Model max length                                4096
Discount factor γ                               0.9
Maximum expansion depth D on WebShop            3
Maximum expansion depth D on SciWorld           6
Maximum expansion depth D on ALFWorld           8
Action candidate set size M for inference       2
Sampled trajectory number N for self-training   1
Exploration temperature                         0.7
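The discount factor γ = 0.9 listed above enters Q-value estimation (Algorithm 3) as a Bellman-style backup over the explored reasoning tree: a node's value is its reward plus the discounted value of its best child. The sketch below is illustrative only; the `reward`/`children` node fields are an assumed tree representation, not the paper's data format.

```python
def estimate_q(node, gamma=0.9):
    """Backward Q-value estimation over an explored reasoning tree
    (illustrative sketch of a Bellman-style backup).

    node: dict with keys "reward" (reward collected at this node) and
          "children" (list of child nodes explored beneath it).
    Annotates each node with its estimate under key "q" and returns
    the root's Q-value.
    """
    if not node["children"]:
        # Leaf: no future steps, so the Q-value is just the reward.
        node["q"] = node["reward"]
    else:
        # Bellman backup: reward plus discounted best child value.
        node["q"] = node["reward"] + gamma * max(
            estimate_q(child, gamma) for child in node["children"])
    return node["q"]
```

For example, a root with reward 0 whose best leaf child has reward 1 receives the estimate 0 + 0.9 × 1 = 0.9, so values discovered deeper in the tree are discounted toward the root.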