Tree Search for Language Model Agents
Authors: Jing Yu Koh, Stephen Marcus McAleer, Daniel Fried, Ruslan Salakhutdinov
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that this search procedure is complementary with existing LM agents, and enables these models to perform better on harder and longer-horizon tasks. On VisualWebArena (Koh et al., 2024), search improves the performance of a baseline GPT-4o (OpenAI, 2024) agent by 39.7% relative to the baseline without search, setting a new state-of-the-art (SOTA) success rate of 26.4%. On WebArena (Zhou et al., 2024b), search is also highly effective, contributing a 28.0% relative improvement (yielding a competitive success rate of 19.2%). Our results are summarized in Tab. 2. Introducing search increases success rate substantially across the board. |
| Researcher Affiliation | Academia | Jing Yu Koh (Carnegie Mellon University), Stephen McAleer (Carnegie Mellon University), Daniel Fried (Carnegie Mellon University), Ruslan Salakhutdinov (Carnegie Mellon University) |
| Pseudocode | Yes | Our search procedure described in Sec. 3.3 is summarized in Algorithm 1. |
| Open Source Code | Yes | Our code and models are publicly released at removed_for_review. |
| Open Datasets | Yes | On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. |
| Dataset Splits | Yes | We run experiments on the full set of 910 VisualWebArena (VWA) and 812 WebArena (WA) tasks. We conduct several ablations on a subset of 200 VWA tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions various language models and techniques used (e.g., GPT-4o, Llama-3-70B-Instruct, nucleus sampling) but does not provide specific ancillary software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Our search parameters are set to d = 5, b = 5, c = 20, and we stop execution after a maximum of 5 actions. We sample actions using nucleus sampling (Holtzman et al., 2020) with a temperature of 1.0 and top-p of 0.95 for all experiments. At each step of execution, we generate 20 outputs from the model by prompting it with CoT reasoning (Wei et al., 2022). We sample 20 different paths from the GPT-4o model using ancestral sampling (temperature of 1.0 and top-p of 1.0). For all experiments, we use a webpage viewport width of 1280, a viewport height of 2048, and truncate text observations to 3840 tokens. |
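To make the reported search parameters concrete, the following is a minimal, hedged sketch of a best-first tree search over agent actions with the stated settings (depth d = 5, branching factor b = 5, budget c = 20). The function names `propose_actions`, `transition`, and `value_fn` are placeholders for the paper's LM action sampler, environment step, and value estimator; they are assumptions for illustration, not the authors' implementation.

```python
import heapq
import itertools

# Search parameters reported in the paper: depth d=5, branching b=5, budget c=20.
D_MAX = 5    # maximum search depth
BRANCH = 5   # candidate actions kept per expansion
BUDGET = 20  # maximum number of node expansions

def best_first_search(root_state, propose_actions, transition, value_fn):
    """Sketch of a best-first search over agent actions.

    `propose_actions(state)` -> list of candidate actions (e.g. LM samples),
    `transition(state, action)` -> next state (e.g. environment step),
    `value_fn(state)` -> scalar estimate, higher is better.
    All three are hypothetical stand-ins for the paper's components.
    """
    counter = itertools.count()  # tie-breaker so heapq never compares states
    # Max-heap via negated values: (neg_value, tiebreak, state, depth, path).
    frontier = [(-value_fn(root_state), next(counter), root_state, 0, [])]
    best_path, best_value = [], value_fn(root_state)
    expansions = 0
    while frontier and expansions < BUDGET:
        neg_v, _, state, depth, path = heapq.heappop(frontier)
        if -neg_v > best_value:
            best_value, best_path = -neg_v, path
        if depth >= D_MAX:
            continue  # depth limit reached; do not expand further
        expansions += 1
        for action in propose_actions(state)[:BRANCH]:
            nxt = transition(state, action)
            heapq.heappush(
                frontier,
                (-value_fn(nxt), next(counter), nxt, depth + 1, path + [action]),
            )
    return best_path, best_value
```

As a toy usage example (a number line where actions add 1, 2, or 3 and the value is the negative distance to a goal of 7), `best_first_search(0, lambda s: [1, 2, 3], lambda s, a: s + a, lambda s: -abs(7 - s))` returns a path whose actions sum to 7 with value 0, well within the 20-expansion budget.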