Tree Search for Language Model Agents
Authors: Jing Yu Koh, Stephen Marcus McAleer, Daniel Fried, Ruslan Salakhutdinov
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that this search procedure is complementary with existing LM agents, and enables these models to perform better on harder and longer-horizon tasks. On VisualWebArena (Koh et al., 2024), search improves the performance of a baseline GPT-4o (OpenAI, 2024) agent by 39.7% relative to the baseline without search, setting a new state-of-the-art (SOTA) success rate of 26.4%. On WebArena (Zhou et al., 2024b), search is also highly effective, contributing a 28.0% relative improvement (yielding a competitive success rate of 19.2%). Our results are summarized in Tab. 2. Introducing search increases success rate substantially across the board. |
| Researcher Affiliation | Academia | Jing Yu Koh (Carnegie Mellon University), Stephen McAleer (Carnegie Mellon University), Daniel Fried (Carnegie Mellon University), Ruslan Salakhutdinov (Carnegie Mellon University) |
| Pseudocode | Yes | Our search procedure described in Sec. 3.3 is summarized in Algorithm 1. |
| Open Source Code | Yes | Our code and models are publicly released at removed_for_review. |
| Open Datasets | Yes | On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. |
| Dataset Splits | Yes | We run experiments on the full set of 910 VisualWebArena (VWA) and 812 WebArena (WA) tasks. We conduct several ablations on a subset of 200 VWA tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions various language models and techniques used (e.g., GPT-4o, Llama-3-70B-Instruct, nucleus sampling) but does not provide specific ancillary software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Our search parameters are set to d = 5, b = 5, c = 20, and we stop execution after a maximum of 5 actions. We sample actions using nucleus sampling (Holtzman et al., 2020) with a temperature of 1.0 and top-p of 0.95 for all experiments. At each step of execution, we generate 20 outputs from the model by prompting it with CoT reasoning (Wei et al., 2022). We sample 20 different paths from the GPT-4o model using ancestral sampling (temperature of 1.0 and top-p of 1.0). For all experiments, we use a webpage viewport width of 1280, a viewport height of 2048, and truncate text observations to 3840 tokens. |
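To make the reported search parameters concrete, the following is a minimal, hedged sketch of a best-first tree search over agent actions with the stated settings (depth d = 5, branching factor b = 5, budget c = 20). The function names `propose_actions`, `transition`, and `value_fn` are placeholders for the paper's LM action sampler, environment step, and value estimator; they are assumptions for illustration, not the authors' implementation.

```python
import heapq
import itertools

# Search parameters reported in the paper: depth d=5, branching b=5, budget c=20.
D_MAX = 5    # maximum search depth
BRANCH = 5   # candidate actions kept per expansion
BUDGET = 20  # maximum number of node expansions

def best_first_search(root_state, propose_actions, transition, value_fn):
    """Sketch of a best-first search over agent actions.

    `propose_actions(state)` -> list of candidate actions (e.g. LM samples),
    `transition(state, action)` -> next state (e.g. environment step),
    `value_fn(state)` -> scalar estimate, higher is better.
    All three are hypothetical stand-ins for the paper's components.
    """
    counter = itertools.count()  # tie-breaker so heapq never compares states
    # Max-heap via negated values: (neg_value, tiebreak, state, depth, path).
    frontier = [(-value_fn(root_state), next(counter), root_state, 0, [])]
    best_path, best_value = [], value_fn(root_state)
    expansions = 0
    while frontier and expansions < BUDGET:
        neg_v, _, state, depth, path = heapq.heappop(frontier)
        if -neg_v > best_value:
            best_value, best_path = -neg_v, path
        if depth >= D_MAX:
            continue  # depth limit reached; do not expand further
        expansions += 1
        for action in propose_actions(state)[:BRANCH]:
            nxt = transition(state, action)
            heapq.heappush(
                frontier,
                (-value_fn(nxt), next(counter), nxt, depth + 1, path + [action]),
            )
    return best_path, best_value
```

As a toy usage example (a number line where actions add 1, 2, or 3 and the value is the negative distance to a goal of 7), `best_first_search(0, lambda s: [1, 2, 3], lambda s, a: s + a, lambda s: -abs(7 - s))` returns a path whose actions sum to 7 with value 0, well within the 20-expansion budget.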