The BrowserGym Ecosystem for Web Agent Research
Authors: Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, Nicolas Chapados
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. |
| Researcher Affiliation | Collaboration | 1ServiceNow Research 2Mila 3Polytechnique Montréal 4Carnegie Mellon University 5McGill University 6Tel Aviv University 7Université de Montréal 8iMean AI |
| Pseudocode | Yes | Figure 7: Pseudo-code for creating a simple web task in BrowserGym, and the corresponding rendering. |
| Open Source Code | Yes | We expand the existing BrowserGym2 library from Drouin et al. (2024) (Section 3), and we provide a unification of existing benchmarks (Section 4) proposed by the scientific community in BrowserGym, ranging from MiniWoB(++) (Shi et al., 2017; Liu et al., 2018) to more recent benchmarks such as AssistantBench (Yoran et al., 2024) and VisualWebArena (Koh et al., 2024a). Despite their differences, all benchmarks are made available through the same, unified BrowserGym interface. We introduce AgentLab3 (Section 5), a set of tools to simplify parallel large-scale experimentation with agents over BrowserGym in a reproducible manner. It also comes with AgentXRay, a visual tool to introspect the behavior of agents on individual tasks. Finally, it provides reusable building blocks to accelerate the development of new agents. Footnotes: 2https://github.com/ServiceNow/BrowserGym 3https://github.com/ServiceNow/AgentLab |
| Open Datasets | Yes | We bring 3 new web agent benchmarks to BrowserGym, namely WebLINX (Lù et al., 2024), VisualWebArena (Koh et al., 2024a) and AssistantBench (Yoran et al., 2024). With these, BrowserGym currently supports six popular web agent benchmarks, listed in Table 1. Each benchmark consists of a set of BrowserGym tasks, accessible as a BrowserGym environment through the gymnasium interface (Figure 8). |
| Dataset Splits | Yes | The metadata also proposes a default train/test split for each benchmark, and an optional dependency graph between tasks, which indicates a (partial) order in which tasks should be executed to avoid inconsistencies when evaluating agents (e.g., for the WebArena (Zhou et al., 2024b) and VisualWebArena (Koh et al., 2024a) benchmarks). ... We use the test splits for WebLINX and AssistantBench. Finally, WorkArena L2 and L3 offer their own curricula, amounting to 235 tasks each. |
| Hardware Specification | Yes | Our experiments were conducted on large-scale compute clusters equipped with Intel(R) Xeon(R) Gold 6126 CPUs @ 2.60GHz and effectively unlimited RAM. ... For example, WebArena experiments were executed on Azure VMs with 8 CPUs and 32GB RAM, which imposed limitations on execution speed and parallelization capabilities. |
| Software Dependencies | No | The paper mentions several software components like Chromium, Playwright, and Gymnasium, and discusses using LLM APIs (OpenAI, Anthropic, Meta), but it does not provide specific version numbers for these libraries or tools, which are critical for reproducing the *implementation* of their methodology. |
| Experiment Setup | Yes | Our experiments use the same agent configuration as the WorkArena++ benchmark (Boisvert et al., 2024), with the addition of the use_think_history setting, which gives the agent access to its entire chain-of-thought history throughout its execution, similarly to Putta et al. (2024). ... Along with this dynamic prompting feature, GenericAgent implements a retry functionality to overcome LLM-side issues or parsing errors. In the case of a parsing error, the LLM is re-prompted and gets 4 attempts to produce a parsable answer. After 4 consecutive parsing errors, the task is considered a failure. ... For MiniWoB and WorkArena L1, we use respectively 5 and 10 seeds per task. |
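The "unified BrowserGym interface" quoted in the Open Datasets row follows the standard Gymnasium reset/step episode loop. The sketch below is a self-contained mock, not the real library: `MockWebTask`, its task id, and the observation fields are illustrative stand-ins for what BrowserGym provides, shown only to make the shape of that loop concrete.

```python
# Self-contained mock illustrating the Gymnasium-style reset/step loop
# through which, per the paper, all six benchmarks are exposed.
# MockWebTask is NOT a real BrowserGym class; it only mimics the interface.
class MockWebTask:
    def __init__(self, task_id, max_steps=3):
        self.task_id = task_id
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, seed=None):
        self.steps = 0
        obs = {"goal": f"complete {self.task_id}", "axtree": "<root>"}
        return obs, {}

    def step(self, action):
        # BrowserGym actions are plain strings, e.g. 'click("14")'.
        self.steps += 1
        terminated = self.steps >= self.max_steps  # task solved at the end
        reward = 1.0 if terminated else 0.0
        obs = {"goal": f"complete {self.task_id}", "axtree": "<root>"}
        return obs, reward, terminated, False, {}


env = MockWebTask("miniwob.click-test")
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step('click("14")')
    total_reward += reward
    done = terminated or truncated
```

An agent plugged into the real library would follow the same loop, swapping the mock for an environment created from a registered BrowserGym task id.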
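The retry behavior described in the Experiment Setup row (re-prompt on parsing error, up to 4 attempts, then count the task as failed) can be sketched as follows. This is a minimal illustration, not GenericAgent's actual code: `query_llm` and `parse` are hypothetical stand-ins for the agent's LLM call and action parser.

```python
# Hedged sketch of the 4-attempt retry-on-parsing-error logic described
# in the paper. query_llm and parse are illustrative placeholders.
MAX_ATTEMPTS = 4

def get_action(prompt, query_llm, parse):
    """Query the LLM until its answer parses, retrying up to MAX_ATTEMPTS
    times. Returns None after 4 consecutive parsing errors (task failure)."""
    for _ in range(MAX_ATTEMPTS):
        raw = query_llm(prompt)
        try:
            return parse(raw)
        except ValueError:
            continue  # parsing error: re-prompt the LLM
    return None
```

For example, with a flaky LLM stub that produces unparsable output twice before answering correctly, `get_action` succeeds on the third attempt; a stub that never parses exhausts all 4 attempts and yields `None`.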