The BrowserGym Ecosystem for Web Agent Research
Authors: Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, Nicolas Chapados
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. |
| Researcher Affiliation | Collaboration | 1ServiceNow Research 2Mila 3Polytechnique Montréal 4Carnegie Mellon University 5McGill University 6Tel Aviv University 7Université de Montréal 8iMean AI |
| Pseudocode | Yes | Figure 7: Pseudo-code for creating a simple web task in BrowserGym, and the corresponding rendering. |
| Open Source Code | Yes | We expand the existing BrowserGym2 library from Drouin et al. (2024) (Section 3), and we provide a unification of existing benchmarks (Section 4) proposed by the scientific community in BrowserGym, ranging from MiniWoB(++) (Shi et al., 2017; Liu et al., 2018) to more recent benchmarks such as AssistantBench (Yoran et al., 2024) and VisualWebArena (Koh et al., 2024a). Despite their differences, all benchmarks are made available through the same, unified BrowserGym interface. We introduce AgentLab3 (Section 5), a set of tools to simplify parallel large-scale experimentation with agents over BrowserGym in a reproducible manner. It also comes with AgentXRay, a visual tool to introspect the behavior of agents on individual tasks. Finally, it provides reusable building blocks to accelerate the development of new agents. Footnotes: 2https://github.com/ServiceNow/BrowserGym 3https://github.com/ServiceNow/AgentLab |
| Open Datasets | Yes | We bring 3 new web agent benchmarks to BrowserGym, namely WebLINX (Lù et al., 2024), VisualWebArena (Koh et al., 2024a) and AssistantBench (Yoran et al., 2024). With these, BrowserGym currently supports six popular web agent benchmarks, listed in Table 1. Each benchmark consists of a set of BrowserGym tasks, accessible as a BrowserGym environment through the gymnasium interface (Figure 8). |
| Dataset Splits | Yes | The metadata also proposes a default train/test split for each benchmark, and an optional dependency graph between tasks, which indicates a (partial) order in which tasks should be executed to avoid inconsistencies when evaluating agents (e.g., for the WebArena (Zhou et al., 2024b) and VisualWebArena (Koh et al., 2024a) benchmarks). ... We use the test splits for WebLINX and AssistantBench. Finally, WorkArena L2 and L3 offer their own curricula, amounting to 235 tasks each. |
| Hardware Specification | Yes | Our experiments were conducted on large-scale compute clusters equipped with Intel(R) Xeon(R) Gold 6126 CPUs @ 2.60GHz and effectively unlimited RAM. ... For example, WebArena experiments were executed on Azure VMs with 8 CPUs and 32GB RAM, which imposed limitations on execution speed and parallelization capabilities. |
| Software Dependencies | No | The paper mentions several software components like Chromium, Playwright, and Gymnasium, and discusses using LLM APIs (OpenAI, Anthropic, Meta), but it does not provide specific version numbers for these libraries or tools, which are critical for reproducing the *implementation* of their methodology. |
| Experiment Setup | Yes | Our experiments use the same agent configuration as the WorkArena++ benchmark (Boisvert et al., 2024), with the addition of the use_think_history setting, which gives the agent access to its entire chain-of-thought history throughout its execution, similarly to Putta et al. (2024). ... Along with this dynamic prompting feature, GenericAgent implements a retry functionality to overcome LLM-side issues or parsing errors. In the case of a parsing error, the LLM is re-prompted and gets 4 attempts to produce a parsable answer. After 4 consecutive parsing errors, the task is considered a failure. ... For MiniWoB and WorkArena L1, we use respectively 5 and 10 seeds per task. |
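The "unified BrowserGym interface" quoted in the Open Datasets row follows the standard Gymnasium reset/step episode loop. The sketch below is a self-contained mock, not the real library: `MockWebTask`, its task id, and the observation fields are illustrative stand-ins for what BrowserGym provides, shown only to make the shape of that loop concrete.

```python
# Self-contained mock illustrating the Gymnasium-style reset/step loop
# through which, per the paper, all six benchmarks are exposed.
# MockWebTask is NOT a real BrowserGym class; it only mimics the interface.
class MockWebTask:
    def __init__(self, task_id, max_steps=3):
        self.task_id = task_id
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, seed=None):
        self.steps = 0
        obs = {"goal": f"complete {self.task_id}", "axtree": "<root>"}
        return obs, {}

    def step(self, action):
        # BrowserGym actions are plain strings, e.g. 'click("14")'.
        self.steps += 1
        terminated = self.steps >= self.max_steps  # task solved at the end
        reward = 1.0 if terminated else 0.0
        obs = {"goal": f"complete {self.task_id}", "axtree": "<root>"}
        return obs, reward, terminated, False, {}


env = MockWebTask("miniwob.click-test")
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step('click("14")')
    total_reward += reward
    done = terminated or truncated
```

An agent plugged into the real library would follow the same loop, swapping the mock for an environment created from a registered BrowserGym task id.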
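The retry behavior described in the Experiment Setup row (re-prompt on parsing error, up to 4 attempts, then count the task as failed) can be sketched as follows. This is a minimal illustration, not GenericAgent's actual code: `query_llm` and `parse` are hypothetical stand-ins for the agent's LLM call and action parser.

```python
# Hedged sketch of the 4-attempt retry-on-parsing-error logic described
# in the paper. query_llm and parse are illustrative placeholders.
MAX_ATTEMPTS = 4

def get_action(prompt, query_llm, parse):
    """Query the LLM until its answer parses, retrying up to MAX_ATTEMPTS
    times. Returns None after 4 consecutive parsing errors (task failure)."""
    for _ in range(MAX_ATTEMPTS):
        raw = query_llm(prompt)
        try:
            return parse(raw)
        except ValueError:
            continue  # parsing error: re-prompt the LLM
    return None
```

For example, with a flaky LLM stub that produces unparsable output twice before answering correctly, `get_action` succeeds on the third attempt; a stub that never parses exhausts all 4 attempts and yields `None`.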