BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Authors: Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wołczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Foerster, Jack Parker-Holder, Tim Rocktäschel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games... We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. |
| Researcher Affiliation | Collaboration | 1AI Centre, University College London, 2IDEAS NCBR, 3University of Oxford, 4New York University, 5Anthropic, 6University of Warsaw, 7Institute of Mathematics, Polish Academy of Sciences |
| Pseudocode | No | The paper describes methodologies and approaches but does not include any clearly labeled pseudocode or algorithm blocks. It details the benchmark design, evaluation protocols, and results without presenting algorithms in a structured, code-like format. |
| Open Source Code | Yes | We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community. Code and Leaderboard at balrogai.com. ... An open-source toolkit for benchmarking long-context models on BALROG. |
| Open Datasets | Yes | Specifically, we recorded the dungeon levels and experience levels achieved in each game, as well as whether the game resulted in an ascension. Utilizing these statistics, we constructed a data-centric progression system where each data point represents the probability of a human player winning the game after reaching a specific dungeon level or experience level. The resulting progression curves are presented in Figure 11. For practical purposes, we define Dungeon Level 1 (Dlvl:1) and Experience Level 1 as representing 0% progression, corresponding to the game's starting point, and ascension as 100% progression. The agent's overall progress is thus determined by the highest progression achieved between the dungeon level and experience level attained. |
| Dataset Splits | No | The paper primarily describes evaluations on existing reinforcement learning environments (games), which are often procedurally generated. It mentions using 'multiple seeds for each environment to ensure the statistical significance of the results', with standard errors computed using 10 seeds (Crafter), 20 seeds (TextWorld), or 5 seeds (Baba Is AI, MiniHack, and NLE) for evaluation runs, but it does not specify explicit training/validation/test splits of a fixed dataset in the conventional sense of supervised learning tasks. |
| Hardware Specification | No | The paper mentions running experiments and discusses 'budget constraints' for evaluating 'o1-preview', and 'vLLM library' for optimizing throughput by 'efficiently batching generation requests'. However, it does not provide any specific details about the hardware used, such as exact GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'OpenAI, Gemini, and Claude' APIs and the 'vLLM library', as well as the 'NetHack Language Wrapper' and 'BabyAI-Text (Carta et al., 2023)'. However, it does not specify version numbers for these software components or libraries, which is necessary for reproducible software dependencies. |
| Experiment Setup | Yes | These evaluations are intended to serve as baselines for the benchmark. As a result, they probe zero-shot performance only. ... During each timestep of interaction, agents are prompted to output the next action as a natural language string, conditioned on their past interaction history in the environment. ... For all experiments, we use a history length of 16 observations to maintain consistency across tasks. However, participants submitting to this benchmark are allowed to modify the observation history length as needed for their respective models and experiments. ... We use multiple seeds for each environment to ensure the statistical significance of the results. |
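The data-centric progression metric quoted in the Open Datasets row can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the curve values below are placeholders, whereas BALROG estimates them from recorded human games (Figure 11); only the structural rules (level 1 = 0%, ascension = 100%, overall progress = max of the two curves) come from the paper.

```python
# Illustrative sketch of BALROG's data-driven NetHack progression metric.
# The curve values are PLACEHOLDERS; the real curves are estimated from
# human game records. Structure follows the paper: Dlvl:1 and XP level 1
# map to 0% progression, ascension to 100%, and overall progress is the
# max of the dungeon-level and experience-level progressions.

DLVL_PROGRESSION = {1: 0.0, 2: 0.02, 5: 0.10, 10: 0.30, 53: 1.0}   # placeholder
XPLVL_PROGRESSION = {1: 0.0, 2: 0.01, 5: 0.08, 14: 0.40, 30: 1.0}  # placeholder

def interpolate(curve: dict, level: int) -> float:
    """Piecewise-linear interpolation of a progression curve."""
    keys = sorted(curve)
    if level <= keys[0]:
        return curve[keys[0]]
    if level >= keys[-1]:
        return curve[keys[-1]]
    for lo, hi in zip(keys, keys[1:]):
        if lo <= level <= hi:
            t = (level - lo) / (hi - lo)
            return curve[lo] + t * (curve[hi] - curve[lo])

def nethack_progress(dlvl: int, xplvl: int, ascended: bool) -> float:
    """Overall progress: the higher of the two curves; ascension counts as 100%."""
    if ascended:
        return 1.0
    return max(interpolate(DLVL_PROGRESSION, dlvl),
               interpolate(XPLVL_PROGRESSION, xplvl))
```

With these placeholder curves, `nethack_progress(1, 1, False)` is `0.0` (the starting point) and `nethack_progress(anything, anything, True)` is `1.0`, matching the paper's endpoint definitions.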
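The Experiment Setup row describes the baseline interaction protocol: at each timestep the model is prompted with its recent interaction history and asked for the next action as a natural-language string, with a history length of 16 observations. A minimal sketch of that loop, assuming a hypothetical `query_model` stand-in for the actual LLM API call:

```python
from collections import deque

HISTORY_LEN = 16  # observation history length used for the paper's baselines

class ZeroShotAgent:
    """Minimal sketch of the zero-shot agent loop described in the paper:
    at each timestep, the model sees the last HISTORY_LEN observation/action
    pairs plus the current observation, and replies with the next action as
    free-form text. `query_model` is a hypothetical stand-in for an LLM call."""

    def __init__(self, query_model):
        self.query_model = query_model
        self.history = deque(maxlen=HISTORY_LEN)  # older entries are dropped

    def act(self, observation: str) -> str:
        # Build the prompt from the rolling interaction history.
        lines = [f"Observation: {obs}\nAction: {act}" for obs, act in self.history]
        lines.append(f"Observation: {observation}\nAction:")
        action = self.query_model("\n".join(lines)).strip()
        self.history.append((observation, action))
        return action
```

The `deque(maxlen=16)` captures the fixed-length history window; as the table notes, benchmark participants may change this length for their own submissions.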