reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

OvercookedV2: Rethinking Overcooked for Zero-Shot Coordination

Authors: Tobias Gessler, Tin Dizdarevic, Ani Calinescu, Benjamin Ellis, Andrei Lupu, Jakob Foerster

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we investigate the origins of ZSC challenges in Overcooked. We introduce a state-augmentation mechanism that mixes states that might be encountered when paired with unknown partners into the training distribution, reducing the out-of-distribution challenge associated with ZSC. Our results show that ZSC failures can largely be attributed to poor state-coverage rather than more sophisticated coordination challenges. The Overcooked environment is therefore not suitable as a ZSC benchmark. To address these shortcomings, we introduce Overcooked V2, a new version of the benchmark, which includes asymmetric information and stochasticity, facilitating the creation of interesting ZSC scenarios. To validate Overcooked V2, we demonstrate that mere exhaustive state coverage is insufficient to coordinate well. Finally, we use Overcooked V2 to build a new range of coordination challenges, including ones that require test-time protocol formation, and we demonstrate the need for new coordination algorithms that can adapt online.
Researcher Affiliation	Academia	Tobias Gessler, Tin Dizdarevic, Anisoara Calinescu, Benjamin Ellis, Andrei Lupu, Jakob N. Foerster; FLAIR, University of Oxford
Pseudocode	Yes	Algorithm 1 State-Augmented Self-Play Algorithm
Open Source Code	Yes	Available in JaxMARL: https://github.com/FLAIROx/JaxMARL. Experiment code is available at https://github.com/overcookedv2/experiments.
Open Datasets	Yes	The Overcooked benchmark, introduced by Carroll et al. (2020), is based on the popular video game Overcooked. We introduce a novel environment, Overcooked V2, that requires agents to coordinate for high returns. The environment is implemented as part of the popular JaxMARL framework (Rutherford et al., 2023).
Dataset Splits	No	The paper describes training agents using self-play and evaluating their performance in cross-play over a specified number of episodes (e.g., "500 episodes" for evaluation and "10 independent agent pairs" for training), but does not provide traditional fixed dataset splits (e.g., percentages or counts for training, validation, and test sets of a static dataset). The data is generated through interaction with the environment.
Hardware Specification	Yes	Our experiments were conducted on a server equipped with 8 NVIDIA A40 GPUs with 48GB of memory and an AMD EPYC 7513 32-Core Processor.
Software Dependencies	Yes	The models were trained using JAX (Bradbury et al., 2018) and FLAX (Heek et al., 2023).
Experiment Setup	Yes	The same hyperparameters are used for both the standard and state-augmented settings; an overview is provided in Appendix 3. Appendix D provides the hyperparameters used in our experiments. (e.g., Table 3: Hyperparameters for the layouts: Cramped Room, Asymmetric advantages, Coordination Ring, Forced Coordination and Counter Circuit.)