Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination

Authors: Kunal Jha, Wilka Carvalho, Yancheng Liang, Simon Shaolei Du, Max Kleiman-Weiner, Natasha Jaques

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive simulated and human experiments to evaluate the performance of CEC agents against state-of-the-art (SOTA) baselines. Our human study reveals that CEC agents outperform PBT on performance and outperform all methods on subjective measures of cooperation.
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Washington, Seattle, WA; (2) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA.
Pseudocode | Yes | Algorithm 1: Solvable Overcooked Coordination Challenge Generation
Open Source Code | Yes | Code for environment, training, and testing scripts and more can be found at https://kjha02.github.io/publication/cross-env-coop. Our human-AI experiments and surveys utilized the Nice Web RL Python package (https://github.com/wcarvalho/nicewebrl), which leverages Jax's parallelizability to efficiently crowd-source participant data on reinforcement learning environments.
Open Datasets | No | The paper describes how the authors procedurally generate environments for their experiments, e.g., "Our procedural generator creates new coordination challenges in Overcooked, shown in Figure 5." It does not provide access information (link, DOI, or citation) for a fixed, pre-existing dataset.
Dataset Splits | Yes | Note that we hold out those five layouts from the CEC generator, so that when we evaluate CEC on these layouts we are able to test generalization across both partners and tasks. Second, we introduce an additional evaluation setting where we have the Overcooked procedural environment generator create 100 coordination challenges that neither the ST baselines nor any of the CEC agents have seen during training and assess how well the different approaches can generalize to both novel partners and novel environments.
Hardware Specification | No | Jax allows us to run the entire training and evaluation pipeline, from the environment generation to the neural network updating of agents, at 10 million steps per minute on a single GPU. No specific GPU model or other hardware details are provided.
Software Dependencies | No | The paper mentions 'Jax-based' environments and the 'Nice Web RL Python package (https://github.com/wcarvalho/nicewebrl)', but does not provide specific version numbers for Jax, Python, or the Nice Web RL package.
Experiment Setup | Yes | We leverage this speed to train CEC agents, and all other baselines, for 3 billion steps. For each copy of the CEC agent, we perform an additional 100 million steps of training on a single layout with a reduced learning rate, again in self-play using IPPO. We train six seeds for each type of agent. We use the parameters in Table 2 for training all PPO agents: LR 3e-4, NUM_STEPS 256, TOTAL_TIMESTEPS 3e9, UPDATE_EPOCHS 4, NUM_MINIBATCHES 2, GAMMA 0.99, GAE_LAMBDA 0.95, CLIP_EPS 0.2, ENT_COEF 0.005, VF_COEF 1.0, MAX_GRAD_NORM 0.5, ANNEAL_LR True.
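The Dataset Splits entry describes holding out layouts from the CEC generator and evaluating on 100 freshly generated coordination challenges unseen during training. One common way to enforce this kind of leakage-free split is to partition procedural-generator seeds; the function name, seed counts, and use of Python's `random` module below are illustrative assumptions, not the paper's actual API.

```python
import random

def make_splits(n_train=1000, n_eval=100, seed=0):
    """Partition procedural-generator seeds so evaluation layouts
    are never seen during training (illustrative sketch only)."""
    rng = random.Random(seed)
    all_seeds = list(range(n_train + n_eval))
    rng.shuffle(all_seeds)
    train_seeds = set(all_seeds[:n_train])
    eval_seeds = set(all_seeds[n_train:])
    assert train_seeds.isdisjoint(eval_seeds)  # no train/eval leakage
    return train_seeds, eval_seeds

train_seeds, eval_seeds = make_splits()
```

Because layouts are generated deterministically from seeds, disjoint seed sets guarantee disjoint train and evaluation environments.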
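Combining the reported throughput (10 million steps per minute on a single GPU, from the Hardware Specification entry) with the 3 billion training steps in the Experiment Setup entry gives a rough wall-clock estimate. This back-of-the-envelope calculation is ours, not a figure reported in the paper.

```python
steps_total = 3e9        # total training steps reported in the paper
steps_per_minute = 10e6  # reported Jax pipeline throughput on one GPU
minutes = steps_total / steps_per_minute
hours = minutes / 60
# 300.0 minutes, i.e. 5.0 hours of training at the reported rate
```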
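The hyperparameters quoted from the paper's Table 2 can be collected into a single config dictionary. The dictionary values are taken directly from the quote above; the linear-annealing helper is our own sketch of how ANNEAL_LR=True is commonly implemented in PPO codebases, not code from the authors' repository.

```python
# Values quoted from Table 2 of the paper; layout is ours.
PPO_CONFIG = {
    "LR": 3e-4,
    "NUM_STEPS": 256,
    "TOTAL_TIMESTEPS": 3e9,
    "UPDATE_EPOCHS": 4,
    "NUM_MINIBATCHES": 2,
    "GAMMA": 0.99,
    "GAE_LAMBDA": 0.95,
    "CLIP_EPS": 0.2,
    "ENT_COEF": 0.005,
    "VF_COEF": 1.0,
    "MAX_GRAD_NORM": 0.5,
    "ANNEAL_LR": True,
}

def linear_lr(update, num_updates, base_lr=PPO_CONFIG["LR"]):
    """Linearly anneal the learning rate to zero over training.
    A common reading of ANNEAL_LR=True (assumption, not verified)."""
    frac = 1.0 - update / num_updates
    return base_lr * frac
```

With this schedule the learning rate starts at 3e-4 and decays to zero by the final update, matching the typical PPO annealing convention.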