Implicitly Aligning Humans and Autonomous Agents through Shared Task Abstractions
Authors: Stéphane Aroca-Ouellette, Miguel Aroca-Ouellette, Katharina von der Wense, Alessandro Roncone
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate HA2 in the Overcooked environment, demonstrating statistically significant improvement over baselines when paired with both unseen agents and humans, providing better resilience to environmental shifts, and outperforming state-of-the-art methods. [...] HA2 offers statistically significant advantages, highlighted by the following contributions: 1) HA2 outperforms all baselines by over 18.0% when paired with a set of unseen agents, and 2) by over 18.3% when paired with humans. Moreover, 3) HA2 is significantly preferred by humans, and found to be more fluent, trusted, and cooperative than baselines. To further test the generalizability provided by hierarchical structures, we test the agents zero-shot on modified versions of the game layouts and show that 4) HA2 is more robust to environmental changes, outperforming baselines by more than 10.5x on these layouts. |
| Researcher Affiliation | Academia | Stéphane Aroca-Ouellette¹, Miguel Aroca-Ouellette², Katharina von der Wense¹,³ and Alessandro Roncone¹ (¹University of Colorado Boulder; ²Independent Researcher; ³Johannes-Gutenberg Universität Mainz) |
| Pseudocode | No | The paper describes the methodology using prose and diagrams (e.g., Fig. 2 for the HA2 architecture) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/HIRO-group/HA2. |
| Open Datasets | Yes | Following prior work in zero-shot human-AI teaming [Carroll et al., 2019; Strouse et al., 2021; Aroca-Ouellette et al., 2023], we study the use of hierarchical structures using all five layouts in the Overcooked environment developed by [Carroll et al., 2019]. To train the BC models, we closely follow the implementation in [Carroll et al., 2019], using their feature encoding as observation as well as their provided data. |
| Dataset Splits | Yes | Following [Carroll et al., 2019], we divide the data in half and train two models. The better model is used as the human proxy, and the worse model as the BC model. |
| Hardware Specification | Yes | To keep a fair comparison, we train each agent for 48 hours using the same V100 GPU. |
| Software Dependencies | No | The paper mentions various deep learning techniques (e.g., PPO, CNN, MLP, recurrent networks, frame stacking) but does not provide specific software names with version numbers, such as PyTorch, TensorFlow, or Python versions. |
| Experiment Setup | Yes | To keep a fair comparison, we train each agent for 48 hours using the same V100 GPU. For HA2, we use 24 hours for the Worker and 24 hours for the Manager. The 48 hours equate to 119 million timesteps for BCP, 119 million timesteps for FCP, and 66 million timesteps for HA2 (31 million for the Worker and 35 million for the Manager). We train five iterations of each of the four agents using different random seeds and report the mean and standard error across seeds. To train the BC models, [...] The BC agents were trained for 300 epochs. Each population agent was trained for 10 million in-game steps. [...] We flatten the observation and pass it through a two-layer multilayer perceptron (MLP). The reward structure is the same as the base environment with a reward of 20 for each soup served. |
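The observation encoding quoted above ("flatten the observation and pass it through a two-layer multilayer perceptron") can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer width (64), ReLU activation, observation shape, and action count (6) are all assumptions, as the paper does not specify them in the quoted passage.

```python
import numpy as np


def init_mlp(obs_dim, hidden_dim, out_dim, rng):
    """Initialize a two-layer MLP: obs -> hidden (ReLU) -> output logits."""
    w1 = rng.standard_normal((obs_dim, hidden_dim)) * np.sqrt(2.0 / obs_dim)
    b1 = np.zeros(hidden_dim)
    w2 = rng.standard_normal((hidden_dim, out_dim)) * np.sqrt(2.0 / hidden_dim)
    b2 = np.zeros(out_dim)
    return (w1, b1, w2, b2)


def mlp_forward(params, obs):
    """Flatten the observation (as the paper describes) and run the MLP."""
    w1, b1, w2, b2 = params
    x = np.ravel(obs)                 # flatten grid observation to a vector
    h = np.maximum(0.0, x @ w1 + b1)  # hidden layer; ReLU is an assumption
    return h @ w2 + b2                # logits over the action space


rng = np.random.default_rng(0)
params = init_mlp(obs_dim=96, hidden_dim=64, out_dim=6, rng=rng)
obs = rng.standard_normal((4, 24))    # hypothetical 2-D observation, 96 values
logits = mlp_forward(params, obs)
print(logits.shape)                   # (6,)
```

In the actual system these logits would feed a PPO policy head; the 2-D observation here is a stand-in for whatever feature encoding the released code uses.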