CREW-Wildfire: Benchmarking Agentic Multi-Agent Collaborations at Scale

Authors: Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. ... We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty.
Researcher Affiliation | Collaboration | Jonathan Hyun (1), Nicholas R Waytowich (2), Boyuan Chen (1); (1) Duke University, (2) Army Research Laboratory
Pseudocode | Yes | A.14 Baseline Pseudo Codes: A.14.1 CAMON (Algorithm 1: CAMON Implementation); A.14.2 COELA (Algorithm 2: COELA Implementation); A.14.3 Embodied (Algorithm 3: Embodied Implementation); A.14.4 HMAS-2 (Algorithm 4: HMAS Implementation)
Open Source Code | No | All code, environments, data, and baselines will be released to support future research in this emerging domain.
Open Datasets | No | All code, environments, data, and baselines will be released to support future research in this emerging domain.
Dataset Splits | Yes | We ran between 3 and 10 random seeds on all 16 level configurations, for a total of 410 trajectories. Seeds and other hyperparameters are located in A.19. ... Table 9: Seeds Used Across All Levels
Hardware Specification | Yes | Experiments were conducted on a laptop with a 3.0 GHz CPU, an RTX 3060 GPU, and 16 GB of RAM.
Software Dependencies | Yes | We used GPT-4o as the underlying model, with temperature set to 0 for deterministic results and a single completion per decision step. All experiments used GPT-4o (model ID: gpt-4o-2024-08-06) with a 128,000-token context window.
Experiment Setup | Yes | We used GPT-4o as the underlying model, with temperature set to 0 for deterministic results and a single completion per decision step. ... Baseline-specific hyperparameters were set as follows. Embodied Implementation: 2 communication rounds per timestep; message lifespan of 3 timesteps; a maximum of 30 messages per chat history.
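The reported settings (temperature 0 with a single completion per decision step, a 3-timestep message lifespan, and a 30-message cap on the chat history) can be sketched as below. This is a minimal illustration under stated assumptions: the `Message` class, `prune_history` function, and the exact pruning rule are hypothetical names and logic, not the authors' released implementation.

```python
from dataclasses import dataclass

# Sampling parameters quoted in the report: deterministic decoding with
# one completion per decision step on gpt-4o-2024-08-06.
LLM_PARAMS = {
    "model": "gpt-4o-2024-08-06",
    "temperature": 0,
    "n": 1,  # single completion per decision step
}

@dataclass
class Message:
    sender: str
    text: str
    timestep: int  # timestep at which the message was sent

def prune_history(history, current_timestep, lifespan=3, max_messages=30):
    """Hypothetical pruning rule: drop messages older than `lifespan`
    timesteps, then keep at most the `max_messages` most recent ones."""
    alive = [m for m in history if current_timestep - m.timestep < lifespan]
    return alive[-max_messages:]
```

Under this rule, a message sent at timestep 0 would expire once the current timestep reaches 3, and the history an agent sees never exceeds 30 messages regardless of how chatty the team is.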