CREW-Wildfire: Benchmarking Agentic Multi-Agent Collaborations at Scale

Authors: Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. ... We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty.
Researcher Affiliation | Collaboration | Jonathan Hyun (1), Nicholas R Waytowich (2), Boyuan Chen (1); (1) Duke University, (2) Army Research Laboratory
Pseudocode | Yes | A.14 Baseline Pseudo Codes: A.14.1 CAMON (Algorithm 1: CAMON Implementation); A.14.2 COELA (Algorithm 2: COELA Implementation); A.14.3 Embodied (Algorithm 3: Embodied Implementation); A.14.4 HMAS-2 (Algorithm 4: HMAS Implementation)
Open Source Code | No | All code, environments, data, and baselines will be released to support future research in this emerging domain.
Open Datasets | No | All code, environments, data, and baselines will be released to support future research in this emerging domain.
Dataset Splits | Yes | We ran between 3 and 10 random seeds on all 16 level configurations, for a total of 410 trajectories. Seeds and other hyperparameters are located in A.19. ... Table 9: Seeds Used Across All Levels
Hardware Specification | Yes | Experiments were conducted on a laptop with a 3.0 GHz CPU, an RTX 3060 GPU, and 16 GB of RAM.
Software Dependencies | Yes | We used GPT-4o as the underlying model, with temperature set to 0 for deterministic results and a single completion per decision step. All experiments used GPT-4o (model ID: gpt-4o-2024-08-06) with a 128,000-token context window.
Experiment Setup | Yes | We used GPT-4o as the underlying model, with temperature set to 0 for deterministic results and a single completion per decision step. ... Baseline-specific hyperparameters were set as follows. Embodied Implementation: 2 communication rounds per timestep; message lifespan of 3 timesteps; a maximum of 30 messages per chat history.
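The reported settings (temperature 0 with a single completion per decision step, a 3-timestep message lifespan, and a 30-message cap on the chat history) can be sketched as below. This is a minimal illustration under stated assumptions: the `Message` class, `prune_history` function, and the exact pruning rule are hypothetical names and logic, not the authors' released implementation.

```python
from dataclasses import dataclass

# Sampling parameters quoted in the report: deterministic decoding with
# one completion per decision step on gpt-4o-2024-08-06.
LLM_PARAMS = {
    "model": "gpt-4o-2024-08-06",
    "temperature": 0,
    "n": 1,  # single completion per decision step
}

@dataclass
class Message:
    sender: str
    text: str
    timestep: int  # timestep at which the message was sent

def prune_history(history, current_timestep, lifespan=3, max_messages=30):
    """Hypothetical pruning rule: drop messages older than `lifespan`
    timesteps, then keep at most the `max_messages` most recent ones."""
    alive = [m for m in history if current_timestep - m.timestep < lifespan]
    return alive[-max_messages:]
```

Under this rule, a message sent at timestep 0 would expire once the current timestep reaches 3, and the history an agent sees never exceeds 30 messages regardless of how chatty the team is.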