CREW-Wildfire: Benchmarking Agentic Multi-Agent Collaborations at Scale
Authors: Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. ... We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. |
| Researcher Affiliation | Collaboration | Jonathan Hyun (1), Nicholas R Waytowich (2), Boyuan Chen (1); (1) Duke University, (2) Army Research Laboratory |
| Pseudocode | Yes | A.14 Baseline Pseudo Codes: A.14.1 CAMON (Algorithm 1: CAMON Implementation); A.14.2 COELA (Algorithm 2: COELA Implementation); A.14.3 Embodied (Algorithm 3: Embodied Implementation); A.14.4 HMAS-2 (Algorithm 4: HMAS Implementation) |
| Open Source Code | No | All code, environments, data, and baselines will be released to support future research in this emerging domain. |
| Open Datasets | No | All code, environments, data, and baselines will be released to support future research in this emerging domain. |
| Dataset Splits | Yes | We ran between 3-10 random seeds on all 16 level configurations for a total of 410 trajectories. Seeds and other hyper-parameters are located in A.19. ... Table 9: Seeds Used Across All Levels |
| Hardware Specification | Yes | Experiments were conducted on a laptop with a 3.0 GHz CPU, RTX 3060 GPU, and 16GB RAM. |
| Software Dependencies | Yes | We used GPT-4o as the underlying model, with temperature set to 0 for deterministic results and a single completion per decision step. All experiments used GPT-4o (model ID: gpt-4o-2024-08-06) with a context window of 128,000 tokens. |
| Experiment Setup | Yes | We used GPT-4o as the underlying model, with temperature set to 0 for deterministic results and a single completion per decision step. ... Baseline-specific hyperparameters were set as follows: Embodied Implementation: communication rounds per timestep = 2; message lifespan = 3 timesteps; max messages = 30 per chat history |
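The decoding settings quoted in the rows above can be collected into a single set of request parameters. A minimal sketch follows; since the paper's code has not been released, the exact client call shape (here, the OpenAI Chat Completions API) is an assumption, while the model ID, temperature, and single-completion setting come from the quoted evidence:

```python
# Request parameters matching the reported setup:
# GPT-4o (gpt-4o-2024-08-06), temperature 0 for deterministic
# decoding, one completion per decision step.
GPT4O_CONFIG = {
    "model": "gpt-4o-2024-08-06",  # model ID reported in the paper
    "temperature": 0,              # deterministic results
    "n": 1,                        # single completion per decision step
}

# Assumed usage with the OpenAI Python client (not from the paper):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(messages=[...], **GPT4O_CONFIG)
```

Note that temperature 0 makes decoding greedy but does not strictly guarantee identical outputs across API calls; the paper describes it as yielding deterministic results.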