Robotouille: An Asynchronous Planning Benchmark for LLM Agents

Authors: Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce ROBOTOUILLE, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution.
Researcher Affiliation | Academia | Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury (Cornell University). Corresponding author. Email: EMAIL
Pseudocode | No | The paper describes the MDP formulation and JSON structures in Figure 2, but does not contain a distinct pseudocode or algorithm block for any specific method or procedure.
Open Source Code | Yes | Code is available here. All prompts and few-shot examples are located in our codebase here.
Open Datasets | Yes | We introduce ROBOTOUILLE, a simulator for cooking diverse recipes designed to stress test LLM agents (Figure 1). ROBOTOUILLE tests asynchronous planning through tasks that take time, like cooking meat for burgers or sandwiches or filling up a pot with water to cook soup. Its fully customizable JSON backend allows for the addition of new states, actions, and goals, simplifying the creation of diverse long-horizon tasks. Finally, ROBOTOUILLE supports turn-based and real-time multi-agent execution, either locally or on the network. In addition, we provide 3 datasets to test LLM agents' synchronous, asynchronous, and multi-agent planning capabilities. ... Code is available here.
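The customizable JSON backend described above can be pictured as a small task definition. The field names below are illustrative assumptions for this sketch, not the actual Robotouille schema:

```python
import json

# Hypothetical task definition for a JSON-backed cooking simulator.
# All field and state names here are illustrative; the real Robotouille
# schema may differ.
task = {
    "states": ["patty_raw", "patty_cooked", "pot_empty", "pot_filled"],
    "actions": [
        {
            "name": "cook_patty",
            "preconditions": ["patty_raw"],
            "effects": ["patty_cooked"],
            "duration": 5,  # asynchronous: completes after multiple timesteps
        },
        {
            "name": "fill_pot",
            "preconditions": ["pot_empty"],
            "effects": ["pot_filled"],
            "duration": 3,
        },
    ],
    "goal": ["patty_cooked", "pot_filled"],
}

print(json.dumps(task, indent=2))
```

Because actions carry durations, an agent can start `cook_patty`, work on `fill_pot` while the patty cooks, and return later, which is exactly the overlapping-task structure the asynchronous dataset stresses.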
Dataset Splits | No | Each dataset contains 10 unique tasks and has 10 procedurally generated instances. Each baseline receives a single in-context example on a training example excluded from the testing set. While the paper indicates the existence of training examples and testing sets, it does not provide specific percentages or counts for these splits.
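The layout described above (10 tasks, each with 10 procedurally generated instances, plus a held-out training example) could be enumerated as follows; the seeding scheme is an assumption for illustration:

```python
# Hypothetical enumeration of one dataset's layout: 10 unique tasks,
# each instantiated with 10 procedurally generated instances.
NUM_TASKS, NUM_INSTANCES = 10, 10

episodes = [
    (task_id, seed)
    for task_id in range(NUM_TASKS)
    for seed in range(NUM_INSTANCES)
]
assert len(episodes) == 100  # 100 test episodes per dataset

# A separate training instance, excluded from `episodes`, supplies the
# single in-context example given to each baseline.
```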
Hardware Specification | Yes | The following experiments were run on 2 or 4 NVIDIA RTX 6000 Adas using FP8 quantization.
Software Dependencies | No | The paper mentions various LLM models (e.g., gpt-4o, gpt-4o-mini, gemini-1.5-flash, claude-3-haiku, Qwen2-72B-Instruct, Meta-Llama-3.1-70b-Instruct) and a quantization technique (FP8 quantization), but does not provide specific version numbers for general software dependencies like Python, PyTorch, CUDA, or other libraries.
Experiment Setup | Yes | We use temperature 0.7 for all models. Each baseline receives a single in-context example on a training example excluded from the testing set. We use an ablated version of ReAct that only keeps the reasoning and action of the previous timestep in context (along with the base prompt and in-context examples).
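The ablated ReAct setup described above, which keeps only the previous timestep's reasoning and action in context, can be sketched as follows. The helper name, message layout, and "Thought:/Action:" labels are assumptions for illustration, not the paper's actual prompt format:

```python
def build_context(base_prompt, in_context_examples, history):
    """Assemble the prompt for an ablated ReAct-style agent.

    Unlike full ReAct, which appends the entire interaction history,
    this ablation retains only the most recent (reasoning, action) pair,
    so the context length stays bounded over long-horizon episodes.
    """
    parts = [base_prompt] + list(in_context_examples)
    if history:
        last_reasoning, last_action = history[-1]  # previous timestep only
        parts.append(f"Thought: {last_reasoning}")
        parts.append(f"Action: {last_action}")
    return "\n\n".join(parts)


# Usage: earlier steps are dropped; only the latest step survives.
history = [
    ("move to the stove", "move(stove)"),
    ("start cooking the patty", "cook(patty)"),
]
ctx = build_context("You are a cooking agent.", ["Example episode ..."], history)
print(ctx)
```

This keeps per-step prompt length roughly constant, at the cost of discarding long-horizon feedback, which is one of the failure modes the paper analyzes.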