Robotouille: An Asynchronous Planning Benchmark for LLM Agents

Authors: Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce ROBOTOUILLE, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution.
Researcher Affiliation | Academia | Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury (Cornell University). Corresponding author. Email: EMAIL
Pseudocode | No | The paper describes the MDP formulation and JSON structures in Figure 2, but does not contain a distinct pseudocode or algorithm block for any specific method or procedure.
Open Source Code | Yes | Code is available here. All prompts and few-shot examples are located in our codebase here.
Open Datasets | Yes | We introduce ROBOTOUILLE, a simulator for cooking diverse recipes designed to stress test LLM agents (Figure 1). ROBOTOUILLE tests asynchronous planning through tasks that take time, like cooking meat for burgers or sandwiches or filling up a pot with water to cook soup. Its fully customizable JSON backend allows for the addition of new states, actions, and goals, simplifying the creation of diverse long-horizon tasks. Finally, ROBOTOUILLE supports turn-based and real-time multi-agent execution, either locally or on the network. In addition, we provide 3 datasets to test LLM agents' synchronous, asynchronous, and multi-agent planning capabilities. ... Code is available here.
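The customizable JSON backend described above can be pictured as a small task definition. The field names below are illustrative assumptions for this sketch, not the actual Robotouille schema:

```python
import json

# Hypothetical task definition for a JSON-backed cooking simulator.
# All field and state names here are illustrative; the real Robotouille
# schema may differ.
task = {
    "states": ["patty_raw", "patty_cooked", "pot_empty", "pot_filled"],
    "actions": [
        {
            "name": "cook_patty",
            "preconditions": ["patty_raw"],
            "effects": ["patty_cooked"],
            "duration": 5,  # asynchronous: completes after multiple timesteps
        },
        {
            "name": "fill_pot",
            "preconditions": ["pot_empty"],
            "effects": ["pot_filled"],
            "duration": 3,
        },
    ],
    "goal": ["patty_cooked", "pot_filled"],
}

print(json.dumps(task, indent=2))
```

Because actions carry durations, an agent can start `cook_patty`, work on `fill_pot` while the patty cooks, and return later, which is exactly the overlapping-task structure the asynchronous dataset stresses.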
Dataset Splits | No | Each dataset contains 10 unique tasks and has 10 procedurally generated instances. Each baseline receives a single in-context example on a training example excluded from the testing set. While the paper indicates the existence of training examples and testing sets, it does not provide specific percentages or counts for these splits.
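The layout described above (10 tasks, each with 10 procedurally generated instances, plus a held-out training example) could be enumerated as follows; the seeding scheme is an assumption for illustration:

```python
# Hypothetical enumeration of one dataset's layout: 10 unique tasks,
# each instantiated with 10 procedurally generated instances.
NUM_TASKS, NUM_INSTANCES = 10, 10

episodes = [
    (task_id, seed)
    for task_id in range(NUM_TASKS)
    for seed in range(NUM_INSTANCES)
]
assert len(episodes) == 100  # 100 test episodes per dataset

# A separate training instance, excluded from `episodes`, supplies the
# single in-context example given to each baseline.
```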
Hardware Specification | Yes | The following experiments were run on 2 or 4 NVIDIA RTX 6000 Adas using FP8 quantization.
Software Dependencies | No | The paper mentions various LLM models (e.g., gpt-4o, gpt-4o-mini, gemini-1.5-flash, claude-3-haiku, Qwen2-72B-Instruct, Meta-Llama-3.1-70b-Instruct) and a quantization technique (FP8 quantization), but does not provide specific version numbers for general software dependencies like Python, PyTorch, CUDA, or other libraries.
Experiment Setup | Yes | We use temperature 0.7 for all models. Each baseline receives a single in-context example on a training example excluded from the testing set. We use an ablated version of ReAct that only keeps the reasoning and action of the previous timestep in context (along with the base prompt and in-context examples).
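The ablated ReAct setup described above, which keeps only the previous timestep's reasoning and action in context, can be sketched as follows. The helper name, message layout, and "Thought:/Action:" labels are assumptions for illustration, not the paper's actual prompt format:

```python
def build_context(base_prompt, in_context_examples, history):
    """Assemble the prompt for an ablated ReAct-style agent.

    Unlike full ReAct, which appends the entire interaction history,
    this ablation retains only the most recent (reasoning, action) pair,
    so the context length stays bounded over long-horizon episodes.
    """
    parts = [base_prompt] + list(in_context_examples)
    if history:
        last_reasoning, last_action = history[-1]  # previous timestep only
        parts.append(f"Thought: {last_reasoning}")
        parts.append(f"Action: {last_action}")
    return "\n\n".join(parts)


# Usage: earlier steps are dropped; only the latest step survives.
history = [
    ("move to the stove", "move(stove)"),
    ("start cooking the patty", "cook(patty)"),
]
ctx = build_context("You are a cooking agent.", ["Example episode ..."], history)
print(ctx)
```

This keeps per-step prompt length roughly constant, at the cost of discarding long-horizon feedback, which is one of the failure modes the paper analyzes.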