PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks

Authors: Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John Turner, Eric Undersander, Tsung-Yen Yang

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We analyze state-of-the-art LLMs on PARTNR tasks across the axes of planning, perception, and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.
Researcher Affiliation Industry Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John M. Turner, Eric Undersander, Tsung-Yen Yang. Work done at FAIR, Meta; alphabetical author order.
Pseudocode Yes We follow the procedure described in Geng et al. (2023), constraining token sampling to only select tokens that are consistent with at least one accepting string in the specified grammar. For each call to the LLM we build a grammar which will only accept valid tool calls on observed entities. Below is the base grammar used for tool calls in all experiments. For experiments utilizing a summary of the world representation (i.e., ReAct, Finetuned; see Section 4.1), the perception tools (FindObjectTool, FindReceptacleTool, etc.) are omitted.
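The constrained-sampling procedure described above can be illustrated with a toy sketch. This is not the paper's grammar: it replaces the full grammar of Geng et al. (2023) with a simple prefix check over a hypothetical finite set of accepting tool-call strings, which is enough to show how the token filter works.

```python
# Toy sketch of grammar-constrained decoding. Assumption: a finite set of
# accepting strings stands in for the paper's full tool-call grammar, and
# the "tokens" are illustrative string fragments, not a real tokenizer's.

VALID_CALLS = [
    "FindObjectTool[cup]",        # hypothetical accepting strings
    "FindReceptacleTool[table]",
]

def allowed_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Return the vocab tokens t such that prefix + t is still a prefix
    of at least one accepting string in the grammar."""
    return [t for t in vocab
            if any(s.startswith(prefix + t) for s in VALID_CALLS)]

vocab = ["Find", "Object", "Receptacle", "Tool[", "cup", "table", "]", "xyz"]
print(allowed_tokens("", vocab))      # → ['Find']
print(allowed_tokens("Find", vocab))  # → ['Object', 'Receptacle']
```

At each decoding step the LLM's logits would be masked so that only the tokens returned by `allowed_tokens` can be sampled, guaranteeing that the completed string is a valid tool call.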
Open Source Code Yes All code, datasets, and human demonstrations on PARTNR tasks will be open-sourced. Accompanying this paper, we will release the code and data necessary to reproduce our experiments. Released code includes our PARTNR benchmark tasks, metrics, baseline oracle skills, large planning model framework, and dataset generation utilities. The publicly released codebase accompanying PARTNR will be contained in a public GitHub repository and depend on the most recent version of the AI Habitat platform (habitat-lab and habitat-sim, v0.3.2) (Puig et al., 2024), which it extends to define collaboration tasks and skills.
Open Datasets Yes PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. All code, datasets, and human demonstrations on PARTNR tasks will be open-sourced. Released data includes extensions of the Habitat Synthetic Scenes Dataset (HSSD) (Khanna et al., 2024), generated benchmark task episodes, and model weights for our trained neural network skills and fine-tuned large planning model.
Dataset Splits Yes The PARTNR dataset comprises 100,000 episodes in 37 train scenes, 1,000 episodes in 13 validation scenes, and 1,000 episodes in 10 test scenes from the HSSD dataset (Khanna et al., 2024).
Hardware Specification Yes For all experiments, LLM inference is performed on two Nvidia A100 GPUs using the gpt-fast inference engine (PyTorch, 2023). Inference on Llama-3.1-70B (using tensor parallelism over two A100s) resulted in an average generation speed of 11.43 tokens/s. Each planning step required an average of 52 tokens, resulting in a latency of 4.55 seconds per planning step. ... For hosting 70B models, as used by the ReAct baselines, we use 4 A100 GPUs per model. For hosting the smaller 8B model used by the Finetuned baselines, we use 1 A100 GPU.
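The reported latency is internally consistent, assuming latency is simply tokens per step divided by generation speed:

```python
# Sanity check of the reported figures: 52 tokens/step at 11.43 tokens/s.
tokens_per_second = 11.43
tokens_per_step = 52
latency = tokens_per_step / tokens_per_second
print(round(latency, 2))  # → 4.55 seconds per planning step, as reported
```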
Software Dependencies Yes The publicly released codebase accompanying PARTNR will be contained in a public GitHub repository and depend on the most recent version of the AI Habitat platform (habitat-lab and habitat-sim, v0.3.2) (Puig et al., 2024), which it extends to define collaboration tasks and skills.
Experiment Setup Yes We train the model using successful traces from the ReAct baseline, which obtains the best decentralized results. ... We use a low-rank adapter (Hu et al., 2021) with r = 132, α = 128, dropout = 0.01, on top of the value and query projection layers W_V, W_Q. We train all models on 4 A100 GPUs, with a batch size of 2 per GPU. The models are trained for 40,000 steps, which takes around 24 hours.
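The low-rank adapter setup above can be sketched in a few lines. This is a generic illustration of LoRA (Hu et al., 2021) applied to a frozen projection matrix, not the paper's training code; the dimensions are illustrative, and only the small A/B factors would be trained.

```python
import numpy as np

# Toy LoRA forward pass: the frozen weight W (e.g. W_Q or W_V) is augmented
# with a trainable low-rank update scaled by alpha / r, per Hu et al. (2021).
# Shapes and the rng seed here are illustrative, not the paper's values.

def lora_forward(x, W, A, B, alpha, r):
    """y = x @ (W + (alpha / r) * B @ A).T  -- adapter added to frozen W."""
    delta = (alpha / r) * (B @ A)          # low-rank update, same shape as W
    return x @ (W + delta).T

d, r, alpha = 8, 4, 128
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))            # frozen base projection
A = rng.standard_normal((r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                       # trainable up-projection, zero-init
x = rng.standard_normal((1, d))

# With B zero-initialized, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)
```

Zero-initializing B is the standard LoRA choice: training begins from the frozen model's behavior, and the adapter only departs from it as A and B are updated.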