Infer Human’s Intentions Before Following Natural Language Instructions
Authors: Yanming Wan, Yue Wu, Yiping Wang, Jiayuan Mao, Natasha Jaques
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement a set of Transformer-based models and evaluate them over a challenging benchmark, Hand Me That. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on Hand Me That. |
| Researcher Affiliation | Academia | 1 University of Washington, Seattle, WA 98195; 2 MIT CSAIL, Cambridge, MA 02139 |
| Pseudocode | No | The paper describes the model architecture and prediction layers in detail, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/Simon-Wan/FISER |
| Open Datasets | Yes | We evaluate our models over the Hand Me That (version 2) dataset (Wan, Mao, and Tenenbaum 2022). |
| Dataset Splits | No | The paper mentions that "Hand Me That instructions are split into four difficulty levels" and that models are evaluated on a "test set" and across "all difficulty levels", but it does not provide specific percentages, sample counts, or explicit methodology for how the dataset is split into training, validation, and test sets. |
| Hardware Specification | No | The paper discusses implementing Transformer-based models and prompting GPT-4, but it does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions implementing a "Transformer-based model" and using "GPT-4", but it does not provide specific version numbers for any programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | No | The paper describes that the model is "trained in either a multi-staged (MS) or an end-to-end (E2E) manner" and that "All the predictions are trained with cross entropy loss over corresponding supervisions". However, it does not provide specific experimental setup details such as learning rates, batch sizes, number of epochs, optimizer types, or other hyperparameters. |