Infer Human’s Intentions Before Following Natural Language Instructions
Authors: Yanming Wan, Yue Wu, Yiping Wang, Jiayuan Mao, Natasha Jaques
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement a set of Transformer-based models and evaluate them over a challenging benchmark, Hand Me That. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on Hand Me That. |
| Researcher Affiliation | Academia | 1 University of Washington, Seattle, WA 98195; 2 MIT CSAIL, Cambridge, MA 02139 |
| Pseudocode | No | The paper describes the model architecture and prediction layers in detail, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/Simon-Wan/FISER |
| Open Datasets | Yes | We evaluate our models over the Hand Me That (version 2) dataset (Wan, Mao, and Tenenbaum 2022). |
| Dataset Splits | No | The paper mentions that "Hand Me That instructions are split into four difficulty levels" and that models are evaluated on a "test set" and across "all difficulty levels", but it does not provide specific percentages, sample counts, or explicit methodology for how the dataset is split into training, validation, and test sets. |
| Hardware Specification | No | The paper discusses implementing Transformer-based models and prompting GPT-4, but it does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions implementing a "Transformer-based model" and using "GPT-4", but it does not provide specific version numbers for any programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | No | The paper describes that the model is "trained in either a multi-staged (MS) or an end-to-end (E2E) manner" and that "All the predictions are trained with cross entropy loss over corresponding supervisions". However, it does not provide specific experimental setup details such as learning rates, batch sizes, number of epochs, optimizer types, or other hyperparameters. |