Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Compositional Instruction Following with Language Models and Reinforcement Learning

Authors: Vanya Cohen, Geraud Nangue Tasse, Nakul Gopalan, Steven James, Matthew Gombolay, Ray Mooney, Benjamin Rosman

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method significantly outperforms the previous best non-compositional baseline in terms of sample complexity on 162 tasks designed to test compositional generalization. Our model attains a higher success rate and learns in fewer steps than the non-compositional baseline. It reaches a success rate equal to an oracle policy's upper-bound performance of 92%. With the same number of environment steps, the baseline only reaches a success rate of 80%. We evaluate our approach in an environment requiring function approximation and demonstrate compositional generalization to novel tasks. We conduct experiments across four agent types and two settings. The first experiment evaluates sample complexity (Figure 3).
Researcher Affiliation | Academia | Vanya Cohen EMAIL The University of Texas at Austin; Geraud Nangue Tasse EMAIL University of the Witwatersrand; Nakul Gopalan EMAIL Arizona State University; Steven James EMAIL University of the Witwatersrand; Matthew Gombolay EMAIL Georgia Institute of Technology; Raymond Mooney EMAIL The University of Texas at Austin; Benjamin Rosman EMAIL University of the Witwatersrand
Pseudocode | No | The paper describes the methods in prose and with diagrams (e.g., Figure 1: Pipeline diagram of the learning process for the CERLLA agent), but does not contain any explicit sections or blocks labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. It mentions using GPT-4 and GPT-3.5 and links to the OpenAI models documentation, which is a third-party tool, but this is not the authors' own code for the paper's methodology.
Open Datasets | Yes | We solve 162 unique tasks within an augmented MiniGrid-BabyAI domain (Chevalier-Boisvert et al., 2023; 2019). To evaluate our method, we select the BabyAI MiniGrid domain (Chevalier-Boisvert et al., 2019), an easily extensible test-bed for compositional language-RL tasks used in many recent language-RL works, including (Carta et al., 2022; Li et al., 2022).
Dataset Splits | Yes | This experiment (Figure 5) measures the generalization performance of each method on held-out tasks. ... In this setting, the set of tasks is randomly split into two halves at the start of training. At each episode, a random task from the first set is selected. During evaluation of our agent, one random task from each set is selected and the agent is evaluated over 100 episodes. The baseline agents are evaluated over all 81 tasks in each set.
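The split protocol quoted above can be sketched in a few lines of Python. The task identifiers, seed, and helper name below are illustrative assumptions; the authors do not release code, so this is a sketch of the described procedure, not their implementation.

```python
import random

# 162 compositional tasks, randomly split into two halves of 81
# at the start of training (illustrative integer task IDs).
tasks = list(range(162))
rng = random.Random(0)  # fixed seed for this sketch only
rng.shuffle(tasks)
train_tasks, held_out_tasks = tasks[:81], tasks[81:]

def sample_training_task():
    """Each training episode draws a random task from the first set."""
    return rng.choice(train_tasks)
```

Held-out evaluation then draws one task from each half per evaluation round, averaged over 100 episodes, while baselines are scored over all 81 tasks in each set.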
Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or cloud computing instance specifications.
Software Dependencies | No | The paper mentions specific LLM models (GPT-4 and GPT-3.5) and a specific model from a library (frozen all-mpnet-base-v2 model from the Sentence Transformers library), but it does not provide specific version numbers for the Sentence Transformers library itself or any other key software dependencies or frameworks (e.g., PyTorch, TensorFlow) used for the implementation.
Experiment Setup | Yes | Table 5: Hyperparameters for the LLM Agent. LLM (GPT-4 and GPT-3.5), Beam Width (10), Rollouts (100 episodes), In-Context Examples (10), Training Temperature (1.0), Evaluation Temperature (0.0). Table 6: Hyperparameters for world value function pretraining. Optimizer (Adam), Learning rate (1e-4), Batch Size (32), Replay Buffer Size (1e3), epsilon init (0.5), epsilon final (0.1), epsilon decay steps (1e6).
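For reference, the hyperparameters quoted from Tables 5 and 6 can be collected into a configuration sketch. The dictionary keys and the linear epsilon schedule below are illustrative assumptions (the quoted tables give init/final values and decay steps but do not state the schedule's shape, and the authors' code is not released).

```python
# Hyperparameters as reported in Tables 5 and 6 of the paper.
# Key names are illustrative, not identifiers from the authors' code.
llm_agent_config = {
    "llm": ["gpt-4", "gpt-3.5"],
    "beam_width": 10,
    "rollouts": 100,               # evaluation episodes
    "in_context_examples": 10,
    "training_temperature": 1.0,
    "evaluation_temperature": 0.0,
}

wvf_pretraining_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 32,
    "replay_buffer_size": int(1e3),
    "epsilon_init": 0.5,
    "epsilon_final": 0.1,
    "epsilon_decay_steps": int(1e6),
}

def epsilon_at(step):
    """Exploration rate at a given environment step, assuming a linear
    decay from epsilon_init to epsilon_final over epsilon_decay_steps
    (the decay shape is an assumption; Table 6 does not specify it)."""
    cfg = wvf_pretraining_config
    frac = min(step / cfg["epsilon_decay_steps"], 1.0)
    return cfg["epsilon_init"] + frac * (cfg["epsilon_final"] - cfg["epsilon_init"])
```

Under this assumed schedule, exploration starts at 0.5, reaches 0.1 after one million steps, and stays at 0.1 thereafter.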