CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives

Authors: Armin Saghafian, Amirmohammad Izadi, Negin Hashemi Dijujin, Mahdieh Soleymani Baghshah

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results of our experiments suggest superior sample efficiency and systematic generalization for this framework in multi-modal reinforcement learning problems. Our code base is available here. Our experiments on the MiniGrid and BabyAI environments (Chevalier-Boisvert et al., 2018) showcase the idea's effectiveness in improving the systematic generalization and sample efficiency of instruction-following agents.
Researcher Affiliation | Academia | Armin Saghafian, Sharif University of Technology, EMAIL; Amirmohammad Izadi, Sharif University of Technology, EMAIL; Negin Hashemi Dijujin, Sharif University of Technology, EMAIL; Mahdieh Soleymani Baghshah, Sharif University of Technology, EMAIL
Pseudocode | Yes | Algorithm 1: CAREL framework; Algorithm 2: Instruction Tracking (IT) framework
Open Source Code | Yes | The results of our experiments suggest superior sample efficiency and systematic generalization for this framework in multi-modal reinforcement learning problems. Our code base is available here.
Open Datasets | Yes | Our experiments on the MiniGrid and BabyAI environments (Chevalier-Boisvert et al., 2018) showcase the idea's effectiveness in improving the systematic generalization and sample efficiency of instruction-following agents. We employ the BabyAI environment (Chevalier-Boisvert et al., 2018), a lightweight but logically complex benchmark with procedurally generated difficulty levels, which enables in-depth exploration of grounded language learning in the goal-conditioned RL context.
Dataset Splits | No | The paper mentions evaluating on "unseen tasks" and reporting success rates, but does not provide specific percentages or counts for training/test/validation splits, nor does it explicitly reference predefined standard splits with detailed methodology. It states, "We report the agent's success rate (SR) over a set of unseen tasks at each BabyAI level, separated by pairs of color and type of target objects or specific orders of objects in the instruction."
Hardware Specification | Yes | For the experiments reported in this paper, we have used one NVIDIA 3090 GPU and one TITAN RTX GPU over two weeks.
Software Dependencies | No | The paper mentions using the PPO algorithm and Adam optimizer, as well as BERT's tokenizer, but does not specify version numbers for any software libraries or frameworks such as Python, PyTorch, or TensorFlow. For example: "Its base model is trained using the PPO algorithm (Schulman et al., 2017) and Adam optimizer with parameters β1 = 0.9 and β2 = 0.999."
Experiment Setup | Yes | The learning rate is 7e-4, and the batch size is 256. We set λC = 0.01 and the temperature τ = 1 as CAREL-specific hyperparameters. The actor-critic model from the SHELM model was also used as a baseline. We train the learnable parts of the model using the PPO algorithm and Adam optimizer with the same hyperparameters. The learning rate is 1e-4, and the batch size is set to 16.
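For a reproduction attempt, the hyperparameters quoted above can be gathered in one place. The sketch below is a minimal Python summary: the variable names and the split into two configurations are my own reading of the quoted text (assuming the 7e-4/256 setting belongs to the main CAREL model and the 1e-4/16 setting to the SHELM-based baseline), not something the paper lays out as code.

```python
# Hyperparameters as quoted in the Experiment Setup row.
# Grouping and key names are assumptions for illustration only.

CAREL_CONFIG = {
    "algorithm": "PPO",          # Schulman et al. (2017)
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.999),  # beta1, beta2 from the Software Dependencies quote
    "learning_rate": 7e-4,
    "batch_size": 256,
    "lambda_c": 0.01,            # weight of the CAREL auxiliary objective (lambda_C)
    "temperature": 1.0,          # contrastive temperature tau
}

SHELM_BASELINE_CONFIG = {
    "algorithm": "PPO",          # same algorithm and optimizer as above
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.999),
    "learning_rate": 1e-4,
    "batch_size": 16,
}
```

Collecting the values this way makes it easy to spot that the two settings differ only in learning rate and batch size, which is what the quoted setup implies.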