Efficient Reinforcement Learning with Large Language Model Priors
Authors: Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, Jun Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques; e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios. Extensive experiments on ALFWorld and Overcooked demonstrate that our new framework can significantly boost sample efficiency compared with both pure RL and pure LLM baselines, and also yield a more robust and generalizable value function. |
| Researcher Affiliation | Collaboration | 1Institute of Automation, Chinese Academy of Sciences, Beijing, China 2School of Artificial Intelligence, University of Chinese Academy of Sciences, China 3AI Centre, Department of Computer Science, University College London, London, UK 4University of Bristol, UK 5Huawei Technologies, London, UK |
| Pseudocode | No | The paper describes methods with equations and figures illustrating processes (e.g., Figure 1, Figure 2) but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Source code is available at https://github.com/yanxue7/RL-LLM-Prior. |
| Open Datasets | Yes | We consider three environments: ALFWorld (Shridhar et al., 2020) is a popular benchmark for examining LLM-based agents' decision-making ability... Overcooked: we use the partially observed text Overcooked game (Tan et al., 2024)... Frozen Lake is a grid-world game... |
| Dataset Splits | Yes | For ALFWorld (Pick), we evaluate the online training baselines on 28 tasks, such as "put the cellphone on the armchair," and test the generalization ability on 26 unseen tasks. |
| Hardware Specification | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs. |
| Software Dependencies | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs, based on PyTorch-GPU 2.1 and CUDA 12.4. |
| Experiment Setup | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs, based on PyTorch-GPU 2.1 and CUDA 12.4. Tables 6, 7, 8, 9, and 10 report the main hyper-parameters of our algorithms. For all CQL-based algorithms, the hyperparameter β regulating the Q-values (Eq. 7) is set to 5.0. Table 4 (hyperparameters on ALFWorld, Pick) lists, e.g., for DQN-Prior: learning rate 5e-4, 4 epochs, batch size 128, update frequency 5, LLM /, α = 0.01. |
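The table above references an α hyperparameter for DQN-Prior, the coefficient that balances the learned Q-values against the LLM's action prior. As a rough illustration of how such a prior can regularize action selection, the sketch below uses the standard KL-regularized form, where the policy is proportional to p_LLM(a|s)·exp(Q(s,a)/α). The function name and the exact objective are assumptions for illustration, not the paper's implementation.

```python
import math

def prior_regularized_policy(q_values, llm_prior, alpha=0.01):
    """Action distribution proportional to p_LLM(a|s) * exp(Q(s,a)/alpha).

    This KL-regularized form maximizes E_pi[Q] - alpha * KL(pi || p_LLM):
    a small alpha trusts the Q-values, a large alpha trusts the LLM prior.
    Illustrative sketch only; not the paper's exact objective.
    """
    # Combine prior log-probabilities with temperature-scaled Q-values.
    logits = [math.log(max(p, 1e-8)) + q / alpha
              for q, p in zip(q_values, llm_prior)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With α large the policy follows the LLM prior, which is what makes exploration cheap; as α shrinks, the learned Q-values dominate and the agent can override a misleading prior.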