Efficient Reinforcement Learning with Large Language Model Priors

Authors: Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, Jun Wang

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques, e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios." "Extensive experiments on ALFWorld and Overcooked demonstrate that our new framework can significantly boost sample efficiency compared with both pure RL and pure LLM baselines, and also yields a more robust and generalizable value function."
Researcher Affiliation | Collaboration | 1 Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, China; 3 AI Centre, Department of Computer Science, University College London, London, UK; 4 University of Bristol, UK; 5 Huawei Technologies, London, UK
Pseudocode | No | The paper describes its methods with equations and figures illustrating processes (e.g., Figure 1, Figure 2) but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Source code is available at https://github.com/yanxue7/RL-LLM-Prior.
Open Datasets | Yes | We consider three environments: ALFWorld (Shridhar et al., 2020) is a popular benchmark for examining LLM-based agents' decision-making ability... Overcooked: we use the partially observed text Overcooked game (Tan et al., 2024)... Frozen Lake is a grid-world game...
Dataset Splits | Yes | For ALFWorld (Pick), we evaluate the online training baselines on 28 tasks, such as "put the cellphone on the armchair," and test the generalization ability on 26 unseen tasks.
Hardware Specification | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs.
Software Dependencies | Yes | The implementation is based on PyTorch-GPU 2.1 and CUDA 12.4.
Experiment Setup | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs, based on PyTorch-GPU 2.1 and CUDA 12.4. Tables 6-10 report the main hyperparameters of our algorithms. For all CQL-based algorithms, we set the hyperparameter β regulating the Q-values, as shown in Eq. 7, to 5.0. Table 4 reports the hyperparameters on ALFWorld (Pick):

Baseline | Learning Rate | Epochs | Batch Size | Update Frequency | LLM | α
DQN-Prior | 5e-4 | 4 | 128 | 5 | / | 0.01
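The β-regularized Q-value term cited above (Eq. 7 is not reproduced in this excerpt) matches the standard conservative Q-learning (CQL) regularizer, which penalizes the gap between a soft maximum over all actions and the Q-value of the dataset action. A minimal NumPy sketch of that assumed form, with `cql_penalty` as a hypothetical helper name and β = 5.0 as in the reported setup:

```python
import numpy as np

def cql_penalty(q_values, taken_actions, beta=5.0):
    """CQL-style regularizer: beta * mean(logsumexp_a Q(s,a) - Q(s, a_data)).

    q_values:      (batch, n_actions) array of predicted Q-values.
    taken_actions: (batch,) integer indices of the dataset actions.
    """
    # Numerically stable log-sum-exp over the action dimension.
    m = q_values.max(axis=1, keepdims=True)
    lse = m.squeeze(1) + np.log(np.exp(q_values - m).sum(axis=1))
    # Q-values of the actions actually taken in the dataset.
    q_data = q_values[np.arange(len(taken_actions)), taken_actions]
    # Penalty is non-negative; zero only if all mass sits on the data action.
    return beta * (lse - q_data).mean()

# Example: the penalty is added to the ordinary TD loss, pushing down
# Q-values on out-of-distribution actions relative to dataset actions.
q = np.array([[1.0, 2.0, 3.0]])
penalty = cql_penalty(q, np.array([2]), beta=5.0)
```

Larger β (here 5.0) enforces a more conservative value function, which is consistent with the paper's reported gains in offline-learning sample efficiency.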