Efficient Reinforcement Learning with Large Language Model Priors
Authors: Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, Jun Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques; e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios. Extensive experiments on ALFWorld and Overcooked demonstrate that our new framework can significantly boost sample efficiency compared with both pure RL and pure LLM baselines, and also yield a more robust and generalizable value function. |
| Researcher Affiliation | Collaboration | 1Institute of Automation, Chinese Academy of Sciences, Beijing, China 2School of Artificial Intelligence, University of Chinese Academy of Sciences, China 3AI Centre, Department of Computer Science, University College London, London, UK 4University of Bristol, UK 5Huawei Technologies, London, UK |
| Pseudocode | No | The paper describes methods with equations and figures illustrating processes (e.g., Figure 1, Figure 2) but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Source code is available at https://github.com/yanxue7/RL-LLM-Prior. |
| Open Datasets | Yes | We consider three environments: ALFWorld (Shridhar et al., 2020) is a popular benchmark for examining LLM-based agents' decision-making ability... Overcooked: we use the partially observed text Overcooked game (Tan et al., 2024)... Frozen Lake is a grid-world game... |
| Dataset Splits | Yes | For ALFWorld (Pick), we evaluate the online training baselines on 28 tasks, such as "put the cellphone on the armchair," and test the generalization ability on 26 unseen tasks. |
| Hardware Specification | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs. |
| Software Dependencies | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs, based on PyTorch-GPU 2.1 and CUDA 12.4. |
| Experiment Setup | Yes | Our algorithms are trained on one machine with two 40 GB A100 GPUs, based on PyTorch-GPU 2.1 and CUDA 12.4. Tables 6, 7, 8, 9, and 10 report the main hyper-parameters of our algorithms. For all CQL-based algorithms, the hyperparameter β regulating the Q-values (Eq. 7) is set to 5.0. Table 4 (hyperparameters on ALFWorld, Pick) lists, e.g., for DQN-Prior: learning rate 5e-4, 4 epochs, batch size 128, update frequency 5, LLM /, α = 0.01. |
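The table above references an α hyperparameter for DQN-Prior, the coefficient that balances the learned Q-values against the LLM's action prior. As a rough illustration of how such a prior can regularize action selection, the sketch below uses the standard KL-regularized form, where the policy is proportional to p_LLM(a|s)·exp(Q(s,a)/α). The function name and the exact objective are assumptions for illustration, not the paper's implementation.

```python
import math

def prior_regularized_policy(q_values, llm_prior, alpha=0.01):
    """Action distribution proportional to p_LLM(a|s) * exp(Q(s,a)/alpha).

    This KL-regularized form maximizes E_pi[Q] - alpha * KL(pi || p_LLM):
    a small alpha trusts the Q-values, a large alpha trusts the LLM prior.
    Illustrative sketch only; not the paper's exact objective.
    """
    # Combine prior log-probabilities with temperature-scaled Q-values.
    logits = [math.log(max(p, 1e-8)) + q / alpha
              for q, p in zip(q_values, llm_prior)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With α large the policy follows the LLM prior, which is what makes exploration cheap; as α shrinks, the learned Q-values dominate and the agent can override a misleading prior.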