Zero-shot Model-based Reinforcement Learning using Large Language Models
Authors: Abdelhakim Benechehab, Youssef Attia El Hili, Ambroise Odonnat, Oussama Zekri, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Ievgen Redko, Balázs Kégl
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present proof-of-concept applications in two reinforcement learning settings: model-based policy evaluation and data-augmented off-policy reinforcement learning, supported by theoretical analysis of the proposed methods. Our experiments further demonstrate that our approach produces well-calibrated uncertainty estimates. Table 1: Comparison of different LLMs. Results are averaged over 5 episodes from each of 7 D4RL (Fu et al., 2021) tasks. |
| Researcher Affiliation | Collaboration | Abdelhakim Benechehab 1,2; Youssef Attia El Hili 1; Ambroise Odonnat 1,3; Oussama Zekri 4; Albert Thomas 1; Giuseppe Paolo 1; Maurizio Filippone 5; Ievgen Redko 1; Balázs Kégl 1. (1) Huawei Noah's Ark Lab, Paris, France; (2) Department of Data Science, EURECOM; (3) Inria, Univ. Rennes 2, CNRS, IRISA; (4) ENS Paris-Saclay; (5) Statistics Program, KAUST |
| Pseudocode | Yes | Algorithm 1 ICL_θ (Liu et al., 2024b; Gruver et al., 2023b). Input: time series (x_i)_{i≤t}, LLM p_θ, sub-vocabulary V_num. 1. Tokenize the time series: x̂_t = x_1^1 x_1^2 … x_1^k, … 2. logits ← p_θ(x̂_t). 3. {P(X_{i+1} \| x_i, …, x_0)}_{i≤t} ← softmax(logits(V_num)). Return: {P(X_{i+1} \| x_i, …, x_0)}_{i≤t} |
| Open Source Code | Yes | We release the code at https://github.com/abenechehab/dicl. |
| Open Datasets | Yes | Table 1: Comparison of different LLMs. Results are averaged over 5 episodes from each of 7 D4RL (Fu et al., 2021) tasks. We first observe that the LLM-based dynamics forecasters exhibit a burn-in phase (~70 steps in Fig. 4b) that is necessary for the LLM to gather enough context. For multi-step prediction, Fig. 4a, showing the average MSE over prediction horizons and trajectories, demonstrates that both versions of DICL improve over the vanilla approach and the MLP baseline trained on the context data, in almost all state dimensions. |
| Dataset Splits | Yes | Table 1: Comparison of different LLMs. Results are averaged over 5 episodes from each of 7 D4RL (Fu et al., 2021) tasks. Each model is fed with 5 randomly sampled trajectories of length T = 300 from the D4RL datasets: expert, medium, and random. |
| Hardware Specification | No | The paper mentions using Llama 3 8B model and Llama 3.2-1B model, but does not provide specific details on the hardware (e.g., GPU models, CPU models, memory) used to run the experiments. |
| Software Dependencies | No | This work was made possible thanks to open-source software, including Python (Van Rossum & Drake Jr, 1995), PyTorch (Paszke et al., 2019), Scikit-learn (Pedregosa et al., 2011), and CleanRL (Huang et al., 2022). The specific version numbers for PyTorch, Scikit-learn, and CleanRL are not provided. |
| Experiment Setup | Yes | We specify in Table 2 the complete list of hyperparameters used for every considered environment. Table 2: SAC hyperparameters (HalfCheetah / Hopper / Pendulum): update frequency 1000 / 1000 / 200; learning starts 5000 / 5000 / 1000; batch size 128 / 128 / 64; total timesteps 1e6 / 1e6 / 1e4; gamma γ 0.99 for all; policy learning rate 3e-4 for all. Table 3 shows all DICL-SAC hyperparameter choices for the considered environments. |
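The ICL procedure quoted in the Pseudocode row (tokenize the series, run one forward pass, then softmax the logits restricted to the numeric sub-vocabulary V_num) can be sketched as below. This is a minimal illustration of step 3 only, using random logits in place of a real LLM; the array shapes and the choice of which token ids form V_num are assumptions for illustration, not the released DICL implementation.

```python
import numpy as np

def icl_next_step_distribution(logits, numeric_token_ids):
    """Restrict per-position LLM logits to the numeric sub-vocabulary
    V_num and renormalize with a softmax, yielding
    P(X_{i+1} | x_i, ..., x_0) for each context position."""
    sub_logits = logits[:, numeric_token_ids]            # (T, |V_num|)
    sub_logits = sub_logits - sub_logits.max(axis=1, keepdims=True)  # stability
    probs = np.exp(sub_logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy stand-in for p_theta's output: T=4 context positions over a
# 10-token vocabulary, where ids 2..6 play the role of V_num
# (a hypothetical choice for this sketch).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
v_num = [2, 3, 4, 5, 6]

p = icl_next_step_distribution(logits, v_num)
assert p.shape == (4, 5)
assert np.allclose(p.sum(axis=1), 1.0)
```

Each row of `p` is a proper distribution over the numeric tokens, which is what makes the LLM usable as a calibrated multi-step dynamics forecaster in the paper's setup.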