Zero-shot Model-based Reinforcement Learning using Large Language Models
Authors: Abdelhakim Benechehab, Youssef Attia El Hili, Ambroise Odonnat, Oussama Zekri, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Ievgen Redko, Balázs Kégl
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present proof-of-concept applications in two reinforcement learning settings: model-based policy evaluation and data-augmented off-policy reinforcement learning, supported by theoretical analysis of the proposed methods. Our experiments further demonstrate that our approach produces well-calibrated uncertainty estimates. Table 1: Comparison of different LLMs. Results are averaged over 5 episodes from each of 7 D4RL (Fu et al., 2021) tasks. |
| Researcher Affiliation | Collaboration | Abdelhakim Benechehab 1,2; Youssef Attia El Hili 1; Ambroise Odonnat 1,3; Oussama Zekri 4; Albert Thomas 1; Giuseppe Paolo 1; Maurizio Filippone 5; Ievgen Redko 1; Balázs Kégl 1. (1) Huawei Noah's Ark Lab, Paris, France; (2) Department of Data Science, EURECOM; (3) Inria, Univ. Rennes 2, CNRS, IRISA; (4) ENS Paris-Saclay; (5) Statistics Program, KAUST |
| Pseudocode | Yes | Algorithm 1 ICL_θ (Liu et al., 2024b; Gruver et al., 2023b). Input: time series (x_i)_{i≤t}, LLM p_θ, sub-vocabulary V_num. 1. Tokenize the time series: x̂_t = x_1^1 x_1^2 … x_1^k, … 2. logits ← p_θ(x̂_t). 3. {P(X_{i+1} \| x_i, …, x_0)}_{i≤t} ← softmax(logits(V_num)). Return: {P(X_{i+1} \| x_i, …, x_0)}_{i≤t} |
| Open Source Code | Yes | We release the code at https://github.com/abenechehab/dicl. |
| Open Datasets | Yes | Table 1: Comparison of different LLMs. Results are averaged over 5 episodes from each of 7 D4RL (Fu et al., 2021) tasks. We first observe that the LLM-based dynamics forecasters exhibit a burn-in phase (~70 steps in Fig. 4b) that is necessary for the LLM to gather enough context. For multi-step prediction, Fig. 4a, showing the average MSE over prediction horizons and trajectories, demonstrates that both versions of DICL improve over the vanilla approach and the MLP baseline trained on the context data, in almost all state dimensions. |
| Dataset Splits | Yes | Table 1: Comparison of different LLMs. Results are averaged over 5 episodes from each of 7 D4RL (Fu et al., 2021) tasks. Each model is fed with 5 randomly sampled trajectories of length T = 300 from the D4RL datasets: expert, medium, and random. |
| Hardware Specification | No | The paper mentions using Llama 3 8B model and Llama 3.2-1B model, but does not provide specific details on the hardware (e.g., GPU models, CPU models, memory) used to run the experiments. |
| Software Dependencies | No | This work was made possible thanks to open-source software, including Python (Van Rossum & Drake Jr, 1995), PyTorch (Paszke et al., 2019), Scikit-learn (Pedregosa et al., 2011), and CleanRL (Huang et al., 2022). The specific version numbers for PyTorch, Scikit-learn, and CleanRL are not provided. |
| Experiment Setup | Yes | We specify in Table 2 the complete list of hyperparameters used for every considered environment. Table 2: SAC hyperparameters (HalfCheetah / Hopper / Pendulum): update frequency 1000 / 1000 / 200; learning starts 5000 / 5000 / 1000; batch size 128 / 128 / 64; total timesteps 1e6 / 1e6 / 1e4; gamma γ 0.99 for all; policy learning rate 3e-4 for all. Table 3 shows all DICL-SAC hyperparameter choices for the considered environments. |
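The ICL procedure quoted in the Pseudocode row (tokenize the series, run one forward pass, then softmax the logits restricted to the numeric sub-vocabulary V_num) can be sketched as below. This is a minimal illustration of step 3 only, using random logits in place of a real LLM; the array shapes and the choice of which token ids form V_num are assumptions for illustration, not the released DICL implementation.

```python
import numpy as np

def icl_next_step_distribution(logits, numeric_token_ids):
    """Restrict per-position LLM logits to the numeric sub-vocabulary
    V_num and renormalize with a softmax, yielding
    P(X_{i+1} | x_i, ..., x_0) for each context position."""
    sub_logits = logits[:, numeric_token_ids]            # (T, |V_num|)
    sub_logits = sub_logits - sub_logits.max(axis=1, keepdims=True)  # stability
    probs = np.exp(sub_logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy stand-in for p_theta's output: T=4 context positions over a
# 10-token vocabulary, where ids 2..6 play the role of V_num
# (a hypothetical choice for this sketch).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
v_num = [2, 3, 4, 5, 6]

p = icl_next_step_distribution(logits, v_num)
assert p.shape == (4, 5)
assert np.allclose(p.sum(axis=1), 1.0)
```

Each row of `p` is a proper distribution over the numeric tokens, which is what makes the LLM usable as a calibrated multi-step dynamics forecaster in the paper's setup.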