Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling
Authors: Jasmine Bayrooti, Carl Ek, Amanda Prorok
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that optimistic exploration significantly accelerates learning in environments with sparse rewards, action penalties, and difficult-to-explore regions. Furthermore, we provide insights into when optimism is beneficial and emphasize the critical role of model uncertainty in guiding exploration. We evaluate the performance of our proposed algorithm, HOT-GP, on continuous state-action control tasks by measuring the average sum of rewards accumulated during evaluation episodes. First, we compare these approaches on standard MuJoCo benchmark tasks (Todorov et al., 2012) and extended MuJoCo sparse maze tasks from the D4RL suite (Fu et al., 2020) against additional model-free and model-based methods. Then we evaluate these approaches on a practical robotics task using the VMAS simulator (Bettini et al., 2022) and provide an analysis of the individual components. |
| Researcher Affiliation | Academia | Jasmine Bayrooti, University of Cambridge, EMAIL; Carl Henrik Ek, University of Cambridge and Karolinska Institutet, EMAIL; Amanda Prorok, University of Cambridge, EMAIL |
| Pseudocode | Yes | Algorithm 1: Model-based Policy Optimization. 1: Require: max environment steps N, model rollouts M, steps per rollout T, steps per model rollout K, initial state distribution d(s0), policy-learning algorithm PolicySearch. 2: Initialize: policy π, reward-dynamics model p(f), environment dataset Denv. 3: while \|Denv\| < N do 4: /* Simulate Data */ 5: Initialize model dataset Dmodel = ∅ 6: for m = 1, 2, ..., M do 7: Sample ŝ0 uniformly from Denv 8: for k = 1, ..., K do 9: Compute action âk from π(ŝk) 10: Select next state ŝk+1 and reward r̂k using p(f \| Denv) ▷ algorithm-specific 11: Append transition (ŝk, âk, ŝk+1, r̂k) to buffer Dmodel 12: /* Optimize Policy */ 13: π ← PolicySearch(π, Dmodel) 14: /* Optimize Dynamics Model */ 15: Start from initial state s0 ∼ d(s0) 16: for t = 1, 2, ..., T do 17: Compute action at from π(st) 18: Observe next state and reward st+1, rt = f(st, at) 19: Append transition (st, at, st+1, rt) to buffer Denv 20: Retrain model p(f) using Denv |
| Open Source Code | Yes | Code for reproducing the experiments can be found at https://github.com/jbayrooti/hot_gp. |
| Open Datasets | Yes | We apply our method on a set of MuJoCo (Todorov et al., 2012) and VMAS (Bettini et al., 2022) continuous control tasks. First, we compare these approaches on standard MuJoCo benchmark tasks (Todorov et al., 2012) and extended MuJoCo sparse maze tasks from the D4RL suite (Fu et al., 2020) against additional model-free and model-based methods. |
| Dataset Splits | No | The paper describes an online reinforcement learning setup where data is collected through interactions with the environment (MuJoCo, VMAS). It refers to an 'environment dataset Denv' and a 'model dataset Dmodel', which are dynamically updated. This is not a traditional supervised learning scenario where static training, validation, and test splits are predefined and explicitly stated with percentages or counts. The paper does not specify distinct, predefined dataset splits in this context. |
| Hardware Specification | No | The paper mentions 'GPU acceleration' in the context of GPyTorch, but it does not provide specific details about the GPU models, CPU models, or other hardware specifications used for running the experiments. It only states 'Our HOT-GP implementation incurs longer wall clock runtimes due to computational cost of GP training.' |
| Software Dependencies | No | We implement our learning framework with MBRL-Lib (Pineda et al., 2021) for the MuJoCo tasks and TorchRL (Bou et al., 2024) for the coverage task. Additionally, we use GPyTorch (Gardner et al., 2018) to build the GP reward-dynamics model. We use the Adam optimizer in all cases. For further information, please see our open-sourced implementation at this repository: https://github.com/jbayrooti/hot_gp. While the paper lists several software libraries and frameworks, it does not provide specific version numbers for any of them. The years in parentheses refer to the publication year of the respective papers, not the software versions used in this work. |
| Experiment Setup | Yes | Table 1 (task-specific hyperparameters, listed as Half-Cheetah / Reacher / Pusher / Sparse Reacher / Coverage): environment steps N = 250,000 / 20,000 / 20,000 / 37,500 / 200,000; steps per rollout T = 1000 / 150 / 150 / 150 / 150; model rollouts M = adaptive / adaptive / adaptive / adaptive / 150; model rollout steps K = 1 for all tasks; batch size B = 256 / 256 / 256 / 256 / 150; discount factor γ = 0.99 / 0.99 / 0.99 / 0.99 / 0.9; learning rate α = 0.001 / 0.001 / 0.001 / 0.001 / 0.00005; mixing factor τ = 0.005 for all tasks; replay buffer size unlimited / unlimited / unlimited / unlimited / 20,000. In H-UCRL, we use the exploration-exploitation coefficient β = 0.01 and number of samples Z = 5. In MBPO and optimistic MBPO, we use an ensemble of size 7. For DDPG, we decay the exploration noise σexplore from 1 to 0.1. For GP-based approaches, we use the Matérn kernel as the covariance function and learn from 100 inducing points. For the MuJoCo tasks, we use a 4-layer MLP with 200 hidden units per layer and the SiLU activation function. For the VMAS coverage task, we use a 2-layer MLP with 200 hidden units per layer and the Mish activation function. |
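The Algorithm 1 loop quoted in the Pseudocode row can be sketched in runnable form. This is a minimal illustration, not the authors' implementation: the environment step, the learned reward-dynamics model, and `policy_search` are hypothetical callables supplied by the caller, and states are scalars for simplicity.

```python
import random

def mbpo_loop(env_step, policy_search, train_model, sample_model,
              N=200, M=10, K=1, T=20):
    """Sketch of Algorithm 1: alternate model rollouts, policy
    optimization, and real-environment data collection."""
    D_env = []                 # real transitions (s, a, s', r)
    policy = lambda s: 0.0     # trivial initial policy (placeholder)

    while len(D_env) < N:
        # --- Simulate data from the learned model ---
        D_model = []
        for _ in range(M):
            # Sample a start state uniformly from D_env (line 7)
            s = random.choice(D_env)[0] if D_env else 0.0
            for _ in range(K):
                a = policy(s)
                s_next, r = sample_model(s, a)   # algorithm-specific step
                D_model.append((s, a, s_next, r))
                s = s_next
        # --- Optimize policy on model data (line 13) ---
        policy = policy_search(policy, D_model)
        # --- Collect real data, then retrain the model (lines 15-20) ---
        s = 0.0  # initial state s0 ~ d(s0); fixed here for simplicity
        for _ in range(T):
            a = policy(s)
            s_next, r = env_step(s, a)
            D_env.append((s, a, s_next, r))
            s = s_next
        train_model(D_env)
    return policy, D_env
```

In HOT-GP the `sample_model` step is where the GP posterior (and the optimistic sampling) enters; here it is just a stub.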
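The mixing factor τ = 0.005 in Table 1 is the Polyak coefficient used for soft target-network updates in actor-critic methods such as SAC and DDPG. A minimal sketch of that update (function and parameter names are illustrative, not from the paper):

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    With a small tau, the target parameters track the online
    parameters slowly, which stabilizes bootstrapped value targets.
    """
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_params, online_params)]
```

For example, `soft_update([0.0], [1.0], tau=0.005)` moves the target weight only 0.5% of the way toward the online weight per update.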