Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling
Authors: Jasmine Bayrooti, Carl Ek, Amanda Prorok
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that optimistic exploration significantly accelerates learning in environments with sparse rewards, action penalties, and difficult-to-explore regions. Furthermore, we provide insights into when optimism is beneficial and emphasize the critical role of model uncertainty in guiding exploration. We evaluate the performance of our proposed algorithm, HOT-GP, on continuous state-action control tasks by measuring the average sum of rewards accumulated during evaluation episodes. First, we compare these approaches on standard MuJoCo benchmark tasks (Todorov et al., 2012) and extended MuJoCo sparse maze tasks from the D4RL suite (Fu et al., 2020) against additional model-free and model-based methods. Then we evaluate these approaches on a practical robotics task using the VMAS simulator (Bettini et al., 2022) and provide an analysis of the individual components. |
| Researcher Affiliation | Academia | Jasmine Bayrooti, University of Cambridge, EMAIL; Carl Henrik Ek, University of Cambridge and Karolinska Institutet, EMAIL; Amanda Prorok, University of Cambridge, EMAIL |
| Pseudocode | Yes | Algorithm 1: Model-based Policy Optimization. 1: Require: max environment steps N, model rollouts M, steps per rollout T, steps per model rollout K, initial state distribution d(s0), policy-learning algorithm PolicySearch. 2: Initialize: policy π, reward-dynamics model p(f), environment dataset Denv. 3: while \|Denv\| < N do 4: /* Simulate Data */ 5: Initialize model dataset Dmodel = ∅ 6: for m = 1, 2, ..., M do 7: Sample ŝ0 uniformly from Denv 8: for k = 1, ..., K do 9: Compute action âk from π(ŝk) 10: Select next state ŝk+1 and reward r̂k using p(f \| Denv) ▷ algorithm-specific 11: Append transition (ŝk, âk, ŝk+1, r̂k) to buffer Dmodel 12: /* Optimize Policy */ 13: π ← PolicySearch(π, Dmodel) 14: /* Optimize Dynamics Model */ 15: Start from initial state s0 ∼ d(s0) 16: for t = 1, 2, ..., T do 17: Compute action at from π(st) 18: Observe next state and reward st+1, rt = f(st, at) 19: Append transition (st, at, st+1, rt) to buffer Denv 20: Retrain model p(f) using Denv |
| Open Source Code | Yes | Code for reproducing the experiments can be found at https://github.com/jbayrooti/hot_gp. |
| Open Datasets | Yes | We apply our method on a set of MuJoCo (Todorov et al., 2012) and VMAS (Bettini et al., 2022) continuous control tasks. First, we compare these approaches on standard MuJoCo benchmark tasks (Todorov et al., 2012) and extended MuJoCo sparse maze tasks from the D4RL suite (Fu et al., 2020) against additional model-free and model-based methods. |
| Dataset Splits | No | The paper describes an online reinforcement learning setup where data is collected through interactions with the environment (MuJoCo, VMAS). It refers to an 'environment dataset Denv' and a 'model dataset Dmodel', which are dynamically updated. This is not a traditional supervised learning scenario where static training, validation, and test splits are predefined and explicitly stated with percentages or counts. The paper does not specify distinct, predefined dataset splits in this context. |
| Hardware Specification | No | The paper mentions 'GPU acceleration' in the context of GPyTorch, but it does not provide specific details about the GPU models, CPU models, or other hardware specifications used for running the experiments. It only states 'Our HOT-GP implementation incurs longer wall clock runtimes due to computational cost of GP training.' |
| Software Dependencies | No | We implement our learning framework with MBRL-Lib (Pineda et al., 2021) for the MuJoCo tasks and TorchRL (Bou et al., 2024) for the coverage task. Additionally, we use GPyTorch (Gardner et al., 2018) to build the GP reward-dynamics model. We use the Adam optimizer in all cases. For further information, please see our open-sourced implementation at this repository: https://github.com/jbayrooti/hot_gp. While the paper lists several software libraries and frameworks, it does not provide specific version numbers for any of them. The years in parentheses refer to the publication year of the respective papers, not the software versions used in this work. |
| Experiment Setup | Yes | Table 1 (task-specific hyperparameters, listed as Half-Cheetah / Reacher / Pusher / Sparse Reacher / Coverage): environment steps N = 250,000 / 20,000 / 20,000 / 37,500 / 200,000; steps per rollout T = 1000 / 150 / 150 / 150 / 150; model rollouts M = adaptive / adaptive / adaptive / adaptive / 150; model rollout steps K = 1 for all tasks; batch size B = 256 / 256 / 256 / 256 / 150; discount factor γ = 0.99 / 0.99 / 0.99 / 0.99 / 0.9; learning rate α = 0.001 / 0.001 / 0.001 / 0.001 / 0.00005; mixing factor τ = 0.005 for all tasks; replay buffer size unlimited / unlimited / unlimited / unlimited / 20,000. In H-UCRL, we use the exploration-exploitation coefficient β = 0.01 and number of samples Z = 5. In MBPO and optimistic MBPO, we use an ensemble of size 7. For DDPG, we decay the exploration noise σexplore from 1 to 0.1. For GP-based approaches, we use the Matérn kernel as the covariance function and learn from 100 inducing points. For the MuJoCo tasks, we use a 4-layer MLP with 200 hidden units per layer and the SiLU activation function. For the VMAS coverage task, we use a 2-layer MLP with 200 hidden units per layer and the Mish activation function. |
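The Algorithm 1 loop quoted in the Pseudocode row can be sketched in runnable form. This is a minimal illustration, not the authors' implementation: the environment step, the learned reward-dynamics model, and `policy_search` are hypothetical callables supplied by the caller, and states are scalars for simplicity.

```python
import random

def mbpo_loop(env_step, policy_search, train_model, sample_model,
              N=200, M=10, K=1, T=20):
    """Sketch of Algorithm 1: alternate model rollouts, policy
    optimization, and real-environment data collection."""
    D_env = []                 # real transitions (s, a, s', r)
    policy = lambda s: 0.0     # trivial initial policy (placeholder)

    while len(D_env) < N:
        # --- Simulate data from the learned model ---
        D_model = []
        for _ in range(M):
            # Sample a start state uniformly from D_env (line 7)
            s = random.choice(D_env)[0] if D_env else 0.0
            for _ in range(K):
                a = policy(s)
                s_next, r = sample_model(s, a)   # algorithm-specific step
                D_model.append((s, a, s_next, r))
                s = s_next
        # --- Optimize policy on model data (line 13) ---
        policy = policy_search(policy, D_model)
        # --- Collect real data, then retrain the model (lines 15-20) ---
        s = 0.0  # initial state s0 ~ d(s0); fixed here for simplicity
        for _ in range(T):
            a = policy(s)
            s_next, r = env_step(s, a)
            D_env.append((s, a, s_next, r))
            s = s_next
        train_model(D_env)
    return policy, D_env
```

In HOT-GP the `sample_model` step is where the GP posterior (and the optimistic sampling) enters; here it is just a stub.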
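The mixing factor τ = 0.005 in Table 1 is the Polyak coefficient used for soft target-network updates in actor-critic methods such as SAC and DDPG. A minimal sketch of that update (function and parameter names are illustrative, not from the paper):

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    With a small tau, the target parameters track the online
    parameters slowly, which stabilizes bootstrapped value targets.
    """
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_params, online_params)]
```

For example, `soft_update([0.0], [1.0], tau=0.005)` moves the target weight only 0.5% of the way toward the online weight per update.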