On Efficient Bayesian Exploration in Model-Based Reinforcement Learning

Authors: Alberto Caron, Vasilios Mavroudis, Chris Hicks

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We empirically demonstrate that PTS-BE substantially outperforms other baselines across a variety of environments characterized by sparse rewards and/or purely exploratory tasks." |
| Researcher Affiliation | Academia | Alberto Caron (EMAIL), The Alan Turing Institute, London, UK; Chris Hicks (EMAIL), The Alan Turing Institute, London, UK; Vasilios Mavroudis (EMAIL), The Alan Turing Institute, London, UK |
| Pseudocode | Yes | Algorithm 1: Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor a link to a code repository; it focuses solely on the methodology and experimental results. |
| Open Datasets | Yes | Point Maze environments: four types of 2D Gymnasium Robotics Point Maze structures (Fu et al., 2020) of increasing complexity. Ant Maze environments: two maze environments from the Gymnasium Robotics suite with higher-dimensional state-action spaces, likewise originally introduced in Fu et al. (2020). |
| Dataset Splits | No | The experiments use standard RL environments ('Mountain Car', 'Unichain', 'Point Maze', 'Ant Maze'), which have predefined dynamics rather than fixed datasets. No training/validation/test splits are reported; results are given only as the cumulative fraction of visited states or average returns over 20 replications. |
| Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., GPU models, CPU types, memory); it only describes the algorithms and environments. |
| Software Dependencies | No | The paper names specific software components, e.g., Proximal Policy Optimization (PPO) (Schulman et al., 2017), Soft Actor-Critic (SAC) (Haarnoja et al., 2018), and Gymnasium Robotics Point Maze structures (Fu et al., 2020), but it gives no version numbers for these or any other libraries/frameworks, which reproducibility would require. |
| Experiment Setup | Yes | Two PTS-BE specifications are reported: a horizon of J = 100 steps ahead with K = 10 independent rollouts, and a horizon of J = 64 with K = 16 independent rollouts. Throughout the experiments, a deep ensemble of 5/7/10 neural networks is used, as no appreciable improvement was observed for ensembles larger than 10. Discrete states are given a continuous representation as in Osband et al. (2016) and Shyam et al. (2019). The baselines are generally vanilla PPO and SAC, plus versions augmented with intrinsic curiosity. |
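To make the reported setup concrete, the following is a minimal toy sketch of a PTS-BE-style planning loop: an ensemble of dynamics models, K sampled rollouts of horizon J, and an exploration score computed along each imagined trajectory. Everything here beyond the ensemble/J/K structure is an assumption, not the paper's method: the linear toy dynamics, the random-shooting action sampling, and the use of ensemble prediction variance as the exploration bonus are stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dynamics ensemble: each "model" is a random linear map,
# standing in for the paper's deep ensemble of 5-10 neural networks.
STATE_DIM, ACTION_DIM, ENSEMBLE = 2, 1, 5
models = [
    (np.eye(STATE_DIM) + 0.1 * rng.normal(size=(STATE_DIM, STATE_DIM)),
     rng.normal(size=(STATE_DIM, ACTION_DIM)))
    for _ in range(ENSEMBLE)
]

def step(model, s, a):
    # One imagined transition under a single ensemble member.
    A, B = model
    return A @ s + B @ a

def disagreement(s, a):
    # Assumed exploration bonus: variance of next-state predictions
    # across the ensemble (a proxy for epistemic uncertainty).
    preds = np.stack([step(m, s, a) for m in models])
    return preds.var(axis=0).sum()

def pts_be_action(s0, horizon=10, n_rollouts=16):
    # Sample K random action sequences of length J, propagate each under
    # one ensemble member (a crude posterior-sample stand-in), score by
    # cumulative disagreement, and return the first action of the best rollout.
    best_score, best_first_action = -np.inf, None
    for k in range(n_rollouts):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, ACTION_DIM))
        model = models[k % ENSEMBLE]
        s, score = s0, 0.0
        for a in actions:
            score += disagreement(s, a)
            s = step(model, s, a)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action

a = pts_be_action(np.zeros(STATE_DIM), horizon=10, n_rollouts=16)
print(a.shape)  # (1,)
```

With the paper's reported settings, `horizon` would be 100 with `n_rollouts=10` in one specification and 64 with `n_rollouts=16` in the other; the actual method replans this way at every environment step.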