On Efficient Bayesian Exploration in Model-Based Reinforcement Learning

Authors: Alberto Caron, Vasilios Mavroudis, Chris Hicks

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We empirically demonstrate that PTS-BE substantially outperforms other baselines across a variety of environments characterized by sparse rewards and/or purely exploratory tasks." |
| Researcher Affiliation | Academia | Alberto Caron (EMAIL), The Alan Turing Institute, London, UK; Chris Hicks (EMAIL), The Alan Turing Institute, London, UK; Vasilios Mavroudis (EMAIL), The Alan Turing Institute, London, UK |
| Pseudocode | Yes | Algorithm 1: Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor a link to a code repository; it focuses solely on the methodology and experimental results. |
| Open Datasets | Yes | Point Maze environments: four types of 2D Gymnasium Robotics Point Maze structures (Fu et al., 2020) of increasing complexity. Ant Maze environments: two maze environments from the Gymnasium Robotics suite with higher-dimensional state-action spaces, likewise originally introduced in Fu et al. (2020). |
| Dataset Splits | No | The experiments use standard RL environments ('Mountain Car', 'Unichain', 'Point Maze', 'Ant Maze'), which have predefined dynamics rather than fixed datasets. No training/validation/test splits are reported; results are given only as the cumulative fraction of visited states or average returns over 20 replications. |
| Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., GPU models, CPU types, memory); it only describes the algorithms and environments. |
| Software Dependencies | No | The paper names specific software components, e.g., Proximal Policy Optimization (PPO) (Schulman et al., 2017), Soft Actor-Critic (SAC) (Haarnoja et al., 2018), and Gymnasium Robotics Point Maze structures (Fu et al., 2020), but it gives no version numbers for these or any other libraries/frameworks, which reproducibility would require. |
| Experiment Setup | Yes | Two PTS-BE specifications are reported: a horizon of J = 100 steps ahead with K = 10 independent rollouts, and a horizon of J = 64 with K = 16 independent rollouts. Throughout the experiments, a deep ensemble of 5/7/10 neural networks is used, as no appreciable improvement was observed for ensembles larger than 10. Discrete states are given a continuous representation as in Osband et al. (2016) and Shyam et al. (2019). The baselines are generally vanilla PPO and SAC, plus versions augmented with intrinsic curiosity. |
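To make the reported setup concrete, the following is a minimal toy sketch of a PTS-BE-style planning loop: an ensemble of dynamics models, K sampled rollouts of horizon J, and an exploration score computed along each imagined trajectory. Everything here beyond the ensemble/J/K structure is an assumption, not the paper's method: the linear toy dynamics, the random-shooting action sampling, and the use of ensemble prediction variance as the exploration bonus are stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dynamics ensemble: each "model" is a random linear map,
# standing in for the paper's deep ensemble of 5-10 neural networks.
STATE_DIM, ACTION_DIM, ENSEMBLE = 2, 1, 5
models = [
    (np.eye(STATE_DIM) + 0.1 * rng.normal(size=(STATE_DIM, STATE_DIM)),
     rng.normal(size=(STATE_DIM, ACTION_DIM)))
    for _ in range(ENSEMBLE)
]

def step(model, s, a):
    # One imagined transition under a single ensemble member.
    A, B = model
    return A @ s + B @ a

def disagreement(s, a):
    # Assumed exploration bonus: variance of next-state predictions
    # across the ensemble (a proxy for epistemic uncertainty).
    preds = np.stack([step(m, s, a) for m in models])
    return preds.var(axis=0).sum()

def pts_be_action(s0, horizon=10, n_rollouts=16):
    # Sample K random action sequences of length J, propagate each under
    # one ensemble member (a crude posterior-sample stand-in), score by
    # cumulative disagreement, and return the first action of the best rollout.
    best_score, best_first_action = -np.inf, None
    for k in range(n_rollouts):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, ACTION_DIM))
        model = models[k % ENSEMBLE]
        s, score = s0, 0.0
        for a in actions:
            score += disagreement(s, a)
            s = step(model, s, a)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action

a = pts_be_action(np.zeros(STATE_DIM), horizon=10, n_rollouts=16)
print(a.shape)  # (1,)
```

With the paper's reported settings, `horizon` would be 100 with `n_rollouts=10` in one specification and 64 with `n_rollouts=16` in the other; the actual method replans this way at every environment step.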