Double Horizon Model-Based Policy Optimization
Authors: Akihiro Kubo, Paavo Parmas, Shin Ishii
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on continuous-control benchmarks (Section 4), we demonstrate that DHMBPO not only surpasses existing MBRL methods in sample efficiency but also achieves lower runtime due to a reduced UTD ratio (Hiraoka et al., 2022). Notably, DHMBPO achieved comparable sample efficiency to the state-of-the-art MACURA (Frauenknecht et al., 2024) algorithm on the Gymnasium (Towers et al., 2023) tasks while requiring only one-sixteenth of the runtime on average, all using a shared set of hyperparameters (see Appendix A). |
| Researcher Affiliation | Academia | Akihiro Kubo¹,² (¹ Advanced Telecommunications Research Institute, ² Kyoto University); Paavo Parmas³ (³ The University of Tokyo); Shin Ishii¹,² (¹ Advanced Telecommunications Research Institute, ² Kyoto University) |
| Pseudocode | Yes | Algorithm 1 Double Horizon Model-Based Policy Optimization |
| Open Source Code | Yes | Our code is available at https://github.com/4kubo/erl_lib. |
| Open Datasets | Yes | We evaluate the DHMBPO algorithm on a suite of MuJoCo-based (Todorov et al., 2012) continuous control tasks from Gymnasium (GYM) (Towers et al., 2023) and DMControl (DMC) (Tunyasuvunakool et al., 2020). |
| Dataset Splits | Yes | Evaluation Protocol. After x environment steps, we measure the algorithm's performance using a test return. Specifically, the test return for DHMBPO is computed as the sample mean of the cumulative rewards over 10 episodes, whereas some other methods use fewer episodes (see Appendix B for details). |
| Hardware Specification | Yes | Each experiment was executed until 500K environment steps on a system configured with 8 NVIDIA RTX A4000 16GB GPUs. |
| Software Dependencies | No | The paper names specific environments and benchmarks (MuJoCo, Gymnasium, DMControl) and refers to the AdamW optimizer (Loshchilov and Hutter, 2019) and to techniques such as Layer Normalization (Ba et al., 2016) and Dropout (Srivastava et al., 2014), but it does not provide version numbers for the software libraries, programming languages, or other implementation tools. |
| Experiment Setup | Yes | Table 2: Hyperparameters commonly set for DHMBPO and SAC across all experiments (the first half of the table is shared between DHMBPO and SAC). Discount factor: 0.995; Seed steps: 5000; Action repeat: 1 (Gym), 2 (DMControl); Batch size: 256; Update-to-data ratio: 1; Replay buffer size: 1M; Learning rate for the actor, critics, and α: 3×10⁻⁴; Initial value of α: 0.1; Momentum coefficient c for target critic: 0.995; Critic ensemble size: 5; Length of DR D: 20; Length of training rollout T: 5; Iterations per DR: 20; Model ensemble size: 8; Model optimizer: AdamW (Loshchilov and Hutter, 2019); Model learning rate: 1×10⁻³ |
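The shared hyperparameters in the table above can be collected into a single config for reproduction. The sketch below is a minimal Python rendering of Table 2's values; all key names are our own choice and are hypothetical, not identifiers from the authors' erl_lib codebase.

```python
# Hedged sketch: Table 2's shared DHMBPO/SAC hyperparameters as a config dict.
# Key names are hypothetical; only the values come from the paper.
DHMBPO_SHARED_CONFIG = {
    "discount": 0.995,                  # discount factor
    "seed_steps": 5000,                 # random-action warmup steps
    "action_repeat": {"gym": 1, "dmcontrol": 2},
    "batch_size": 256,
    "utd_ratio": 1,                     # update-to-data ratio
    "replay_buffer_size": 1_000_000,
    "lr_actor_critic_alpha": 3e-4,      # learning rate for actor, critics, and alpha
    "alpha_init": 0.1,                  # initial entropy coefficient
    "target_critic_momentum": 0.995,    # momentum coefficient c
    "critic_ensemble_size": 5,
    "dr_length_D": 20,                  # length of distribution rollout (DR)
    "training_rollout_T": 5,            # length of training rollout
    "iterations_per_dr": 20,
    "model_ensemble_size": 8,
    "model_optimizer": "AdamW",         # Loshchilov and Hutter, 2019
    "model_lr": 1e-3,
}
```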