Double Horizon Model-Based Policy Optimization

Authors: Akihiro Kubo, Paavo Parmas, Shin Ishii

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In experiments on continuous-control benchmarks (Section 4), we demonstrate that DHMBPO not only surpasses existing MBRL methods in sample efficiency but also achieves lower runtime due to a reduced UTD ratio (Hiraoka et al., 2022). Notably, DHMBPO achieved comparable sample efficiency to the state-of-the-art MACURA (Frauenknecht et al., 2024) algorithm on the Gymnasium (Towers et al., 2023) tasks while requiring only one-sixteenth of the runtime on average, all using a shared set of hyperparameters (see Appendix A).
Researcher Affiliation Academia Akihiro Kubo (1, 2), Paavo Parmas (3), Shin Ishii (1, 2). Affiliations: (1) Advanced Telecommunications Research Institute; (2) Kyoto University; (3) The University of Tokyo.
Pseudocode Yes Algorithm 1 Double Horizon Model-Based Policy Optimization
Open Source Code Yes Our code is available at https://github.com/4kubo/erl_lib.
Open Datasets Yes We evaluate the DHMBPO algorithm on a suite of MuJoCo-based (Emanuel et al., 2012) continuous control tasks from Gymnasium (GYM) (Towers et al., 2023) and DMControl (DMC) (Tunyasuvunakool et al., 2020).
Dataset Splits Yes Evaluation Protocol. After x environment steps, we measure the algorithm's performance using a test return. Specifically, the test return for DHMBPO is computed as the sample mean of the cumulative rewards over 10 episodes, whereas some other methods use fewer episodes (see Appendix B for details).
Hardware Specification Yes Each experiment was executed until 500K environment steps on a system configured with 8 NVIDIA RTX A4000 16GB GPUs.
Software Dependencies No The paper mentions specific environments and benchmarks such as MuJoCo, Gymnasium, and DMControl, and also refers to optimizers like AdamW (Loshchilov and Hutter, 2019) and methods like Layer Normalization (Ba et al., 2016) and Dropout (Srivastava et al., 2014), but it does not provide specific version numbers for the software libraries, programming languages, or other tools used for implementation.
Experiment Setup Yes Table 2: Hyperparameters commonly set for DHMBPO and SAC across all experiments. The first half of the table presents the hyperparameters shared between DHMBPO and SAC.
Hyper-parameter | Value
Discount factor | 0.995
Seed steps | 5000
Action repeat | 1 (Gym), 2 (DMControl)
Batch size | 256
Update-to-data ratio | 1
Replay buffer size | 1M
Learning rate for the actor, critics and α | 3 × 10^-4
Initial value of α | 0.1
Momentum coefficient c for target critic | 0.995
Ensemble size of critic | 5
Length of DR D | 20
Length of training rollout T | 5
Iteration per DR | 20
Ensemble size of model | 8
Optimizer for training model | AdamW (Loshchilov and Hutter, 2019)
Learning rate for model | 1 × 10^-3
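
For readers who want to reuse these settings, the Table 2 values can be collected in a small configuration object. The following is a minimal sketch only: the class name `DHMBPOConfig`, its field names, and the `action_repeat` helper are hypothetical and are not taken from the authors' repository; only the numeric values come from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DHMBPOConfig:
    """Hypothetical container for the Table 2 values (field names are illustrative)."""

    discount: float = 0.995               # Discount factor
    seed_steps: int = 5000                # Seed steps
    batch_size: int = 256                 # Batch size
    utd_ratio: int = 1                    # Update-to-data ratio
    replay_buffer_size: int = 1_000_000   # Replay buffer size (1M)
    learning_rate: float = 3e-4           # Actor, critics and alpha
    init_alpha: float = 0.1               # Initial value of alpha
    target_critic_momentum: float = 0.995 # Momentum coefficient c for target critic
    critic_ensemble_size: int = 5         # Ensemble size of critic
    dr_length: int = 20                   # Length of DR, D
    tr_length: int = 5                    # Length of training rollout, T
    iterations_per_dr: int = 20           # Iteration per DR
    model_ensemble_size: int = 8          # Ensemble size of model
    model_optimizer: str = "AdamW"        # Optimizer for training model
    model_learning_rate: float = 1e-3     # Learning rate for model


# Action repeat depends on the benchmark suite: 1 for Gym, 2 for DMControl.
# The suite keys used here are assumed names, not identifiers from the paper.
ACTION_REPEAT = {"gym": 1, "dmcontrol": 2}


def action_repeat(suite: str) -> int:
    """Return the Table 2 action-repeat value for a benchmark suite name."""
    return ACTION_REPEAT[suite.lower()]


if __name__ == "__main__":
    cfg = DHMBPOConfig()
    print(cfg)
    print("action repeat on DMControl:", action_repeat("DMControl"))
```

Freezing the dataclass reflects the paper's statement that one shared set of hyperparameters is used across all experiments; only the action repeat varies by suite, so it is kept outside the class.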