Double Horizon Model-Based Policy Optimization
Authors: Akihiro Kubo, Paavo Parmas, Shin Ishii
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on continuous-control benchmarks (Section 4), we demonstrate that DHMBPO not only surpasses existing MBRL methods in sample efficiency but also achieves lower runtime due to a reduced UTD ratio (Hiraoka et al., 2022). Notably, DHMBPO achieved comparable sample efficiency to the state-of-the-art MACURA (Frauenknecht et al., 2024) algorithm on the Gymnasium (Towers et al., 2023) tasks while requiring only one-sixteenth of the runtime on average, all using a shared set of hyperparameters (see Appendix A). |
| Researcher Affiliation | Academia | Akihiro Kubo¹,² (¹ Advanced Telecommunications Research Institute, ² Kyoto University); Paavo Parmas³ (³ The University of Tokyo); Shin Ishii¹,² (¹ Advanced Telecommunications Research Institute, ² Kyoto University) |
| Pseudocode | Yes | Algorithm 1 Double Horizon Model-Based Policy Optimization |
| Open Source Code | Yes | Our code is available at https://github.com/4kubo/erl_lib. |
| Open Datasets | Yes | We evaluate the DHMBPO algorithm on a suite of MuJoCo-based (Todorov et al., 2012) continuous control tasks from Gymnasium (GYM) (Towers et al., 2023) and DMControl (DMC) (Tunyasuvunakool et al., 2020). |
| Dataset Splits | Yes | Evaluation Protocol. After x environment steps, we measure the algorithm's performance using a test return. Specifically, the test return for DHMBPO is computed as the sample mean of the cumulative rewards over 10 episodes, whereas some other methods use fewer episodes (see Appendix B for details). |
| Hardware Specification | Yes | Each experiment was executed until 500K environment steps on a system configured with 8 NVIDIA RTX A4000 16GB GPUs. |
| Software Dependencies | No | The paper names specific environments and benchmarks (MuJoCo, Gymnasium, DMControl) and refers to the AdamW optimizer (Loshchilov and Hutter, 2019) and to techniques such as Layer Normalization (Ba et al., 2016) and Dropout (Srivastava et al., 2014), but it does not provide version numbers for the software libraries, programming languages, or other implementation tools. |
| Experiment Setup | Yes | Table 2: Hyperparameters commonly set for DHMBPO and SAC across all experiments (the first half of the table is shared between DHMBPO and SAC). Discount factor: 0.995; Seed steps: 5000; Action repeat: 1 (Gym), 2 (DMControl); Batch size: 256; Update-to-data ratio: 1; Replay buffer size: 1M; Learning rate for the actor, critics, and α: 3×10⁻⁴; Initial value of α: 0.1; Momentum coefficient c for target critic: 0.995; Critic ensemble size: 5; Length of DR D: 20; Length of training rollout T: 5; Iterations per DR: 20; Model ensemble size: 8; Model optimizer: AdamW (Loshchilov and Hutter, 2019); Model learning rate: 1×10⁻³ |
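The shared hyperparameters in the table above can be collected into a single config for reproduction. The sketch below is a minimal Python rendering of Table 2's values; all key names are our own choice and are hypothetical, not identifiers from the authors' erl_lib codebase.

```python
# Hedged sketch: Table 2's shared DHMBPO/SAC hyperparameters as a config dict.
# Key names are hypothetical; only the values come from the paper.
DHMBPO_SHARED_CONFIG = {
    "discount": 0.995,                  # discount factor
    "seed_steps": 5000,                 # random-action warmup steps
    "action_repeat": {"gym": 1, "dmcontrol": 2},
    "batch_size": 256,
    "utd_ratio": 1,                     # update-to-data ratio
    "replay_buffer_size": 1_000_000,
    "lr_actor_critic_alpha": 3e-4,      # learning rate for actor, critics, and alpha
    "alpha_init": 0.1,                  # initial entropy coefficient
    "target_critic_momentum": 0.995,    # momentum coefficient c
    "critic_ensemble_size": 5,
    "dr_length_D": 20,                  # length of distribution rollout (DR)
    "training_rollout_T": 5,            # length of training rollout
    "iterations_per_dr": 20,
    "model_ensemble_size": 8,
    "model_optimizer": "AdamW",         # Loshchilov and Hutter, 2019
    "model_lr": 1e-3,
}
```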