OptionZero: Planning with Learned Options
Authors: Po-Wei Huang, Pei-Chiun Peng, Hung Guei, Ti-Rong Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments conducted on 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score. |
| Researcher Affiliation | Academia | 1Institute of Information Science, Academia Sinica, Taiwan 2Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan |
| Pseudocode | No | The paper describes the modifications to MCTS in section 4.2 using prose and mathematical equations but does not include a distinct, labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The source code, scripts for processing behavior analysis, and trained models are available at https://rlg.iis.sinica.edu.tw/papers/optionzero. |
| Open Datasets | Yes | We conduct experiments on Atari games, which are visually complex environments with relatively small frame differences between states, making them suitable for learning options. |
| Dataset Splits | No | The paper mentions training on 'Atari games' and using a 'self-play process [that] collects game trajectories' for training. It does not explicitly define traditional training, validation, and test splits for these games or for the GridWorld environment. |
| Hardware Specification | Yes | The experiments are conducted on machines with 24 CPU cores and four NVIDIA GTX 1080 Ti GPUs. |
| Software Dependencies | No | The paper states, 'Our OptionZero implementation, which is built upon a publicly available MuZero framework (Wu et al., 2025).' However, it does not specify version numbers for any software libraries or dependencies (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | Detailed experiment setups are provided in Appendix B. In this section, we describe the details for training OptionZero models used in the experiments. The experiments are conducted on machines with 24 CPU cores and four NVIDIA GTX 1080 Ti GPUs. For the training configurations, we generally follow those in MuZero, where the hyperparameters are listed in Table 4. Table 4 (hyperparameters for training; entries with two values correspond to the paper's two training settings): optimizer SGD; learning rate 0.1; momentum 0.9; weight decay 0.0001; discount factor 0.997; priority exponent (α) 1; priority correction (β) 0.4; bootstrap step (n-step return) 5; MCTS simulations 50; softmax temperature 1; frames skipped 4; frames stacked 4; iterations 300 / 400; training steps 60k / 80k; batch size 512 / 1024; # blocks 2 / 1; replay buffer size 1M frames / 8k games; max frames per episode 108k; Dirichlet noise ratio 0.25 / 0.3. |
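The hyperparameter listing in the Experiment Setup row can be expressed as a configuration dictionary. The sketch below is a minimal illustration, assuming the paired values in Table 4 correspond to the paper's two training settings; the names `COMMON`, `SETTING_A`, `SETTING_B`, and `make_config` are illustrative and do not come from the authors' released code.

```python
# Hedged sketch: hyperparameters from Table 4, expressed as plain Python
# dicts. "SETTING_A"/"SETTING_B" are illustrative labels for the two
# value columns reported in the paper, not names from the released code.

COMMON = {
    "optimizer": "SGD",
    "learning_rate": 0.1,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "discount_factor": 0.997,
    "priority_exponent_alpha": 1.0,
    "priority_correction_beta": 0.4,
    "n_step_return": 5,
    "mcts_simulations": 50,
    "softmax_temperature": 1.0,
    "frames_skip": 4,
    "frames_stacked": 4,
    "max_frames_per_episode": 108_000,
}

# Parameters that take different values in the two reported settings.
SETTING_A = {
    "iterations": 300,
    "training_steps": 60_000,
    "batch_size": 512,
    "num_blocks": 2,
    "replay_buffer_size": "1M frames",
    "dirichlet_noise_ratio": 0.25,
}

SETTING_B = {
    "iterations": 400,
    "training_steps": 80_000,
    "batch_size": 1024,
    "num_blocks": 1,
    "replay_buffer_size": "8k games",
    "dirichlet_noise_ratio": 0.3,
}


def make_config(setting):
    """Merge the shared hyperparameters with one setting's overrides."""
    return {**COMMON, **setting}


config = make_config(SETTING_A)
```

Keeping shared values in one place and merging per-setting overrides avoids duplicating the table across experiment scripts.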