OptionZero: Planning with Learned Options

Authors: Po-Wei Huang, Pei-Chiun Peng, Hung Guei, Ti-Rong Wu

ICLR 2025

Reproducibility Assessment (each entry: Variable: Result, followed by the LLM response)
Research Type: Experimental
  LLM response: Empirical experiments conducted on 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score.
Researcher Affiliation: Academia
  LLM response: 1. Institute of Information Science, Academia Sinica, Taiwan; 2. Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan
Pseudocode: No
  LLM response: The paper describes the modifications to MCTS in Section 4.2 using prose and mathematical equations but does not include a distinct, labeled pseudocode or algorithm block.
Open Source Code: Yes
  LLM response: The source code, scripts for processing behavior analysis, and trained models are available at https://rlg.iis.sinica.edu.tw/papers/optionzero.
Open Datasets: Yes
  LLM response: We conduct experiments on Atari games, which are visually complex environments with relatively small frame differences between states, making them suitable for learning options.
Dataset Splits: No
  LLM response: The paper mentions training on 'Atari games' and using a 'self-play process [that] collects game trajectories' for training. It does not explicitly define traditional training, validation, and test splits for these games or for the GridWorld environment.
Hardware Specification: Yes
  LLM response: The experiments are conducted on machines with 24 CPU cores and four NVIDIA GTX 1080 Ti GPUs.
Software Dependencies: No
  LLM response: The paper states, 'Our OptionZero implementation, which is built upon a publicly available MuZero framework (Wu et al., 2025).' However, it does not specify version numbers for any software libraries or dependencies (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup: Yes
  LLM response: Detailed experiment setups are provided in Appendix B. In this section, we describe the details for training OptionZero models used in the experiments. The experiments are conducted on machines with 24 CPU cores and four NVIDIA GTX 1080 Ti GPUs. For the training configurations, we generally follow those in MuZero; the hyperparameters are listed in Table 4.

  Table 4: Hyperparameters for training (where two values appear, separated by "/", the table reports one per training configuration):
    Optimizer: SGD
    Learning rate: 0.1
    Momentum: 0.9
    Weight decay: 0.0001
    Discount factor: 0.997
    Priority exponent (α): 1
    Priority correction (β): 0.4
    Bootstrap step (n-step return): 5
    MCTS simulations: 50
    Softmax temperature: 1
    Frame skip: 4
    Frames stacked: 4
    Iterations: 300 / 400
    Training steps: 60k / 80k
    Batch size: 512 / 1024
    # Blocks: 2 / 1
    Replay buffer size: 1M frames / 8k games
    Max frames per episode: 108k
    Dirichlet noise ratio: 0.25 / 0.3
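For scripting, the Table 4 settings can be collected into a plain configuration dictionary. This is only an illustrative sketch: the key names are chosen for this example, and parameters with two reported values are kept as tuples without assigning either value to a specific setting.

```python
# Illustrative sketch of the Table 4 hyperparameters as a config dict.
# Key names are this example's own; tuples hold the two values Table 4
# reports for its two training configurations (pairing not asserted here).
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "learning_rate": 0.1,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "discount_factor": 0.997,
    "priority_exponent_alpha": 1,
    "priority_correction_beta": 0.4,
    "n_step_return": 5,
    "mcts_simulations": 50,
    "softmax_temperature": 1,
    "frame_skip": 4,
    "frames_stacked": 4,
    "iterations": (300, 400),
    "training_steps": (60_000, 80_000),
    "batch_size": (512, 1024),
    "num_blocks": (2, 1),
    "max_frames_per_episode": 108_000,
    "dirichlet_noise_ratio": (0.25, 0.3),
}
```

Keeping the two-valued entries as tuples makes it explicit that the paper trains under two configurations rather than silently picking one.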
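The mean human-normalized score reported above follows the standard Atari convention: an agent's raw score is rescaled so that 0 corresponds to a random policy and 1 to human-level play, then averaged over games. A minimal sketch of that computation, using hypothetical per-game scores (not taken from the paper):

```python
def human_normalized_score(agent, random_baseline, human):
    """Standard Atari normalization: 0.0 = random policy, 1.0 = human-level."""
    return (agent - random_baseline) / (human - random_baseline)

# Hypothetical raw scores for two games (illustrative values only).
games = {
    "GameA": {"agent": 400.0, "random": 1.7, "human": 30.5},
    "GameB": {"agent": 20.0, "random": -20.7, "human": 14.6},
}

# Mean human-normalized score across the evaluated games.
mean_hns = sum(
    human_normalized_score(g["agent"], g["random"], g["human"])
    for g in games.values()
) / len(games)
```

A relative improvement such as the 131.58% figure compares two agents' mean human-normalized scores, not their raw game scores.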