Simple Policy Optimization

Authors: Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, Renjing Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end. Section 6 (Experiments): "We report results on the Atari 2600 (Bellemare et al., 2013; Machado et al., 2018) and MuJoCo (Todorov et al., 2012) benchmarks."
Researcher Affiliation | Academia | (1) The Hong Kong University of Science and Technology (Guangzhou); (2) Beijing Innovation Center of Humanoid Robotics; (3) ETH Zurich.
Pseudocode | Yes | Algorithm 1: Simple Policy Optimization (SPO).
Open Source Code | Yes | Code is available at Simple-Policy-Optimization.
Open Datasets | Yes | "We report results on the Atari 2600 (Bellemare et al., 2013; Machado et al., 2018) and MuJoCo (Todorov et al., 2012) benchmarks."
Dataset Splits | No | normalized(score) = (score - min) / (max - min), where max and min denote the maximum and minimum validation returns of PPO-Clip during training, respectively (Section 6.1). No specific percentages or methodologies for splitting datasets into train/validation/test sets are provided.
Hardware Specification | No | The paper mentions "GPU-based physics simulation" in a citation (Isaac Gym; Makoviychuk et al., 2021) as a tool used by others, but does not provide specific hardware details (GPU models, CPU types, etc.) for the authors' own experiments.
Software Dependencies | No | "In all our experiments, we utilize the RL library Gymnasium (Towers et al., 2024), which serves as a central abstraction to ensure broad interoperability between benchmark environments and training algorithms." While Gymnasium is mentioned, no specific version number is provided for it or for any other software library, framework, or solver used in the implementation.
Experiment Setup | Yes | Table 2 ("Detailed hyperparameters used in SPO") lists hyperparameters for both the Atari 2600 and MuJoCo environments, including number of workers, horizon, learning rate, optimizer (Adam), total steps, batch size, update epochs, mini-batches, mini-batch size, GAE parameter λ, discount factor γ, value loss coefficient c1, entropy loss coefficient c2, and probability ratio hyperparameter ε.
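Two of the rows above carry concrete arithmetic: the Section 6.1 score normalization, and the usual relation between workers, horizon, batch size, and mini-batch size implied by the Table 2 hyperparameter names. The sketch below illustrates both; the function name `normalize_score` and the numeric values are illustrative placeholders, not the paper's actual Table 2 settings, and the batch-size relation is the standard on-policy convention, assumed rather than quoted from the paper.

```python
def normalize_score(score: float, min_return: float, max_return: float) -> float:
    """Min-max normalization from Section 6.1:
    normalized(score) = (score - min) / (max - min),
    where min/max are the minimum and maximum validation returns
    of PPO-Clip observed during training."""
    return (score - min_return) / (max_return - min_return)


# Standard on-policy bookkeeping (assumed, not quoted from the paper):
# one rollout collects (workers x horizon) transitions, which are then
# split into mini-batches. Values here are generic placeholders.
num_workers = 8
horizon = 128
num_minibatches = 4

batch_size = num_workers * horizon          # 8 * 128 = 1024 transitions
mini_batch_size = batch_size // num_minibatches  # 1024 / 4 = 256
```

A score equal to PPO-Clip's best validation return normalizes to 1.0, and one equal to its worst normalizes to 0.0, which is what makes the cross-environment comparisons in Section 6 possible.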