Simple Policy Optimization
Authors: Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, Renjing Xu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end. Section 6. Experiments: We report results on the Atari 2600 (Bellemare et al., 2013; Machado et al., 2018) and MuJoCo (Todorov et al., 2012) benchmarks. |
| Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology (Guangzhou) 2Beijing Innovation Center of Humanoid Robotics 3ETH Zurich. |
| Pseudocode | Yes | Algorithm 1 Simple Policy Optimization (SPO) |
| Open Source Code | Yes | Code is available at Simple-Policy-Optimization. |
| Open Datasets | Yes | We report results on the Atari 2600 (Bellemare et al., 2013; Machado et al., 2018) and MuJoCo (Todorov et al., 2012) benchmarks. |
| Dataset Splits | No | normalized(score) = (score − min) / (max − min), where max and min represent the maximum and minimum validation returns of PPO-Clip during training, respectively. (Section 6.1) No specific percentages or methodologies for splitting datasets into train/validation/test sets are provided. |
| Hardware Specification | No | The paper mentions "GPU-based physics simulation" in a citation (Isaac Gym: Makoviychuk et al., 2021) as a tool used by others, but does not provide specific hardware details (GPU models, CPU types, etc.) used for the authors' own experiments. |
| Software Dependencies | No | In all our experiments, we utilize the RL library Gymnasium (Towers et al., 2024), which serves as a central abstraction to ensure broad interoperability between benchmark environments and training algorithms. While Gymnasium is mentioned, no specific version number is provided for it or any other software libraries, frameworks, or solvers used in the implementation. |
| Experiment Setup | Yes | Table 2. Detailed hyperparameters used in SPO. This table lists various hyperparameters for both Atari 2600 and MuJoCo environments, including Number of workers, Horizon, Learning rate, Optimizer (Adam), Total steps, Batch size, Update epochs, Mini-batches, Mini-batch size, GAE parameter λ, Discount factor γ, Value loss coefficient c1, Entropy loss coefficient c2, and Probability ratio hyperparameter ϵ. |
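The min-max normalization quoted in the Dataset Splits row can be sketched as follows; the function and argument names here are illustrative, with `ppo_min` and `ppo_max` standing for the minimum and maximum validation returns of PPO-Clip during training (Section 6.1).

```python
def normalized_score(score: float, ppo_min: float, ppo_max: float) -> float:
    """Min-max normalize a return against PPO-Clip's validation returns.

    A score equal to PPO-Clip's worst validation return maps to 0.0,
    and a score equal to its best maps to 1.0.
    """
    return (score - ppo_min) / (ppo_max - ppo_min)
```

Scores above `ppo_max` exceed 1.0, which is how an algorithm outperforming the PPO-Clip baseline would show up under this scheme.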