Simple Policy Optimization

Authors: Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, Renjing Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end. Section 6 (Experiments): "We report results on the Atari 2600 (Bellemare et al., 2013; Machado et al., 2018) and MuJoCo (Todorov et al., 2012) benchmarks."
Researcher Affiliation | Academia | (1) The Hong Kong University of Science and Technology (Guangzhou); (2) Beijing Innovation Center of Humanoid Robotics; (3) ETH Zurich.
Pseudocode | Yes | Algorithm 1: Simple Policy Optimization (SPO).
Open Source Code | Yes | Code is available at Simple-Policy-Optimization.
Open Datasets | Yes | "We report results on the Atari 2600 (Bellemare et al., 2013; Machado et al., 2018) and MuJoCo (Todorov et al., 2012) benchmarks."
Dataset Splits | No | normalized(score) = (score - min) / (max - min), where max and min denote the maximum and minimum validation returns of PPO-Clip during training, respectively (Section 6.1). No specific percentages or methodologies for splitting datasets into train/validation/test sets are provided.
Hardware Specification | No | The paper mentions "GPU-based physics simulation" in a citation (Isaac Gym; Makoviychuk et al., 2021) as a tool used by others, but does not provide specific hardware details (GPU models, CPU types, etc.) for the authors' own experiments.
Software Dependencies | No | "In all our experiments, we utilize the RL library Gymnasium (Towers et al., 2024), which serves as a central abstraction to ensure broad interoperability between benchmark environments and training algorithms." While Gymnasium is mentioned, no specific version number is provided for it or for any other software library, framework, or solver used in the implementation.
Experiment Setup | Yes | Table 2 ("Detailed hyperparameters used in SPO") lists hyperparameters for both the Atari 2600 and MuJoCo environments, including number of workers, horizon, learning rate, optimizer (Adam), total steps, batch size, update epochs, mini-batches, mini-batch size, GAE parameter λ, discount factor γ, value loss coefficient c1, entropy loss coefficient c2, and probability ratio hyperparameter ε.
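Two of the rows above carry concrete arithmetic: the Section 6.1 score normalization, and the usual relation between workers, horizon, batch size, and mini-batch size implied by the Table 2 hyperparameter names. The sketch below illustrates both; the function name `normalize_score` and the numeric values are illustrative placeholders, not the paper's actual Table 2 settings, and the batch-size relation is the standard on-policy convention, assumed rather than quoted from the paper.

```python
def normalize_score(score: float, min_return: float, max_return: float) -> float:
    """Min-max normalization from Section 6.1:
    normalized(score) = (score - min) / (max - min),
    where min/max are the minimum and maximum validation returns
    of PPO-Clip observed during training."""
    return (score - min_return) / (max_return - min_return)


# Standard on-policy bookkeeping (assumed, not quoted from the paper):
# one rollout collects (workers x horizon) transitions, which are then
# split into mini-batches. Values here are generic placeholders.
num_workers = 8
horizon = 128
num_minibatches = 4

batch_size = num_workers * horizon          # 8 * 128 = 1024 transitions
mini_batch_size = batch_size // num_minibatches  # 1024 / 4 = 256
```

A score equal to PPO-Clip's best validation return normalizes to 1.0, and one equal to its worst normalizes to 0.0, which is what makes the cross-environment comparisons in Section 6 possible.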