Proximal Policy Distillation

Authors: Giacomo Spigler

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, Mujoco, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: https://github.com/spiglerg/sb3_distill.
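The comparison above centers on how the distillation loss is combined with on-policy updates. As a minimal, hypothetical sketch (not the paper's implementation): PPD-style distillation can be pictured as adding a KL term toward the teacher's action distribution on top of the clipped PPO surrogate. Function names, the `distill_coef` weight, and the discrete-action setup are illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over action logits."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ppd_style_loss(student_logits, teacher_logits, old_logp, actions,
                   advantages, clip_eps=0.2, distill_coef=1.0):
    """Clipped PPO surrogate plus a KL(teacher || student) distillation term.

    This is an illustrative combination, not the exact objective from the
    paper; `distill_coef` trades off reward maximization vs. imitation.
    """
    probs = softmax(student_logits)
    logp = np.log(probs[np.arange(len(actions)), actions])

    # Standard PPO clipped surrogate on the student's own rollouts.
    ratio = np.exp(logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_loss = -np.mean(np.minimum(ratio * advantages, clipped * advantages))

    # Distillation: KL divergence from teacher to student action distribution.
    teacher_probs = softmax(teacher_logits)
    kl = np.sum(teacher_probs * (np.log(teacher_probs) - np.log(probs)), axis=-1)

    return ppo_loss + distill_coef * np.mean(kl)
```

When student and teacher logits agree, the KL term vanishes and the objective reduces to plain PPO, which matches the intuition that the distillation pressure disappears once the student matches the teacher.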
Researcher Affiliation | Academia | Giacomo Spigler, EMAIL, AI for Robotics Lab (AIR-Lab), Department of Cognitive Science and Artificial Intelligence, Tilburg University
Pseudocode | Yes | The full algorithm is reported in Appendix A (Algorithm 1). We include full algorithm listings for the three distillation methods compared in this work. PPD is shown in Algorithm 1, student-distill in Algorithm 2, and teacher-distill in Algorithm 3.
Open Source Code | Yes | The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: https://github.com/spiglerg/sb3_distill.
Open Datasets | Yes | Evaluation was performed on environments from Atari, Mujoco and Procgen because they span three axes that most affect performance in reinforcement learning and policy distillation: (i) state-space complexity: low-dimensional states (Mujoco) vs. high-dimensional pixel observations (Atari & Procgen); (ii) action spaces: discrete (Atari & Procgen) vs. continuous (Mujoco); (iii) out-of-distribution generalization: identical train/test environment (Atari & Mujoco) vs. procedurally generated train/test splits (Procgen).
Dataset Splits | Yes | Evaluation was executed in a test setting (which in the case of Procgen corresponds to using a different set of levels), where the distilled students were used to interact with the environment. Actions were chosen deterministically, instead of the stochastic action selection used during training, except for Procgen, where we observed that deterministic policies were prone to getting stuck, leading to lower performance for all agents. Results for Atari environments are reported as human-normalized scores, using base values from Badia et al. (2020).
Hardware Specification | No | This research was supported by SURF grant EINF-5635. We gratefully acknowledge SURF (https://www.surf.nl) for providing access to the National Supercomputer Snellius.
Software Dependencies | No | The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: https://github.com/spiglerg/sb3_distill.
Experiment Setup | Yes | Full details of the training procedure, hyperparameters, and network architectures are provided in Appendix A.2. The PPO hyperparameters are shown in Table 3. The hyperparameters of PPD related to PPO were the same as for the teacher training, except we used γ = 0.999 during distillation (γ = 0.995 for swimmer and hopper), ent_coef=0, and shorter rollout trajectories (n_steps=64 for PPD, and n_steps=5 for student-distill and teacher-distill).
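The reported deviations from the teacher's PPO settings can be collected into a small override table. This is a hypothetical sketch: the key names follow stable-baselines3 conventions (`gamma`, `ent_coef`, `n_steps`), but the dictionary itself and its layout are illustrative, not taken from the sb3_distill codebase.

```python
# Illustrative per-method overrides applied on top of the teacher's PPO
# hyperparameters during distillation, as reported in the paper's text.
# (swimmer and hopper use gamma=0.995 instead of 0.999.)
DISTILL_OVERRIDES = {
    "ppd":             {"gamma": 0.999, "ent_coef": 0.0, "n_steps": 64},
    "student_distill": {"gamma": 0.999, "ent_coef": 0.0, "n_steps": 5},
    "teacher_distill": {"gamma": 0.999, "ent_coef": 0.0, "n_steps": 5},
}


def distill_kwargs(method, base_kwargs):
    """Merge the teacher's PPO kwargs with the method-specific overrides."""
    return {**base_kwargs, **DISTILL_OVERRIDES[method]}
```

A dict merge like this keeps the teacher's configuration as the single source of truth, with distillation runs differing only in the few parameters the paper calls out.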