Proximal Policy Distillation

Authors: Giacomo Spigler

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, Mujoco, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: https://github.com/spiglerg/sb3_distill.
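The comparison above centers on how the distillation loss is combined with on-policy updates. As a minimal, hypothetical sketch (not the paper's implementation): PPD-style distillation can be pictured as adding a KL term toward the teacher's action distribution on top of the clipped PPO surrogate. Function names, the `distill_coef` weight, and the discrete-action setup are illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over action logits."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ppd_style_loss(student_logits, teacher_logits, old_logp, actions,
                   advantages, clip_eps=0.2, distill_coef=1.0):
    """Clipped PPO surrogate plus a KL(teacher || student) distillation term.

    This is an illustrative combination, not the exact objective from the
    paper; `distill_coef` trades off reward maximization vs. imitation.
    """
    probs = softmax(student_logits)
    logp = np.log(probs[np.arange(len(actions)), actions])

    # Standard PPO clipped surrogate on the student's own rollouts.
    ratio = np.exp(logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_loss = -np.mean(np.minimum(ratio * advantages, clipped * advantages))

    # Distillation: KL divergence from teacher to student action distribution.
    teacher_probs = softmax(teacher_logits)
    kl = np.sum(teacher_probs * (np.log(teacher_probs) - np.log(probs)), axis=-1)

    return ppo_loss + distill_coef * np.mean(kl)
```

When student and teacher logits agree, the KL term vanishes and the objective reduces to plain PPO, which matches the intuition that the distillation pressure disappears once the student matches the teacher.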
Researcher Affiliation | Academia | Giacomo Spigler, EMAIL, AI for Robotics Lab (AIR-Lab), Department of Cognitive Science and Artificial Intelligence, Tilburg University
Pseudocode | Yes | The full algorithm is reported in Appendix A (Algorithm 1). We include full algorithm listings for the three distillation methods compared in this work. PPD is shown in Algorithm 1, student-distill in Algorithm 2, and teacher-distill in Algorithm 3.
Open Source Code | Yes | The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: https://github.com/spiglerg/sb3_distill.
Open Datasets | Yes | Evaluation was performed on environments from Atari, Mujoco and Procgen because they span three axes that most affect performance in reinforcement learning and policy distillation: (i) state-space complexity: low-dimensional states (Mujoco) vs. high-dimensional pixel observations (Atari & Procgen); (ii) action spaces: discrete (Atari & Procgen) vs. continuous (Mujoco); (iii) out-of-distribution generalization: identical train/test environment (Atari & Mujoco) vs. procedurally generated train/test splits (Procgen).
Dataset Splits | Yes | Evaluation was executed in a test setting (which in the case of Procgen corresponds to using a different set of levels), where the distilled students were used to interact with the environment. Actions were chosen deterministically, instead of the stochastic action selection used during training, except for Procgen, where we observed that deterministic policies were prone to getting stuck, leading to lower performance for all agents. Results for Atari environments are reported as human-normalized scores, using base values from Badia et al. (2020).
Hardware Specification | No | This research was supported by SURF grant EINF-5635. We gratefully acknowledge SURF (https://www.surf.nl) for providing access to the National Supercomputer Snellius.
Software Dependencies | No | The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: https://github.com/spiglerg/sb3_distill.
Experiment Setup | Yes | Full details of the training procedure, hyperparameters, and network architectures are provided in Appendix A.2. The PPO hyperparameters are shown in Table 3. The hyperparameters of PPD related to PPO were the same as for the teacher training, except we used γ = 0.999 during distillation (γ = 0.995 for swimmer and hopper), ent_coef=0, and shorter rollout trajectories (n_steps=64 for PPD, and n_steps=5 for student-distill and teacher-distill).
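The reported deviations from the teacher's PPO settings can be collected into a small override table. This is a hypothetical sketch: the key names follow stable-baselines3 conventions (`gamma`, `ent_coef`, `n_steps`), but the dictionary itself and its layout are illustrative, not taken from the sb3_distill codebase.

```python
# Illustrative per-method overrides applied on top of the teacher's PPO
# hyperparameters during distillation, as reported in the paper's text.
# (swimmer and hopper use gamma=0.995 instead of 0.999.)
DISTILL_OVERRIDES = {
    "ppd":             {"gamma": 0.999, "ent_coef": 0.0, "n_steps": 64},
    "student_distill": {"gamma": 0.999, "ent_coef": 0.0, "n_steps": 5},
    "teacher_distill": {"gamma": 0.999, "ent_coef": 0.0, "n_steps": 5},
}


def distill_kwargs(method, base_kwargs):
    """Merge the teacher's PPO kwargs with the method-specific overrides."""
    return {**base_kwargs, **DISTILL_OVERRIDES[method]}
```

A dict merge like this keeps the teacher's configuration as the single source of truth, with distillation runs differing only in the few parameters the paper calls out.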