Stay Hungry, Keep Learning: Sustainable Plasticity for Deep Reinforcement Learning
Authors: Huaicheng Zhou, Zifeng Zhuang, Donglin Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments were conducted across diverse environments, including MuJoCo (Todorov et al., 2012), the DeepMind Control Suite (Tassa et al., 2018), and a specially designed MuJoCo variant called Cycle Friction that tests adaptation to changing dynamics. Results demonstrate that P3O consistently outperforms standard PPO, achieving both higher average returns and more stable learning curves, validating the effectiveness of our neuron regeneration mechanism in maintaining policy plasticity while enhancing sample efficiency. |
| Researcher Affiliation | Academia | 1School of Engineering, Westlake University, Hangzhou, China. Correspondence to: Donglin Wang <EMAIL>. |
| Pseudocode | Yes | A.3.1 Sustainable Backup Propagation. Algorithm 1: Sustainable Backup Propagation (SBP). Inputs: neural network fθ, temporary model ftmp, reset rate γ, training steps T, reset frequency F, reset index p = 0, distillation threshold τ, distillation loss d = None. A.3.2 Plastic Proximal Policy Optimization. Algorithm 2: Plastic PPO (P3O). Inputs: policy πθ, temporary policy πtmp, reset rate γ, training steps T, reset frequency F, reset index p = 0, distillation threshold τ, distillation loss d = None. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Environment & Task: To evaluate our algorithm's performance, we employed a diverse set of tasks. These include standard benchmarks from MuJoCo (Todorov et al., 2012) and the state-based versions of the DeepMind Control Suite (DMC) (Tassa et al., 2018). Additionally, we introduce the Cycle Friction Control task, an innovative variant of the MuJoCo environment inspired by the slip MuJoCo task (Dohare et al., 2024). |
| Dataset Splits | No | The paper mentions that experiments were conducted with 5 different random seeds and results are mean values with standard deviations, and that PPO is utilized to update policies during interaction with the environment (on-policy learning). However, it does not explicitly describe train/test/validation splits for the datasets or environments, as typically defined in supervised learning contexts for reproducibility of data partitioning. |
| Hardware Specification | Yes | In our experiments, we utilized a machine equipped with an NVIDIA V100 (32GB) GPU to measure the update time for the PPO, which averaged approximately 0.30 seconds per update epoch. |
| Software Dependencies | Yes | Python 3.8; PyTorch 2.0.1 (Paszke et al., 2019); Gym 0.23.1 (Brockman et al., 2016); MuJoCo 2.3.7 (Todorov et al., 2012); mujoco-py 2.1.2.14 |
| Experiment Setup | Yes | Table 2 (Algorithm Parameters). Optimizer: Adam (Kingma & Ba, 2014); Learning Rate (Actor & Critic): 3e-4; Online Replay Buffer Size: 8192; Mini-batch Size: 256; Discount Factor: 0.99; Training Steps: 1.5e7; Epochs per Update: 10; Clip Range: 0.2; Clip Grad Norm: 0.5; Architecture: Actor & Critic Hidden Size 256, Hidden Layers 3, Activation Tanh; Reset Rate: 0.01; Reset Frequency: 50000 environment steps; Neuron Utility Type: Neuron Lifetime; DKL α: 0.4; Distillation Loss Bound τ: 0.01 |
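The pseudocode entry describes a periodic neuron-regeneration step: at a fixed reset frequency, a fraction γ of neurons is reinitialized while a temporary model and a distillation loss (bounded by τ) preserve behavior. A minimal sketch of the reset step for one hidden layer is below; the utility score (the paper's "Neuron Lifetime" type) is treated here as an arbitrary per-neuron score, and the function name and initialization scheme are illustrative, not the authors' exact implementation.

```python
import numpy as np

def reset_low_utility_neurons(W_in, b, W_out, utility, gamma, rng):
    """Reinitialize the fraction gamma of hidden units with the lowest
    utility scores (sketch of a neuron-regeneration step).

    W_in:  (hidden, in) incoming weights      b: (hidden,) biases
    W_out: (out, hidden) outgoing weights     utility: (hidden,) scores
    Returns the indices of the reset units.
    """
    n_hidden = W_in.shape[0]
    k = max(1, int(gamma * n_hidden))
    reset_idx = np.argsort(utility)[:k]  # lowest-utility neurons first
    # Re-draw incoming weights and bias for the reset units; zero their
    # outgoing weights so the network's output is unchanged at reset time.
    W_in[reset_idx] = rng.normal(0.0, 0.1, size=(k, W_in.shape[1]))
    b[reset_idx] = 0.0
    W_out[:, reset_idx] = 0.0
    return reset_idx
```

In the full algorithm this step would run every F environment steps, with the reset network then distilled against the temporary copy until the distillation loss falls below τ.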
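For convenience, the Table 2 hyperparameters above can be collected into a plain config mapping (key names are illustrative; values are copied verbatim from the table):

```python
# Hyperparameters reported in Table 2 of the paper; key names are
# our own, values are as reported.
P3O_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,          # actor & critic
    "replay_buffer_size": 8192,
    "minibatch_size": 256,
    "discount_factor": 0.99,
    "training_steps": int(1.5e7),
    "epochs_per_update": 10,
    "clip_range": 0.2,
    "clip_grad_norm": 0.5,
    "hidden_size": 256,             # actor & critic
    "hidden_layers": 3,
    "activation": "tanh",
    "reset_rate": 0.01,
    "reset_frequency": 50_000,      # environment steps
    "neuron_utility_type": "neuron_lifetime",
    "dkl_alpha": 0.4,
    "distillation_loss_bound": 0.01,
}
```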