Stay Hungry, Keep Learning: Sustainable Plasticity for Deep Reinforcement Learning
Authors: Huaicheng Zhou, Zifeng Zhuang, Donglin Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments were conducted across diverse environments, including MuJoCo (Todorov et al., 2012), the DeepMind Control Suite (Tassa et al., 2018), and a specially designed MuJoCo variant called Cycle Friction that tests adaptation to changing dynamics. Results demonstrate that P3O consistently outperforms standard PPO, achieving both higher average returns and more stable learning curves, validating the effectiveness of our neuron regeneration mechanism in maintaining policy plasticity while enhancing sample efficiency. |
| Researcher Affiliation | Academia | 1School of Engineering, Westlake University, Hangzhou, China. Correspondence to: Donglin Wang <EMAIL>. |
| Pseudocode | Yes | A.3.1 Sustainable Backup Propagation. Algorithm 1: Sustainable Backup Propagation (SBP). Inputs: neural network fθ, temporary model ftmp, reset rate γ, training steps T, reset frequency F, reset index p = 0, distillation threshold τ, distillation loss d = None. A.3.2 Plastic Proximal Policy Optimization. Algorithm 2: Plastic PPO (P3O). Inputs: policy πθ, temporary policy πtmp, reset rate γ, training steps T, reset frequency F, reset index p = 0, distillation threshold τ, distillation loss d = None. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Environment & Task: To evaluate our algorithm's performance, we employed a diverse set of tasks. These include standard benchmarks from MuJoCo (Todorov et al., 2012) and the state-based versions of the DeepMind Control Suite (DMC) (Tassa et al., 2018). Additionally, we introduce the Cycle Friction Control task, an innovative variant of the MuJoCo environment inspired by the slip MuJoCo task (Dohare et al., 2024). |
| Dataset Splits | No | The paper mentions that experiments were conducted with 5 different random seeds and results are mean values with standard deviations, and that PPO is utilized to update policies during interaction with the environment (on-policy learning). However, it does not explicitly describe train/test/validation splits for the datasets or environments, as typically defined in supervised learning contexts for reproducibility of data partitioning. |
| Hardware Specification | Yes | In our experiments, we utilized a machine equipped with an NVIDIA V100 (32GB) GPU to measure the update time for the PPO, which averaged approximately 0.30 seconds per update epoch. |
| Software Dependencies | Yes | Python 3.8; PyTorch 2.0.1 (Paszke et al., 2019); Gym 0.23.1 (Brockman et al., 2016); MuJoCo 2.3.7 (Todorov et al., 2012); mujoco-py 2.1.2.14 |
| Experiment Setup | Yes | Table 2 (Algorithm Parameters). Optimizer: Adam (Kingma & Ba, 2014); Learning Rate (Actor & Critic): 3e-4; Online Replay Buffer Size: 8192; Mini-batch Size: 256; Discount Factor: 0.99; Training Steps: 1.5e7; Epochs per Update: 10; Clip Range: 0.2; Clip Grad Norm: 0.5; Architecture: Actor & Critic Hidden Size 256, Hidden Layers 3, Activation Tanh; Reset Rate: 0.01; Reset Frequency: 50000 environment steps; Neuron Utility Type: Neuron Lifetime; DKL α: 0.4; Distillation Loss Bound τ: 0.01 |
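The pseudocode entry describes a periodic neuron-regeneration step: at a fixed reset frequency, a fraction γ of neurons is reinitialized while a temporary model and a distillation loss (bounded by τ) preserve behavior. A minimal sketch of the reset step for one hidden layer is below; the utility score (the paper's "Neuron Lifetime" type) is treated here as an arbitrary per-neuron score, and the function name and initialization scheme are illustrative, not the authors' exact implementation.

```python
import numpy as np

def reset_low_utility_neurons(W_in, b, W_out, utility, gamma, rng):
    """Reinitialize the fraction gamma of hidden units with the lowest
    utility scores (sketch of a neuron-regeneration step).

    W_in:  (hidden, in) incoming weights      b: (hidden,) biases
    W_out: (out, hidden) outgoing weights     utility: (hidden,) scores
    Returns the indices of the reset units.
    """
    n_hidden = W_in.shape[0]
    k = max(1, int(gamma * n_hidden))
    reset_idx = np.argsort(utility)[:k]  # lowest-utility neurons first
    # Re-draw incoming weights and bias for the reset units; zero their
    # outgoing weights so the network's output is unchanged at reset time.
    W_in[reset_idx] = rng.normal(0.0, 0.1, size=(k, W_in.shape[1]))
    b[reset_idx] = 0.0
    W_out[:, reset_idx] = 0.0
    return reset_idx
```

In the full algorithm this step would run every F environment steps, with the reset network then distilled against the temporary copy until the distillation loss falls below τ.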
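For convenience, the Table 2 hyperparameters above can be collected into a plain config mapping (key names are illustrative; values are copied verbatim from the table):

```python
# Hyperparameters reported in Table 2 of the paper; key names are
# our own, values are as reported.
P3O_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,          # actor & critic
    "replay_buffer_size": 8192,
    "minibatch_size": 256,
    "discount_factor": 0.99,
    "training_steps": int(1.5e7),
    "epochs_per_update": 10,
    "clip_range": 0.2,
    "clip_grad_norm": 0.5,
    "hidden_size": 256,             # actor & critic
    "hidden_layers": 3,
    "activation": "tanh",
    "reset_rate": 0.01,
    "reset_frequency": 50_000,      # environment steps
    "neuron_utility_type": "neuron_lifetime",
    "dkl_alpha": 0.4,
    "distillation_loss_bound": 0.01,
}
```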