Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Authors: Abdullah Akgül, Gulcin Baykal, Manuel Haussmann, Melih Kandemir
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return. Figure 1 illustrates the learning profiles of on-policy deep actor-critics in a continuous control task with non-stationary dynamics. |
| Researcher Affiliation | Academia | Abdullah Akgül EMAIL Department of Mathematics and Computer Science University of Southern Denmark; Gulcin Baykal EMAIL Department of Mathematics and Computer Science University of Southern Denmark; Manuel Haußmann EMAIL Department of Mathematics and Computer Science University of Southern Denmark; Melih Kandemir EMAIL Department of Mathematics and Computer Science University of Southern Denmark |
| Pseudocode | Yes | We provide pseudocode in Algorithm 1 illustrating how to implement EPPO variants by overlaying color-coded modifications on top of a standard PPO implementation, where each color corresponds to a specific EPPO variant. |
| Open Source Code | Yes | The implementation of the EPPO variants and the full experimental pipeline is available at https://github.com/adinlab/EPPO. |
| Open Datasets | Yes | We run our simulations on the Ant and Half Cheetah environments using the v5 versions of the MuJoCo environments (Todorov et al., 2012). For further details on the experimental pipeline and hyperparameters, see Section B. The implementation of the EPPO variants and the full experimental pipeline is available at https://github.com/adinlab/EPPO. |
| Dataset Splits | No | The paper uses reinforcement learning environments (Ant and Half Cheetah from MuJoCo) in which the agent generates data by interacting with the environment. It describes training steps and evaluation episodes ('We train EPPO for 500 000 steps per task... using 10 evaluation episodes') but does not describe a static dataset split into training/validation/test sets in the traditional supervised learning sense. While it describes how tasks change, it does not detail splits of a fixed dataset. |
| Hardware Specification | Yes | We perform our experiments using two computers equipped with GeForce RTX 4090 GPUs, an Intel(R) Core(TM) i7-14700K CPU running at 5.6 GHz, and 96 GB of memory. |
| Software Dependencies | No | The paper mentions several algorithms and concepts by their authors and year (e.g., 'Adam (Kingma & Ba, 2015)', 'Layer Normalization (Ba et al., 2016)', 'ReLU activations (Nair & Hinton, 2010)'), and the use of 'MuJoCo environments (Todorov et al., 2012)'. However, it does not provide specific version numbers for software libraries or programming languages used in the implementation, beyond mentioning 'v5' for the MuJoCo environments. |
| Experiment Setup | Yes | We list the hyperparameters for the experimental pipeline in Table 12. Training: Seeds [1, 2, ..., 15]; Number of steps per task: 500 000; Learning rate for actor and critic: 0.0003; Horizon: 2048; Number of epochs: 10; Minibatch size: 256; Clip rate ϵ: 0.2; GAE parameter λ: 0.95; Hidden dimensions of actor and critic: [256, 256]; Activation functions of actor and critic: ReLU; Normalization layers of actor and critic: LayerNorm; Optimizer for actor and critic: Adam; Discount factor γ: 0.99; Maximum gradient norm: 0.5 |
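As a quick-reference sketch, the Table 12 hyperparameters quoted in the Experiment Setup row can be collected into a single configuration dictionary. The key names below are our own illustrative choices, not identifiers from the EPPO codebase; the values are those reported by the paper.

```python
# Illustrative config sketch of the Table 12 hyperparameters.
# Key names are hypothetical; values are as reported in the paper.
EPPO_HYPERPARAMS = {
    "seeds": list(range(1, 16)),      # Seeds [1, 2, ..., 15]
    "steps_per_task": 500_000,        # training steps per task
    "learning_rate": 3e-4,            # shared by actor and critic
    "horizon": 2048,                  # rollout length per update
    "num_epochs": 10,                 # PPO epochs per rollout
    "minibatch_size": 256,
    "clip_rate": 0.2,                 # PPO clip epsilon
    "gae_lambda": 0.95,               # GAE parameter
    "hidden_dims": [256, 256],        # actor and critic MLP widths
    "activation": "ReLU",
    "normalization": "LayerNorm",
    "optimizer": "Adam",
    "discount_gamma": 0.99,
    "max_grad_norm": 0.5,             # gradient clipping threshold
}

# Sanity check: the rollout horizon splits evenly into minibatches,
# as a standard PPO update loop assumes.
assert EPPO_HYPERPARAMS["horizon"] % EPPO_HYPERPARAMS["minibatch_size"] == 0
```

A dictionary like this makes it easy to diff a local reproduction attempt against the reported settings before launching runs.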