Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Authors: Abdullah Akgül, Gulcin Baykal, Manuel Haussmann, Melih Kandemir
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return. Figure 1 illustrates the learning profiles of on-policy deep actor-critics in a continuous control task with non-stationary dynamics. |
| Researcher Affiliation | Academia | Abdullah Akgül EMAIL Department of Mathematics and Computer Science University of Southern Denmark; Gulcin Baykal EMAIL Department of Mathematics and Computer Science University of Southern Denmark; Manuel Haußmann EMAIL Department of Mathematics and Computer Science University of Southern Denmark; Melih Kandemir EMAIL Department of Mathematics and Computer Science University of Southern Denmark |
| Pseudocode | Yes | We provide pseudocode in Algorithm 1 illustrating how to implement EPPO variants by overlaying color-coded modifications on top of a standard PPO implementation, where each color corresponds to a specific EPPO variant. |
| Open Source Code | Yes | The implementation of the EPPO variants and the full experimental pipeline is available at https://github.com/adinlab/EPPO. |
| Open Datasets | Yes | We run our simulations on the Ant and Half Cheetah environments using the v5 versions of the MuJoCo environments (Todorov et al., 2012). For further details on the experimental pipeline and hyperparameters, see Section B. The implementation of the EPPO variants and the full experimental pipeline is available at https://github.com/adinlab/EPPO. |
| Dataset Splits | No | The paper uses reinforcement learning environments (Ant and Half Cheetah from MuJoCo) in which the agent generates data by interacting with the environment. It describes training steps and evaluation episodes ('We train EPPO for 500 000 steps per task... using 10 evaluation episodes') but does not describe a static dataset split into training/validation/test sets in the traditional supervised learning sense. While it describes how tasks change, it does not detail splits of a fixed dataset. |
| Hardware Specification | Yes | We perform our experiments using two computers equipped with GeForce RTX 4090 GPUs, an Intel(R) Core(TM) i7-14700K CPU running at 5.6 GHz, and 96 GB of memory. |
| Software Dependencies | No | The paper mentions several algorithms and concepts by their authors and year (e.g., 'Adam (Kingma & Ba, 2015)', 'Layer Normalization (Ba et al., 2016)', 'ReLU activations (Nair & Hinton, 2010)'), and the use of 'MuJoCo environments (Todorov et al., 2012)'. However, it does not provide specific version numbers for software libraries or programming languages used in the implementation, beyond mentioning 'v5' for the MuJoCo environments. |
| Experiment Setup | Yes | We list the hyperparameters for the experimental pipeline in Table 12. Training: Seeds [1, 2, ..., 15]; Number of steps per task: 500 000; Learning rate for actor and critic: 0.0003; Horizon: 2048; Number of epochs: 10; Minibatch size: 256; Clip rate ϵ: 0.2; GAE parameter λ: 0.95; Hidden dimensions of actor and critic: [256, 256]; Activation functions of actor and critic: ReLU; Normalization layers of actor and critic: LayerNorm; Optimizer for actor and critic: Adam; Discount factor γ: 0.99; Maximum gradient norm: 0.5 |
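As a quick-reference sketch, the Table 12 hyperparameters quoted in the Experiment Setup row can be collected into a single configuration dictionary. The key names below are our own illustrative choices, not identifiers from the EPPO codebase; the values are those reported by the paper.

```python
# Illustrative config sketch of the Table 12 hyperparameters.
# Key names are hypothetical; values are as reported in the paper.
EPPO_HYPERPARAMS = {
    "seeds": list(range(1, 16)),      # Seeds [1, 2, ..., 15]
    "steps_per_task": 500_000,        # training steps per task
    "learning_rate": 3e-4,            # shared by actor and critic
    "horizon": 2048,                  # rollout length per update
    "num_epochs": 10,                 # PPO epochs per rollout
    "minibatch_size": 256,
    "clip_rate": 0.2,                 # PPO clip epsilon
    "gae_lambda": 0.95,               # GAE parameter
    "hidden_dims": [256, 256],        # actor and critic MLP widths
    "activation": "ReLU",
    "normalization": "LayerNorm",
    "optimizer": "Adam",
    "discount_gamma": 0.99,
    "max_grad_norm": 0.5,             # gradient clipping threshold
}

# Sanity check: the rollout horizon splits evenly into minibatches,
# as a standard PPO update loop assumes.
assert EPPO_HYPERPARAMS["horizon"] % EPPO_HYPERPARAMS["minibatch_size"] == 0
```

A dictionary like this makes it easy to diff a local reproduction attempt against the reported settings before launching runs.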