Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

Authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier Alonso-Mora

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We implemented PARL on a set of robotics and autonomous driving tasks, evaluated the obtained rewards and entropy rates, and compared against different baselines. For the MuJoCo hyperparameters, we took pretuned values from Raffin et al. (2019)." Supporting evidence: Figure 2 (rewards and entropy rates for PAPPO as a function of k) and Table 1 (results for autonomous driving environments).
Researcher Affiliation | Academia | Daniel Jarne Ornia (University of Oxford); Giannis Delimpaltadakis (Eindhoven University of Technology); Jens Kober (Delft University of Technology); Javier Alonso-Mora (Delft University of Technology).
Pseudocode | Yes | Algorithm 1, "Predictability Aware Policy Gradient".
Open Source Code | Yes | "See the project repository https://github.com/tud-amr/parl for details."
Open Datasets | Yes | "We implemented PARL on a set of robotics and autonomous driving tasks, evaluated the obtained rewards and entropy rates, and compared against different baselines. For the MuJoCo hyperparameters, we took pretuned values from Raffin et al. (2019). We test PAPPO in the Highway Environment (Leurent, 2018), where an agent learns to drive at a desired speed while navigating a crowded road with other autonomous agents. These are based on Minigrid environments (Chevalier-Boisvert et al., 2023)."
Dataset Splits | No | No explicit static dataset splits (e.g., train/validation/test percentages or counts) are provided. The paper reports evaluating trained agents over 50 independent episodes or trajectories, which is typical for reinforcement learning, where data is generated dynamically rather than split from a fixed dataset.
Hardware Specification | No | The paper states: "All experiments were run in a single CPU, running Ubuntu 20.04." This "single CPU" description lacks the specifics, such as CPU model, core count, or clock speed, needed for hardware reproducibility.
Software Dependencies | No | "For our experiments, we took PPO and SAC parameters tuned from Stable-Baselines3 (Raffin et al., 2019) and Haarnoja et al. (2018), and used automatic hyperparameter tuning (Akiba et al., 2019) for model and predictability parameters." The paper names frameworks such as Stable-Baselines3 but gives no version numbers for these or other key software libraries used in the implementation.
Experiment Setup | No | "For the MuJoCo hyperparameters, we took pretuned values from Raffin et al. (2019). For the experiments using PASAC and explicit comparison against other SAC-based baselines, including RPC (Eysenbach et al., 2021), see Appendix C.1. We train all agents using the same hyperparameters, and we only vary the trade-off k in the PARL agents to evaluate the influence." The paper defers MuJoCo hyperparameter values to external work; for its own method it specifies only that k is varied, and it mentions a delay parameter without giving its value. No concrete values for common hyperparameters such as learning rates, batch sizes, or optimizer settings appear in the main text.
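The Pseudocode row above names Algorithm 1, "Predictability Aware Policy Gradient", but this report does not reproduce it. A minimal, hedged sketch of the core idea (the entropy rate folded into the reward, traded off by k) might look like the following; the shaping form, function names, and values here are assumptions for illustration, not taken from the paper.

```python
import math

def shaped_rewards(rewards, transition_logps, k):
    """Blend task reward with a predictability penalty (illustrative sketch).

    rewards: per-step task rewards r_t
    transition_logps: log p(s_{t+1} | s_t, a_t) under a learned transition model
    k: trade-off between task reward and entropy-rate minimization
    """
    # -log p is the per-step surprisal; its trajectory average estimates the
    # entropy rate, so penalizing it pushes the agent toward predictable dynamics.
    return [r - k * (-lp) for r, lp in zip(rewards, transition_logps)]

rewards = [1.0, 1.0, 1.0]
logps = [math.log(0.9), math.log(0.5), math.log(0.1)]
print(shaped_rewards(rewards, logps, k=0.0))  # → [1.0, 1.0, 1.0]
```

Under this assumption, k = 0 recovers the unmodified task reward, consistent with the report's note that only k is varied across PARL agents.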
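Since the assessment notes that the paper holds all other hyperparameters fixed and varies only the trade-off k, the evaluation protocol reduces to a sweep over k. The sketch below uses `train_and_evaluate` as a hypothetical placeholder, not the paper's actual training loop.

```python
def train_and_evaluate(k):
    """Hypothetical stand-in: real code would train a PARL agent with
    trade-off k and return (mean reward, entropy rate) over eval episodes."""
    return 1.0 / (1.0 + k), 0.1 * k

# Sweep the predictability trade-off k, mirroring the paper's ablation over k.
for k in [0.0, 0.1, 0.5, 1.0]:
    reward, entropy_rate = train_and_evaluate(k)
    print(f"k={k}: reward={reward:.2f}, entropy_rate={entropy_rate:.2f}")
```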
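The Software Dependencies row flags missing version numbers. One standard-library way to record the versions actually installed in an environment (the package names below are illustrative, not the paper's full dependency set) is:

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version} for each package, probing the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

print(installed_versions(["stable-baselines3", "optuna"]))
```

Reporting such a mapping (or a pinned requirements file) alongside the code would resolve the reproducibility gap noted above.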