Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

Authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier Alonso-Mora

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We implemented PARL on a set of robotics and autonomous driving tasks, evaluated the obtained rewards and entropy rates, and compared against different baselines. For the MuJoCo hyperparameters, we took pretuned values from Raffin et al. (2019)." Supporting evidence: Figure 2 (rewards and entropy rates for PAPPO as a function of k) and Table 1 (results for autonomous driving environments).
Researcher Affiliation | Academia | Daniel Jarne Ornia (University of Oxford); Giannis Delimpaltadakis (Eindhoven University of Technology); Jens Kober (Delft University of Technology); Javier Alonso-Mora (Delft University of Technology).
Pseudocode | Yes | Algorithm 1, "Predictability Aware Policy Gradient".
Open Source Code | Yes | "See the project repository https://github.com/tud-amr/parl for details."
Open Datasets | Yes | "We implemented PARL on a set of robotics and autonomous driving tasks, evaluated the obtained rewards and entropy rates, and compared against different baselines. For the MuJoCo hyperparameters, we took pretuned values from Raffin et al. (2019). We test PAPPO in the Highway Environment (Leurent, 2018), where an agent learns to drive at a desired speed while navigating a crowded road with other autonomous agents. These are based on Minigrid environments (Chevalier-Boisvert et al., 2023)."
Dataset Splits | No | No explicit static dataset splits (e.g., train/validation/test percentages or counts) are provided. The paper reports evaluating trained agents over 50 independent episodes or trajectories, which is typical for reinforcement learning, where data is generated dynamically rather than split from a fixed dataset.
Hardware Specification | No | The paper states: "All experiments were run in a single CPU, running Ubuntu 20.04." This "single CPU" description lacks the specifics, such as CPU model, core count, or clock speed, needed for hardware reproducibility.
Software Dependencies | No | "For our experiments, we took PPO and SAC parameters tuned from Stable-Baselines3 (Raffin et al., 2019) and Haarnoja et al. (2018), and used automatic hyperparameter tuning (Akiba et al., 2019) for model and predictability parameters." The paper names frameworks such as Stable-Baselines3 but gives no version numbers for these or other key software libraries used in the implementation.
Experiment Setup | No | "For the MuJoCo hyperparameters, we took pretuned values from Raffin et al. (2019). For the experiments using PASAC and explicit comparison against other SAC-based baselines, including RPC (Eysenbach et al., 2021), see Appendix C.1. We train all agents using the same hyperparameters, and we only vary the trade-off k in the PARL agents to evaluate the influence." The paper defers MuJoCo hyperparameter values to external work; for its own method it specifies only that k is varied, and it mentions a delay parameter without giving its value. No concrete values for common hyperparameters such as learning rates, batch sizes, or optimizer settings appear in the main text.
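The Pseudocode row above names Algorithm 1, "Predictability Aware Policy Gradient", but this report does not reproduce it. A minimal, hedged sketch of the core idea (the entropy rate folded into the reward, traded off by k) might look like the following; the shaping form, function names, and values here are assumptions for illustration, not taken from the paper.

```python
import math

def shaped_rewards(rewards, transition_logps, k):
    """Blend task reward with a predictability penalty (illustrative sketch).

    rewards: per-step task rewards r_t
    transition_logps: log p(s_{t+1} | s_t, a_t) under a learned transition model
    k: trade-off between task reward and entropy-rate minimization
    """
    # -log p is the per-step surprisal; its trajectory average estimates the
    # entropy rate, so penalizing it pushes the agent toward predictable dynamics.
    return [r - k * (-lp) for r, lp in zip(rewards, transition_logps)]

rewards = [1.0, 1.0, 1.0]
logps = [math.log(0.9), math.log(0.5), math.log(0.1)]
print(shaped_rewards(rewards, logps, k=0.0))  # → [1.0, 1.0, 1.0]
```

Under this assumption, k = 0 recovers the unmodified task reward, consistent with the report's note that only k is varied across PARL agents.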
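Since the assessment notes that the paper holds all other hyperparameters fixed and varies only the trade-off k, the evaluation protocol reduces to a sweep over k. The sketch below uses `train_and_evaluate` as a hypothetical placeholder, not the paper's actual training loop.

```python
def train_and_evaluate(k):
    """Hypothetical stand-in: real code would train a PARL agent with
    trade-off k and return (mean reward, entropy rate) over eval episodes."""
    return 1.0 / (1.0 + k), 0.1 * k

# Sweep the predictability trade-off k, mirroring the paper's ablation over k.
for k in [0.0, 0.1, 0.5, 1.0]:
    reward, entropy_rate = train_and_evaluate(k)
    print(f"k={k}: reward={reward:.2f}, entropy_rate={entropy_rate:.2f}")
```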
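The Software Dependencies row flags missing version numbers. One standard-library way to record the versions actually installed in an environment (the package names below are illustrative, not the paper's full dependency set) is:

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version} for each package, probing the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

print(installed_versions(["stable-baselines3", "optuna"]))
```

Reporting such a mapping (or a pinned requirements file) alongside the code would resolve the reproducibility gap noted above.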