When Maximum Entropy Misleads Policy Optimization
Authors: Ruipeng Zhang, Ya-Chien Chang, Sicun Gao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to better understanding of how to balance reward design and entropy maximization in challenging control problems. ... Fig. 4 shows the overall performance comparison of the learning curves of SAC and PPO across environments. |
| Researcher Affiliation | Academia | 1Computer Science and Engineering, UC San Diego. Correspondence to: Ruipeng Zhang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SAC with Adaptive Entropy (SAC-AdaEnt) |
| Open Source Code | No | The paper mentions and cites third-party open-source projects like 'Opencat: Open-source quadruped robot' (Petoi Camp), but does not provide specific access to source code for the authors' own methodology or implementation. |
| Open Datasets | Yes | Hopper is the standard MuJoCo environment (Todorov et al., 2012) where SAC typically learns faster and more stably than PPO. ... Acrobot is a two-link planar robot arm with one end fixed at the shoulder (Spong, 1995). ... OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016. |
| Dataset Splits | No | The paper describes various reinforcement learning environments (e.g., Vehicle, Quadrotor, Opencat, Acrobot, Obstacle2D, Hopper) which typically involve dynamic data generation rather than fixed dataset splits. No specific training, validation, or test dataset splits (e.g., percentages or sample counts) are mentioned for any of the environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU specifications, or memory amounts. |
| Software Dependencies | No | The paper mentions various algorithms and environments like SAC, PPO, DDPG, SQL, OpenAI Gym, and MuJoCo, and cites 'Stable-baselines3: Reliable reinforcement learning implementations'. However, it does not specify version numbers for any of the software dependencies used in the experiments. |
| Experiment Setup | Yes | Table 1. Hyperparameters for SAC (SAC-auto-alpha), PPO, and DDPG ... Table 2. Learning Rates for SAC and PPO Across Different Environments |
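For context on the "SAC-auto-alpha" configuration cited above: this conventionally refers to SAC's automatic temperature adjustment (Haarnoja et al., 2018), where the entropy coefficient alpha is tuned by gradient descent toward a target entropy. The sketch below shows that standard update only; it is not the paper's SAC-AdaEnt algorithm, and the function name and values are illustrative.

```python
import math

def alpha_update(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient step on the standard SAC temperature objective
    J(alpha) = E[-alpha * (log pi(a|s) + H_target)].

    Optimizing log_alpha keeps alpha positive; the gradient is
    dJ/d(log_alpha) = -alpha * mean(log pi(a|s) + H_target).
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_term
    return log_alpha - lr * grad  # gradient-descent step

# Illustration: when policy entropy is above target (log-probs very negative),
# the update lowers alpha, shrinking the exploration bonus.
new_log_alpha = alpha_update(0.0, log_probs=[-3.0, -2.5], target_entropy=-1.0)
```

This single-scalar update is why "SAC-auto-alpha" appears as a hyperparameter variant in Table 1: the only extra knob is the target entropy (commonly set to minus the action dimension).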