When Maximum Entropy Misleads Policy Optimization
Authors: Ruipeng Zhang, Ya-Chien Chang, Sicun Gao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to better understanding of how to balance reward design and entropy maximization in challenging control problems. ... Fig. 4 shows the overall performance comparison of the learning curves of SAC and PPO across environments. |
| Researcher Affiliation | Academia | 1Computer Science and Engineering, UC San Diego. Correspondence to: Ruipeng Zhang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SAC with Adaptive Entropy (SAC-AdaEnt) |
| Open Source Code | No | The paper mentions and cites third-party open-source projects like 'Opencat: Open-source quadruped robot' (Petoi Camp), but does not provide specific access to source code for the authors' own methodology or implementation. |
| Open Datasets | Yes | Hopper is the standard MuJoCo environment (Todorov et al., 2012) where SAC typically learns faster and more stably than PPO. ... Acrobot is a two-link planar robot arm with one end fixed at the shoulder (Spong, 1995). ... OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016. |
| Dataset Splits | No | The paper describes various reinforcement learning environments (e.g., Vehicle, Quadrotor, Opencat, Acrobot, Obstacle2D, Hopper) which typically involve dynamic data generation rather than fixed dataset splits. No specific training, validation, or test dataset splits (e.g., percentages or sample counts) are mentioned for any of the environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU specifications, or memory amounts. |
| Software Dependencies | No | The paper mentions various algorithms and environments like SAC, PPO, DDPG, SQL, OpenAI Gym, and MuJoCo, and cites 'Stable-baselines3: Reliable reinforcement learning implementations'. However, it does not specify version numbers for any of the software dependencies used in the experiments. |
| Experiment Setup | Yes | Table 1. Hyperparameters for SAC (SAC-auto-alpha), PPO, and DDPG ... Table 2. Learning Rates for SAC and PPO Across Different Environments |
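For context on the "SAC-auto-alpha" configuration cited above: this conventionally refers to SAC's automatic temperature adjustment (Haarnoja et al., 2018), where the entropy coefficient alpha is tuned by gradient descent toward a target entropy. The sketch below shows that standard update only; it is not the paper's SAC-AdaEnt algorithm, and the function name and values are illustrative.

```python
import math

def alpha_update(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient step on the standard SAC temperature objective
    J(alpha) = E[-alpha * (log pi(a|s) + H_target)].

    Optimizing log_alpha keeps alpha positive; the gradient is
    dJ/d(log_alpha) = -alpha * mean(log pi(a|s) + H_target).
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_term
    return log_alpha - lr * grad  # gradient-descent step

# Illustration: when policy entropy is above target (log-probs very negative),
# the update lowers alpha, shrinking the exploration bonus.
new_log_alpha = alpha_update(0.0, log_probs=[-3.0, -2.5], target_entropy=-1.0)
```

This single-scalar update is why "SAC-auto-alpha" appears as a hyperparameter variant in Table 1: the only extra knob is the target entropy (commonly set to minus the action dimension).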