Maximum Entropy Reinforcement Learning with Diffusion Policy
Authors: Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP. ... In this section, we conduct experiments to address the following questions: (1) Can MaxEntDP effectively learn a multi-modal policy in a multi-goal task? (2) Does the diffusion policy outperform the Gaussian policy and other generative models within the MaxEnt RL framework? (3) How does performance vary when replacing the Q-weighted Noise Estimation method with competing approaches, such as QSM and iDEM? (4) How does MaxEntDP compare to other diffusion-based online RL algorithms? (5) Does the MaxEnt RL objective benefit policy training? |
| Researcher Affiliation | Academia | Xiaoyi Dong 1 2 Jian Cheng 1 3 4 Xi Sheryl Zhang 1 4 1C2DL, Institute of Automation, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3School of Future Technology, University of Chinese Academy of Sciences 4AiRiA. Correspondence to: Xi Sheryl Zhang <EMAIL>. |
| Pseudocode | Yes | The pseudocode for our method is presented in Algorithm 1. Algorithm 1 MaxEnt RL with Diffusion Policy |
| Open Source Code | Yes | Our code is available at https://github.com/diffusionyes/MaxEntDP. |
| Open Datasets | Yes | Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework... We test MaxEntDP on 3 high-dimensional tasks on the DeepMind Control Suite benchmarks. |
| Dataset Splits | No | The paper uses "MuJoCo benchmarks" and "DeepMind Control Suite benchmarks", which are simulation environments for reinforcement learning. It describes the experimental setup in terms of "environment interactions" but does not specify any fixed train/validation/test splits, percentages, or an explicit splitting methodology; such static-dataset splits are not typically applicable to online reinforcement learning. |
| Hardware Specification | Yes | All experiments in this paper are conducted on a GPU of NVIDIA GeForce RTX 3090 and a CPU of AMD EPYC 7742. |
| Software Dependencies | No | The paper mentions "Leveraging the computation efficiency of JAX (Frostig et al., 2018)" but does not specify a version number for JAX or any other key software library used in the implementation, nor versions for the official code of the baseline algorithms. |
| Experiment Setup | Yes | The shared hyperparameters of all algorithms are listed in Table 1. Table 1. The shared hyperparameters of all algorithms. Hyperparameter MaxEntDP SAC MEow TD3 QSM DACER QVPO DIPO Batch size 256 ... Diffusion steps 20 ... Actor learning rate 3e-4 ... |
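The "Diffusion steps 20" hyperparameter quoted above refers to the length of the reverse diffusion chain a diffusion policy runs each time it selects an action. As a rough illustration only, and not the paper's implementation, the sketch below shows a generic DDPM-style reverse sampler with 20 steps; the linear beta schedule, the stub `noise_net`, and the final clipping to the MuJoCo action range are all assumptions introduced here.

```python
import numpy as np

T = 20                                   # diffusion steps, matching Table 1
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def noise_net(a_t, t, state):
    """Stand-in for the learned noise-prediction network (hypothetical)."""
    return np.zeros_like(a_t)            # stub: always predicts zero noise

def sample_action(state, action_dim, rng):
    """Draw one action by running the reverse diffusion chain."""
    a = rng.standard_normal(action_dim)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = noise_net(a, t, state)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        a = (a - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                        # inject noise except at the last step
            a = a + np.sqrt(betas[t]) * rng.standard_normal(action_dim)
    return np.clip(a, -1.0, 1.0)         # MuJoCo actions lie in [-1, 1]

rng = np.random.default_rng(0)
action = sample_action(state=None, action_dim=6, rng=rng)
print(action.shape)  # (6,)
```

With a trained noise network in place of the stub, each environment step would pay for 20 network evaluations, which is why the diffusion-step count is a shared hyperparameter worth reporting for reproducibility.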