Maximum Entropy Reinforcement Learning with Diffusion Policy

Authors: Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP. ... In this section, we conduct experiments to address the following questions: (1) Can MaxEntDP effectively learn a multi-modal policy in a multi-goal task? (2) Does the diffusion policy outperform the Gaussian policy and other generative models within the MaxEnt RL framework? (3) How does performance vary when replacing the Q-weighted Noise Estimation method with competing approaches, such as QSM and iDEM? (4) How does MaxEntDP compare to other diffusion-based online RL algorithms? (5) Does the MaxEnt RL objective benefit policy training?
Researcher Affiliation Academia Xiaoyi Dong (1,2), Jian Cheng (1,3,4), Xi Sheryl Zhang (1,4). (1) C2DL, Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) School of Future Technology, University of Chinese Academy of Sciences; (4) AiRiA. Correspondence to: Xi Sheryl Zhang <EMAIL>.
Pseudocode Yes The pseudocode for our method is presented in Algorithm 1. Algorithm 1 MaxEnt RL with Diffusion Policy
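For context, Algorithm 1 builds on the standard maximum-entropy RL objective, which augments the expected return with a policy-entropy bonus (this is the standard formulation, not a quote from the paper):

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} \left( r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right)\right]
```

where the temperature α trades off reward maximization against policy stochasticity.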
Open Source Code Yes Our code is available at https://github.com/diffusionyes/MaxEntDP.
Open Datasets Yes Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework... We test MaxEntDP on 3 high-dimensional tasks on the DeepMind Control Suite benchmarks.
Dataset Splits No The paper uses the MuJoCo and DeepMind Control Suite benchmarks, which are simulation environments for reinforcement learning. It describes the experimental setup in terms of environment interactions but does not specify fixed training/validation/test splits, percentages, or a splitting methodology for static datasets; such splits are not typically applicable in online reinforcement learning.
Hardware Specification Yes All experiments in this paper are conducted on a GPU of NVIDIA GeForce RTX 3090 and a CPU of AMD EPYC 7742.
Software Dependencies No The paper mentions "Leveraging the computation efficiency of JAX (Frostig et al., 2018)" but does not specify a version number for JAX or for any other key software library used in the implementation, nor versions for the official code of the baseline algorithms.
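Since the report flags missing dependency versions, a minimal way to record them alongside experiment logs is to query installed package metadata (a generic stdlib sketch; the package names passed in are illustrative, not taken from the paper):

```python
from importlib import metadata


def package_versions(names):
    """Return {package: installed version or 'not installed'},
    suitable for dumping into a reproducibility log."""
    out = {}
    for name in names:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            out[name] = "not installed"
    return out


# Example: record the libraries an experiment depends on.
print(package_versions(["jax", "numpy"]))
```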
Experiment Setup Yes The shared hyperparameters of all algorithms are listed in Table 1. Table 1. The shared hyperparameters of all algorithms. Hyperparameter (for MaxEntDP, SAC, MEow, TD3, QSM, DACER, QVPO, DIPO): Batch size 256 ... Diffusion steps 20 ... Actor learning rate 3e-4 ...
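The shared settings quoted from Table 1 can be pinned in a small config mapping (a sketch containing only the three values visible in this report; the remaining rows of Table 1 are elided here):

```python
# Shared hyperparameters quoted from Table 1 of the paper.
# Only the values quoted in this report are included.
shared_hparams = {
    "batch_size": 256,       # minibatch size, shared by all algorithms
    "diffusion_steps": 20,   # denoising steps for the diffusion policy
    "actor_lr": 3e-4,        # actor learning rate
}

for name, value in shared_hparams.items():
    print(f"{name}: {value}")
```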