Efficient Online Reinforcement Learning for Diffusion Policy

Authors: Haitong Ma, Tianyi Chen, Kai Wang, Na Li, Bo Dai

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RL methods on most tasks, and DPMD improves by more than 120% over Soft Actor-Critic on Humanoid and Ant. We conduct extensive empirical evaluation on MuJoCo, showing that the proposed algorithms outperform recent diffusion-based online RL baselines in most tasks.
Researcher Affiliation | Academia | Harvard University; Georgia Institute of Technology. Emails: Haitong Ma <EMAIL>, Na Li <EMAIL>, Bo Dai <EMAIL>.
Pseudocode | Yes | Algorithm 1: Diffusion Policy Mirror Descent (DPMD) ... Algorithm 2: Soft Diffusion Actor-Critic (SDAC)
Open Source Code | Yes | The implementation can be found at https://github.com/mahaitongdae/diffusion policy online rl.
Open Datasets | Yes | Performance is evaluated on 10 OpenAI Gym MuJoCo v4 tasks.
Dataset Splits | Yes | All environments except Humanoid-v4 are trained for 200K iterations with a total of 1 million environment interactions; Humanoid-v4 uses five times more. Results are evaluated as the average return over 20 episodes across 5 random seeds.
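The reported metric above (average return over 20 episodes across 5 seeds) can be sketched as follows; the function and variable names are illustrative, not taken from the paper's code.

```python
# Aggregate per-episode returns into the reported score: for each seed,
# average over its evaluation episodes, then average across seeds.
def average_return(returns_by_seed):
    """returns_by_seed: one list of per-episode returns per random seed."""
    per_seed_means = [sum(eps) / len(eps) for eps in returns_by_seed]
    return sum(per_seed_means) / len(per_seed_means)

# Example with dummy data: 5 seeds x 20 episodes each.
dummy = [[100.0 + seed for _ in range(20)] for seed in range(5)]
score = average_return(dummy)  # -> 102.0
```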
Hardware Specification | Yes | Experiments were run on a desktop workstation with an AMD Ryzen 9 7950X CPU, 96 GB of memory, and an NVIDIA RTX 4090 GPU.
Software Dependencies | No | We implemented the proposed DPMD and SDAC algorithms with the JAX package.
Experiment Setup | Yes | Table 4 (Hyperparameters):
Critic learning rate | 3e-4
Policy learning rate | 3e-4, linearly annealed to 3e-5
Diffusion steps | 20
Diffusion noise schedule | Cosine
Policy network hidden layers | 3
Policy network hidden neurons | 256
Policy network activation | Mish
Value network hidden layers | 3
Value network hidden neurons | 256
Value network activation | Mish
Replay buffer size (off-policy only) | 1 million
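The hyperparameters above can be collected into a config, and the named cosine noise schedule sketched for the 20 diffusion steps. The paper only names the schedule, so the standard cosine formulation and its constants (offset s, beta clipping) are assumptions here, as are all identifier names.

```python
import math

# Hyperparameters transcribed from Table 4; key names are illustrative.
HPARAMS = {
    "critic_lr": 3e-4,
    "policy_lr": 3e-4,            # linearly annealed to 3e-5
    "policy_lr_final": 3e-5,
    "diffusion_steps": 20,
    "noise_schedule": "cosine",
    "hidden_layers": 3,           # both policy and value networks
    "hidden_neurons": 256,
    "activation": "mish",
    "replay_buffer_size": 1_000_000,  # off-policy only
}

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Standard cosine noise schedule (assumed here, not confirmed by the
    source): alpha_bar(t) follows a squared cosine, and each step's beta
    is derived from the ratio of consecutive alpha_bar values."""
    def alpha_bar(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - alpha_bar(t + 1) / alpha_bar(t), max_beta)
            for t in range(T)]

betas = cosine_betas(HPARAMS["diffusion_steps"])
```

With 20 steps the betas start near zero and grow toward the clipping value, which keeps early denoising steps gentle and late steps aggressive.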