Efficient Online Reinforcement Learning for Diffusion Policy
Authors: Haitong Ma, Tianyi Chen, Kai Wang, Na Li, Bo Dai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RL baselines on most tasks, and DPMD improves by more than 120% over Soft Actor-Critic on Humanoid and Ant. |
| Researcher Affiliation | Academia | 1Harvard University. 2Georgia Institute of Technology. Emails: Haitong Ma <EMAIL>, Na Li <EMAIL>, Bo Dai <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Diffusion Policy Mirror Descent (DPMD) ... Algorithm 2 Soft Diffusion Actor-Critic (SDAC) |
| Open Source Code | Yes | The implementation can be found at https://github.com/mahaitongdae/diffusion_policy_online_rl. |
| Open Datasets | Yes | evaluated the performance on 10 OpenAI Gym MuJoCo v4 tasks. |
| Dataset Splits | Yes | All environments except Humanoid-v4 are trained over 200K iterations with a total of 1 million environment interactions, while Humanoid-v4 is trained for five times as many. The results are evaluated with the average return of 20 episodes across 5 random seeds. |
| Hardware Specification | Yes | The computation is conducted on a desktop workstation with AMD Ryzen 9 7950X CPU, 96 GB memory, and NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | We implemented the proposed DPMD and SDAC algorithms with the JAX package. |
| Experiment Setup | Yes | Table 4. Hyperparameters: Critic learning rate: 3e-4; Policy learning rate: 3e-4, linearly annealed to 3e-5; Diffusion steps: 20; Diffusion noise schedule: cosine; Policy network: 3 hidden layers, 256 neurons, Mish activation; Value network: 3 hidden layers, 256 neurons, Mish activation; Replay buffer size (off-policy only): 1 million. |
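
The experiment setup above specifies 20 diffusion steps with a cosine noise schedule. A minimal NumPy sketch of the standard cosine schedule (the widely used formulation; function and variable names are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative noise-retention terms alpha_bar_t for a cosine schedule."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f[1:] / f[0]  # alpha_bar_t for t = 1..num_steps, decreasing from ~1

# 20 diffusion steps, as listed in Table 4.
alpha_bar = cosine_alpha_bar(20)

# Per-step noise rates beta_t, recovered from consecutive alpha_bar ratios
# and clipped for numerical stability near the final step.
betas = 1.0 - alpha_bar / np.concatenate(([1.0], alpha_bar[:-1]))
betas = np.clip(betas, 0.0, 0.999)
```

With only 20 steps, each beta_t is relatively large compared with the hundreds-of-steps schedules used in image generation, which is typical for diffusion policies where fast action sampling matters.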