Efficient Online Reinforcement Learning for Diffusion Policy
Authors: Haitong Ma, Tianyi Chen, Kai Wang, Na Li, Bo Dai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RL baselines on most tasks, and DPMD improves by more than 120% over Soft Actor-Critic on Humanoid and Ant. |
| Researcher Affiliation | Academia | 1Harvard University. 2Georgia Institute of Technology. Emails: Haitong Ma <EMAIL>, Na Li <EMAIL>, Bo Dai <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Diffusion Policy Mirror Descent (DPMD) ... Algorithm 2 Soft Diffusion Actor-Critic (SDAC) |
| Open Source Code | Yes | The implementation can be found at https://github.com/mahaitongdae/diffusion_policy_online_rl. |
| Open Datasets | Yes | evaluated the performance on 10 OpenAI Gym MuJoCo v4 tasks. |
| Dataset Splits | Yes | All environments except Humanoid-v4 are trained over 200K iterations with a total of 1 million environment interactions, while Humanoid-v4 is trained for five times as many. The results are evaluated with the average return of 20 episodes across 5 random seeds. |
| Hardware Specification | Yes | The computation is conducted on a desktop workstation with AMD Ryzen 9 7950X CPU, 96 GB memory, and NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | We implemented the proposed DPMD and SDAC algorithms with the JAX package. |
| Experiment Setup | Yes | Table 4. Hyperparameters: Critic learning rate: 3e-4; Policy learning rate: 3e-4, linearly annealed to 3e-5; Diffusion steps: 20; Diffusion noise schedule: cosine; Policy network: 3 hidden layers, 256 neurons, Mish activation; Value network: 3 hidden layers, 256 neurons, Mish activation; Replay buffer size (off-policy only): 1 million. |
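
The experiment setup above specifies 20 diffusion steps with a cosine noise schedule. A minimal NumPy sketch of the standard cosine schedule (the widely used formulation; function and variable names are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative noise-retention terms alpha_bar_t for a cosine schedule."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f[1:] / f[0]  # alpha_bar_t for t = 1..num_steps, decreasing from ~1

# 20 diffusion steps, as listed in Table 4.
alpha_bar = cosine_alpha_bar(20)

# Per-step noise rates beta_t, recovered from consecutive alpha_bar ratios
# and clipped for numerical stability near the final step.
betas = 1.0 - alpha_bar / np.concatenate(([1.0], alpha_bar[:-1]))
betas = np.clip(betas, 0.0, 0.999)
```

With only 20 steps, each beta_t is relatively large compared with the hundreds-of-steps schedules used in image generation, which is typical for diffusion policies where fast action sampling matters.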