Diffusion Policy Policy Optimization

Authors: Allen Ren, Justin Lidard, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy (Chi et al., 2024b)) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task.
Researcher Affiliation Collaboration Allen Z. Ren1, Justin Lidard1, Lars L. Ankile2,3, Anthony Simeonov3, Pulkit Agrawal3, Anirudha Majumdar1, Benjamin Burchfiel4, Hongkai Dai4, Max Simchowitz3,5 1Princeton University 2Harvard University 3Massachusetts Institute of Technology 4Toyota Research Institute 5Carnegie Mellon University
Pseudocode Yes The pseudocode for DPPO is presented in Algorithm 1.
Open Source Code Yes Webpage with code: diffusion-ppo.github.io.
Open Datasets Yes Environments: OpenAI Gym. We first consider three popular OpenAI Gym locomotion benchmarks (Brockman et al., 2016): {Hopper-v2, Walker2d-v2, HalfCheetah-v2}. All policies are pre-trained with the full medium-level datasets from D4RL (Fu et al., 2020) with state input and action chunk size Ta = 4. Environments: Robomimic. Next we consider four simulated robot manipulation tasks from the ROBOMIMIC benchmark (Mandlekar et al., 2021), {Lift, Can, Square, Transport}, ordered in increasing difficulty. Environments: Furniture-Bench and real furniture assembly. Finally, we demonstrate solving longer-horizon, multi-stage robot manipulation tasks from the FURNITURE-BENCH benchmark (Heo et al., 2023). We also use the Avoid environment from the D3IL benchmark (Jia et al., 2024).
Dataset Splits No The paper uses well-known benchmark datasets (D4RL, ROBOMIMIC, FURNITURE-BENCH, D3IL) for pre-training and for online fine-tuning/evaluation. While these datasets have standard structure, the paper does not explicitly state the training/test/validation splits applied in its experimental setup (e.g., percentages, absolute counts, or citations to specific split configurations), so the data partitioning cannot be reproduced from the text alone.
Hardware Specification Yes Each iteration involves 500 environment timesteps in each of the 40 parallelized environments running on 40 CPU threads and an NVIDIA RTX 2080 GPU (20000 steps total). Each iteration involves 4 episodes (1200 environment timesteps for Lift and Can, 1600 for Square, and 3200 for Transport) from each of the 50 parallelized environments running on 50 CPU threads and an NVIDIA L40 GPU (60000, 80000, 160000 steps). Each iteration involves 1 episode (700 environment timesteps for One-leg, and 1000 for Lamp and Round-table) from each of the 1000 parallelized environments running on an NVIDIA L40 GPU (700000, 1000000, 1000000 steps). The physical robot used is a Franka Emika Panda arm with a RealSense D435 camera.
Software Dependencies No The paper mentions the simulation environments MuJoCo (Todorov et al., 2012) and Isaac Gym (Makoviychuk et al., 2021), and a low-level joint impedance controller provided by Polymetis (Lin et al., 2021). However, it does not specify version numbers for these components, nor for any other key libraries or programming languages used in the implementation.
Experiment Setup Yes Pre-training. The observations and actions are normalized to [0, 1] using min/max statistics from the pre-training dataset. For all three tasks the policy is trained for 3000 epochs with batch size 128, learning rate of 1e-3 decayed to 1e-4 with a cosine schedule, and weight decay of 1e-6. Exponential Moving Average (EMA) is applied with a decay rate of 0.995. Fine-tuning. All methods from Section 5.1 use the same pre-trained policy. Fine-tuning is done using online experiences sampled from 40 parallelized MuJoCo environments (Todorov et al., 2012). Detailed hyperparameters are listed in Table 7.
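The pre-training recipe in the Experiment Setup row (min/max normalization to [0, 1], a cosine learning-rate decay from 1e-3 to 1e-4 over 3000 epochs, and an EMA of the policy weights with decay 0.995) can be sketched in a few lines. This is a minimal illustration of those hyperparameters; the function names are ours, not taken from the DPPO codebase:

```python
import math

def normalize(x, lo, hi):
    # Min/max normalization to [0, 1] using pre-training dataset statistics.
    return (x - lo) / (hi - lo)

def cosine_lr(epoch, total_epochs=3000, lr_max=1e-3, lr_min=1e-4):
    # Cosine schedule: lr_max at epoch 0, decaying smoothly to lr_min
    # at the final epoch.
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos

def ema_update(ema_weights, weights, decay=0.995):
    # Exponential moving average of policy parameters (decay 0.995).
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```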
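The abstract describes DPPO as policy-gradient (PPO-style) fine-tuning of a diffusion policy, where each denoising step can be scored as a Gaussian action so the standard clipped surrogate objective applies. Below is a minimal sketch of those two ingredients for a scalar action, under our own naming; see Algorithm 1 in the paper for the actual procedure:

```python
import math

def gaussian_log_prob(a, mu, sigma):
    # Log-density of one denoising step a ~ N(mu, sigma^2).
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    # Standard PPO clipped surrogate on the likelihood ratio
    # (negated, so minimizing this loss maximizes the objective).
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped * advantage)
```

With an unchanged policy (ratio 1), the loss is just the negative advantage; large ratio increases are clipped at 1 + eps, which is what keeps fine-tuning updates conservative.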