One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

Authors: Zhendong Wang, Zhaoshuo Li, Ajay Mandlekar, Zhenjia Xu, Jiaojiao Fan, Yashraj Narang, Linxi Fan, Yuke Zhu, Yogesh Balaji, Mingyuan Zhou, Ming-Yu Liu, Yu Zeng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated OneDP on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using a Franka robot. The results demonstrate that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action-prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. A video demo is provided here, and the code will be publicly available soon. We evaluate OneDP on a wide variety of tasks in both simulated and real environments. In the following sections, we first report the evaluation results in simulation across six tasks spanning different complexity levels. Then we demonstrate results in the real environment by deploying OneDP in the real world with a Franka robot arm for object pick-and-place tasks and a coffee-machine manipulation task. We compare our method with the pre-trained backbone Diffusion Policy (DP) (Chi et al., 2023) and the related distillation baseline Consistency Policy (CP) (Prasad et al., 2024). We also report ablation-study results in Appendix E to present more detailed analyses of our method and discuss the effect of different design choices.
Researcher Affiliation | Collaboration | Zhendong Wang 1, Zhaoshuo Li 2, Ajay Mandlekar 2, Zhenjia Xu 2, Jiaojiao Fan 2, Yashraj Narang 2, Linxi Fan 2, Yuke Zhu 2, Yogesh Balaji 2, Mingyuan Zhou 1, Ming-Yu Liu 2, Yu Zeng 2. 1 The University of Texas at Austin, 2 NVIDIA. Correspondence to: Zhendong Wang (work done during an internship at NVIDIA) <EMAIL>, Yu Zeng <EMAIL>.
Pseudocode | Yes | Algorithm 1 OneDP Training
1: Inputs: action generator Gθ, generator score network πψ, pre-trained diffusion policy πϕ.
2: Initialization: Gθ ← πϕ, πψ ← πϕ.
3: while not converged do
4:   Sample Aθ = Gθ(z, O), z ∼ N(0, I).
5:   Diffuse Aθ^k = αk·Aθ + σk·ϵk, ϵk ∼ N(0, I).
6:   if OneDP-S then
7:     Update ψ by Equation (6)
8:     Update θ by Equation (5)
9:   else if OneDP-D then
10:     Update θ by Equation (8)
11:   end if
12: end while
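The alternating loop of Algorithm 1 can be sketched numerically. The snippet below substitutes small linear maps for the paper's U-Net generator and score networks, assumes a simple cosine/sine schedule for (αk, σk), and assumes a Diff-Instruct-style sign convention for the θ-update with w(k) = σk², so it illustrates the structure of the OneDP-S updates rather than the authors' exact Equations (5)-(6).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-ins for the networks in Algorithm 1 (linear maps, not the
# paper's U-Net + ResNet18); only the update structure is illustrated.
act_dim, obs_dim = 4, 8
G = 0.1 * rng.normal(size=(act_dim, obs_dim + act_dim))  # action generator G_theta
eps_psi = 0.1 * rng.normal(size=(act_dim, act_dim))      # generator score network (trained)
eps_phi = 0.1 * rng.normal(size=(act_dim, act_dim))      # pre-trained diffusion policy (frozen)

lr_theta, lr_psi = 1e-6, 2e-5                            # learning rates from the paper

for step in range(200):
    O = rng.normal(size=obs_dim)                         # observation features
    z = rng.normal(size=act_dim)                         # z ~ N(0, I)
    x = np.concatenate([O, z])
    A = G @ x                                            # A_theta = G_theta(z, O)

    k = rng.integers(2, 96)                              # distill over timesteps [2, 95]
    alpha_k, sigma_k = np.cos(k / 100.0), np.sin(k / 100.0)  # assumed schedule
    eps = rng.normal(size=act_dim)
    A_k = alpha_k * A + sigma_k * eps                    # diffuse: A^k = alpha_k*A + sigma_k*eps

    # OneDP-S: fit eps_psi to denoise generator samples, then push theta
    # along the weighted score difference with w(k) = sigma_k^2.
    psi_err = eps_psi @ A_k - eps
    eps_psi -= lr_psi * np.outer(psi_err, A_k)

    grad_A = sigma_k**2 * ((eps_psi - eps_phi) @ A_k)    # assumed sign convention
    G -= lr_theta * np.outer(grad_A, x)
```

The warm-start described in the paper (copying πϕ's weights into both Gθ and the score network) would replace the random initializations above.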
Open Source Code | No | A video demo is provided here, and the code will be publicly available soon.
Open Datasets | Yes | Robomimic. Proposed in Mandlekar et al. (2021), Robomimic is a large-scale benchmark for robotic manipulation tasks. The original benchmark consists of five tasks: Lift, Can, Square, Transport, and Tool Hang. Push-T. Adapted from IBC (Florence et al., 2022), Chi et al. (2023) introduced the Push-T task, which involves pushing a T-shaped block into a fixed target using a circular end-effector. A dataset of 200 expert demonstrations is provided with RGB image observations.
Dataset Splits | No | For each of these tasks, the benchmark provides two variants of human demonstrations: proficient-human (PH) demonstrations and mixed proficient/non-proficient human (MH) demonstrations. Push-T. Adapted from IBC (Florence et al., 2022), Chi et al. (2023) introduced the Push-T task, which involves pushing a T-shaped block into a fixed target using a circular end-effector. A dataset of 200 expert demonstrations is provided with RGB image observations. We collect 100 demonstrations each for the pnp-milk and pnp-anything tasks. Separate models are trained for both tasks, with the pnp-anything model utilizing all 200 demonstrations. The pnp-milk-move task is evaluated using the checkpoint from the pnp-anything model. Evaluation. We evaluate the success rate and task completion time from 20 predetermined initial positions for the pnp-milk, pnp-anything, and coffee tasks, as well as 10 motion trajectories for the pnp-milk-move task. (The text describes data collection and evaluation setup, but not how the collected data is split into training/validation/test sets for model development.)
Hardware Specification | Yes | All measurements were taken using a local NVIDIA V100 GPU, with the same neural-network size for each method.
Software Dependencies | No | Following Chi et al. (2023), we construct a diffusion policy using a 1D temporal convolutional neural network (CNN) (Janner et al., 2022) based U-Net and a standard ResNet18 (without pre-training) (He et al., 2016) as the vision encoder. We implement the diffusion policy with two noise-scheduling methods: DDPM (Ho et al., 2020) and EDM (Karras et al., 2022). (No specific software versions such as PyTorch, TensorFlow, or Python are mentioned.)
Experiment Setup | Yes | 2.3. Implementation Details. Diffusion Policy. Following Chi et al. (2023), we construct a diffusion policy using a 1D temporal convolutional neural network (CNN) (Janner et al., 2022) based U-Net and a standard ResNet18 (without pre-training) (He et al., 2016) as the vision encoder. We implement the diffusion policy with two noise-scheduling methods: DDPM (Ho et al., 2020) and EDM (Karras et al., 2022). We use ϵ-prediction for discrete-time (100-step) diffusion and x0-prediction for continuous-time diffusion, respectively. ... Distillation. We warm-start both the stochastic and deterministic action generator Gθ and the generator score network ϵψ by duplicating the neural-network structure and weights from the pre-trained diffusion policy, aligning with strategies from Luo et al. (2024); Yin et al. (2024); Xu et al. (2024). The inputs of Gθ include pure noise, a fixed time embedding (an initial timestep for DDPM or an initial sigma value for EDM), and observations O. The outputs of Gθ are direct action predictions. Following DreamFusion (Poole et al., 2022), we set w(k) = σ_k^2. In the discrete-time domain, distillation occurs over diffusion timesteps [2, 95] to avoid edge cases. In continuous time, we employ the same log-normal noise scheduling as EDM (Karras et al., 2022) during distillation. The generator operates at a learning rate of 1e-6, while the generator score network is accelerated to a learning rate of 2e-5. Vision encoders are also actively trained during the distillation process. ... D. Training Details. We follow the CNN-based neural-network architecture and observation-encoder design from Chi et al. (2023). For simulation experiments, we use a 256-million-parameter version for DDPM and a 67-million-parameter version for EDM, as the smaller EDM network performs slightly better. In real-world experiments, we also use the 67-million-parameter version.
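The log-normal noise scheduling borrowed from EDM amounts to drawing ln(σ) from a Gaussian before each distillation step. A minimal sketch, assuming EDM's published defaults P_mean = -1.2 and P_std = 1.2 (the excerpt does not state whether OneDP keeps these exact values):

```python
import numpy as np

rng = np.random.default_rng(0)

# EDM's training-noise distribution: ln(sigma) ~ N(P_mean, P_std^2).
# P_mean = -1.2 and P_std = 1.2 are EDM's defaults; whether OneDP keeps
# them during distillation is an assumption.
P_mean, P_std = -1.2, 1.2
sigma = np.exp(rng.normal(P_mean, P_std, size=4096))

# DreamFusion-style weighting used in the paper: w(k) = sigma_k^2.
w = sigma**2
```

This biases distillation toward moderate noise levels (the median σ is exp(P_mean) ≈ 0.30) while still occasionally sampling very small and very large σ.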
Additionally, we adopt the action-chunking idea from Chi et al. (2023) and Zhao et al. (2023), using 16 actions per chunk for prediction, and use two observations for vision encoding. We first train DP for 1000 epochs in both simulation and real-world experiments with a default learning rate of 1e-4 and weight decay of 1e-6. We then perform distillation from the pre-trained checkpoints, distilling for 20 epochs in simulation and 100 epochs in real-world experiments. For distillation, we warm-start both the stochastic and deterministic action generators, Gθ, and the generator score network, ϵψ, by duplicating the network structure and weights from the pre-trained diffusion-policy checkpoints. Since the generator network is initialized from a denoising network, a timestep input is required, as this was part of the original input. We fix the timestep at 65 for discrete diffusion and choose σ = 2.5 for continuous EDM diffusion. The generator learning rate is set to 1e-6. We find these hyperparameters to be stable without causing significant performance variation. We provide an ablation study that focuses primarily on the generator score network's learning rate and optimizer settings in Appendix E. We provide the hyperparameter details in Table 7.
Table 7: Hyperparameters
generator learning rate: 1e-6
generator score network learning rate: 2e-5
generator optimizer: Adam([0.0, 0.999])
generator score network optimizer: Adam([0.0, 0.999])
action chunk size: n=16
number of observations: n=2
discrete diffusion init timestep: t_init=65
discrete diffusion distillation t range: [2, 95]
continuous diffusion init sigma: σ=2.5
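For reference, the Table 7 values can be gathered into a single configuration object; the dict below is an illustrative layout of the reported numbers, not the authors' actual config format.

```python
# Table 7 hyperparameters collected into one place. The key names are
# hypothetical; the values are the ones reported in the text/table.
ONEDP_HPARAMS = {
    "generator_lr": 1e-6,
    "generator_score_network_lr": 2e-5,
    "optimizer": "Adam",
    "adam_betas": (0.0, 0.999),          # beta1 = 0.0 per Table 7
    "action_chunk_size": 16,
    "num_observations": 2,
    "discrete_init_timestep": 65,        # fixed t_init for DDPM-style distillation
    "discrete_distill_t_range": (2, 95),
    "continuous_init_sigma": 2.5,        # fixed sigma input for EDM-style distillation
}
```

Note the unusual beta1 = 0.0 in Adam, which drops momentum entirely; the paper's Appendix E ablation reportedly covers these optimizer settings.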