Task-Agnostic Pre-training and Task-Guided Fine-tuning for Versatile Diffusion Planner
Authors: Chenyou Fan, Chenjia Bai, Zhao Shan, Haoran He, Yang Zhang, Zhen Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning. [...] 5. Experiments In this section, we conduct experiments to evaluate our proposed method and address the following questions: (1) How does the performance of SODP compare to other methods that combine offline pre-training with online fine-tuning? (2) How does the performance of SODP compare to other multi-task learning approaches? (3) How does SODP achieve higher rewards during online fine-tuning? |
| Researcher Affiliation | Collaboration | 1Northwestern Polytechnical University 2Institute of Artificial Intelligence (Tele AI), China Telecom 3Shenzhen Research Institute of Northwestern Polytechnical University 4Tsinghua University 5Hong Kong University of Science and Technology. Correspondence to: Chenjia Bai <EMAIL>, Zhen Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SODP: Two-stage framework for learning from sub-optimal data. Input: diffusion planner θ, N downstream tasks T_i, multi-task sub-optimal data D = ∪_{i=1}^{N} D_{T_i}, target buffer B_target, replay buffer B, episode length L, pre-train steps N_PT, fine-tune steps N_FT. // Pre-training with sub-optimal data: for t = 1, ..., N_PT do: sample (s, a) ~ D, diffusion time step k ~ Uniform({1, ..., K}), noise ϵ ~ N(0, I); update θ using the loss function (7). // Fine-tuning for downstream tasks: for T_i ∈ [T_1, ..., T_N] do: initialize θ ← θ_PT; B, B_target ← ∅; roll out N_init proficient trajectories using θ; for t = 1, ..., N_FT do: while not end of episode do: obtain samples a_t^{0:K} ~ p_θ(a_t^{0:K} \| s_t); execute the first T_a steps and get reward r(a_t^0); B ← B ∪ {(s_t, a_t^{0:K}, r(a_t^0))}; s_t ← s_{t+T_a}, t ← t + T_a; // approximate target policy µ: if proficient then B_target ← B_target ∪ {a_t^{0:K} \| t ∈ {0, T_a, ..., L}}; compute L_Imp^{T_i} using batches from B according to Eq. (12); compute L_BC^{T_i} using batches from B_target according to Eq. (14); update θ using the loss function (15). |
| Open Source Code | No | The implementation is based on the code from https://github.com/CleanDiffuserTeam/CleanDiffuser, and we use their default hyperparameters. For Adroit, we use a simplified backbone provided by Simple DP3 (https://github.com/YanjieZe/3D-Diffusion-Policy), which removes some components in the U-net. The paper does not provide its own code for SODP, but rather references third-party implementations that were used or adapted. |
| Open Datasets | Yes | We conduct experiments on the Meta-World benchmark (Yu et al., 2019) for both state-based and image-based tasks. We also perform image-based experiments on the Adroit benchmark (Rajeswaran et al., 2018). |
| Dataset Splits | Yes | Following previous work (He et al., 2023), we use a sub-optimal offline dataset containing 1M transitions for each task. The dataset consists of the first 50% of experiences collected from the replay buffer of an SAC (Haarnoja et al., 2018) agent during training. [...] All baselines, along with SODP, are pre-trained on the same dataset containing 50M transitions and are subsequently fine-tuned on each task with 1M transitions. |
| Hardware Specification | Yes | All results are obtained using a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions specific tools and optimizers like 'diffusion policy', 'U-net architecture', 'Featurewise Linear Modulation (FiLM)', 'DDIM', and 'Adam optimizer', but it does not provide specific version numbers for these software components or any programming languages or libraries used for implementation. |
| Experiment Setup | Yes | For pre-training, we use a cosine schedule for βk (Nichol & Dhariwal, 2021) and set diffusion steps K = 100. We pre-train the model for 5e5 steps in Meta-World and 3e3 steps in Adroit. [...] For fine-tuning, we use DDIM (Song et al., 2021) with 10 sampling steps and η = 1. We fine-tune each task for 1e6 steps in Meta-World and 3e3 steps in Adroit. Following DPOK (Fan et al., 2024), we perform p_step ∈ {10, 30} gradient steps per episode. We set discount factor γ = 1 for all tasks. [...] We set N_init ∈ {10, 20} for approximating the target distribution and λ = 1.0 as the BC weight coefficient. [...] Batch size is set to 256 for both pre-training and fine-tuning. [...] We use the Adam optimizer (Kingma, 2014) with default parameters for both pre-training and fine-tuning. The learning rate is set to 1e-4 for pre-training and 1e-5 for fine-tuning with exponential decay. |
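The setup row cites the cosine βk schedule of Nichol & Dhariwal (2021) with K = 100 diffusion steps. A minimal NumPy sketch of that schedule (the function name and the `s` offset default are the standard ones from the cited paper, not from SODP itself):

```python
import numpy as np

def cosine_beta_schedule(K: int, s: float = 0.008) -> np.ndarray:
    """Cosine noise schedule (Nichol & Dhariwal, 2021).

    Returns per-step variances beta_1..beta_K, clipped at 0.999
    to keep the final steps numerically stable.
    """
    steps = np.arange(K + 1, dtype=np.float64)
    # alpha_bar follows a squared-cosine decay from 1 toward 0.
    f = np.cos(((steps / K) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - (alpha_bar[1:] / alpha_bar[:-1])
    return np.clip(betas, 0.0, 0.999)

betas = cosine_beta_schedule(K=100)  # K = 100 as reported in the paper
```

The clip matters in practice: near k = K the ratio of consecutive ᾱ values collapses toward zero, which would otherwise push βk to 1 and make the last reverse-diffusion steps degenerate.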
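Algorithm 1's pre-training loop samples (s, a) from the sub-optimal dataset, a step k ~ Uniform({1, ..., K}), and noise ϵ ~ N(0, I) before applying loss (7). The paper's Eq. (7) is not reproduced in this report, so the sketch below only shows the standard DDPM forward-noising step that produces the corrupted action a^k the loss is computed on; the action shape and the linear schedule here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(a0, k, alpha_bar):
    """Sample a^k ~ q(a^k | a^0) for one pre-training step:
    a^k = sqrt(alpha_bar_k) * a^0 + sqrt(1 - alpha_bar_k) * eps."""
    eps = rng.standard_normal(a0.shape)
    ak = np.sqrt(alpha_bar[k]) * a0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return ak, eps

# Toy usage for a single sampled transition (shapes are hypothetical).
K = 100
betas = np.linspace(1e-4, 0.02, K)   # any valid schedule works here
alpha_bar = np.cumprod(1.0 - betas)
a0 = rng.standard_normal((16, 4))    # illustrative action sequence: 16 steps, 4-dim actions
k = rng.integers(K)                  # k ~ Uniform, as in Algorithm 1
ak, eps = forward_noise(a0, k, alpha_bar)
```

Given a^k, k, and the true ϵ, a^0 is exactly recoverable by inverting the affine map, which is why an ϵ-prediction network suffices for denoising-style training.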