Task-Agnostic Pre-training and Task-Guided Fine-tuning for Versatile Diffusion Planner
Authors: Chenyou Fan, Chenjia Bai, Zhao Shan, Haoran He, Yang Zhang, Zhen Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning. [...] 5. Experiments In this section, we conduct experiments to evaluate our proposed method and address the following questions: (1) How does the performance of SODP compare to other methods that combine offline pre-training with online fine-tuning? (2) How does the performance of SODP compare to other multi-task learning approaches? (3) How does SODP achieve higher rewards during online fine-tuning? |
| Researcher Affiliation | Collaboration | 1Northwestern Polytechnical University 2Institute of Artificial Intelligence (Tele AI), China Telecom 3Shenzhen Research Institute of Northwestern Polytechnical University 4Tsinghua University 5Hong Kong University of Science and Technology. Correspondence to: Chenjia Bai <EMAIL>, Zhen Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SODP: Two-stage framework for learning from sub-optimal data. Input: diffusion planner θ, N downstream tasks T_i, multi-task sub-optimal data D = ∪_{i=1}^{N} D_{T_i}, target buffer B_target, replay buffer B, episode length L, pre-train steps N_PT, fine-tune steps N_FT. // Pre-training with sub-optimal data: for t = 1, ..., N_PT do: sample (s, a) ~ D, diffusion time step k ~ Uniform({1, ..., K}), noise ϵ ~ N(0, I); update θ using the loss function (7). // Fine-tuning for downstream tasks: for T_i ∈ [T_1, ..., T_N] do: initialize θ ← θ_PT; B, B_target ← ∅; roll out N_init proficient trajectories using θ; for t = 1, ..., N_FT do: while not end of episode do: obtain samples a_t^{0:K} ~ p_θ(a_t^{0:K} \| s_t); execute the first T_a steps and get reward r(a_t^0); B ← B ∪ {(s_t, a_t^{0:K}, r(a_t^0))}; s_t ← s_{t+T_a}, t ← t + T_a; // approximate target policy µ: if proficient then B_target ← B_target ∪ {a_t^{0:K} \| t ∈ {0, T_a, ..., L}}; compute L_Imp^{T_i} using batches from B according to Eq. (12); compute L_BC^{T_i} using batches from B_target according to Eq. (14); update θ using the loss function (15). |
| Open Source Code | No | The implementation is based on the code from https://github.com/CleanDiffuserTeam/CleanDiffuser, and we use their default hyperparameters. For Adroit, we use a simplified backbone provided by Simple DP3 (https://github.com/YanjieZe/3D-Diffusion-Policy), which removes some components in the U-net. The paper does not provide its own code for SODP, but rather references third-party implementations that were used or adapted. |
| Open Datasets | Yes | We conduct experiments on the Meta-World benchmark (Yu et al., 2019) for both state-based and image-based tasks. We also perform image-based experiments on the Adroit benchmark (Rajeswaran et al., 2018). |
| Dataset Splits | Yes | Following previous work (He et al., 2023), we use a sub-optimal offline dataset containing 1M transitions for each task. The dataset consists of the first 50% of experiences collected from the replay buffer of an SAC (Haarnoja et al., 2018) agent during training. [...] All baselines, along with SODP, are pre-trained on the same dataset containing 50M transitions and are subsequently fine-tuned on each task with 1M transitions. |
| Hardware Specification | Yes | All results are obtained using a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions specific tools and optimizers like 'diffusion policy', 'U-net architecture', 'Featurewise Linear Modulation (FiLM)', 'DDIM', and 'Adam optimizer', but it does not provide specific version numbers for these software components or any programming languages or libraries used for implementation. |
| Experiment Setup | Yes | For pre-training, we use a cosine schedule for βk (Nichol & Dhariwal, 2021) and set diffusion steps K = 100. We pre-train the model for 5e5 steps in Meta-World and 3e3 steps in Adroit. [...] For fine-tuning, we use DDIM (Song et al., 2021) with 10 sampling steps and η = 1. We fine-tune each task for 1e6 steps in Meta-World and 3e3 steps in Adroit. Following DPOK (Fan et al., 2024), we perform p_step ∈ {10, 30} gradient steps per episode. We set discount factor γ = 1 for all tasks. [...] We set N_init ∈ {10, 20} for approximating the target distribution and λ = 1.0 as the BC weight coefficient. [...] Batch size is set to 256 for both pre-training and fine-tuning. [...] We use the Adam optimizer (Kingma, 2014) with default parameters for both pre-training and fine-tuning. The learning rate is set to 1e-4 for pre-training and 1e-5 for fine-tuning with exponential decay. |
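The setup row cites the cosine βk schedule of Nichol & Dhariwal (2021) with K = 100 diffusion steps. A minimal NumPy sketch of that schedule (the function name and the `s` offset default are the standard ones from the cited paper, not from SODP itself):

```python
import numpy as np

def cosine_beta_schedule(K: int, s: float = 0.008) -> np.ndarray:
    """Cosine noise schedule (Nichol & Dhariwal, 2021).

    Returns per-step variances beta_1..beta_K, clipped at 0.999
    to keep the final steps numerically stable.
    """
    steps = np.arange(K + 1, dtype=np.float64)
    # alpha_bar follows a squared-cosine decay from 1 toward 0.
    f = np.cos(((steps / K) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - (alpha_bar[1:] / alpha_bar[:-1])
    return np.clip(betas, 0.0, 0.999)

betas = cosine_beta_schedule(K=100)  # K = 100 as reported in the paper
```

The clip matters in practice: near k = K the ratio of consecutive ᾱ values collapses toward zero, which would otherwise push βk to 1 and make the last reverse-diffusion steps degenerate.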
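Algorithm 1's pre-training loop samples (s, a) from the sub-optimal dataset, a step k ~ Uniform({1, ..., K}), and noise ϵ ~ N(0, I) before applying loss (7). The paper's Eq. (7) is not reproduced in this report, so the sketch below only shows the standard DDPM forward-noising step that produces the corrupted action a^k the loss is computed on; the action shape and the linear schedule here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(a0, k, alpha_bar):
    """Sample a^k ~ q(a^k | a^0) for one pre-training step:
    a^k = sqrt(alpha_bar_k) * a^0 + sqrt(1 - alpha_bar_k) * eps."""
    eps = rng.standard_normal(a0.shape)
    ak = np.sqrt(alpha_bar[k]) * a0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return ak, eps

# Toy usage for a single sampled transition (shapes are hypothetical).
K = 100
betas = np.linspace(1e-4, 0.02, K)   # any valid schedule works here
alpha_bar = np.cumprod(1.0 - betas)
a0 = rng.standard_normal((16, 4))    # illustrative action sequence: 16 steps, 4-dim actions
k = rng.integers(K)                  # k ~ Uniform, as in Algorithm 1
ak, eps = forward_noise(a0, k, alpha_bar)
```

Given a^k, k, and the true ϵ, a^0 is exactly recoverable by inverting the affine map, which is why an ϵ-prediction network suffices for denoising-style training.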