Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

Authors: Xiao Huang, Xu Liu, Enze Zhang, Tong Yu, Shuai Li

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By applying CFDG to the popular methods IQL, PEX and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmarks such as MuJoCo and AntMaze. In this section, we show the efficiency of our CFDG method through empirical validation. Section 4.1 commences by showcasing its excellent performance on the D4RL benchmark (Fu et al., 2020) and also demonstrates generalizability and statistical improvements over baselines such as IQL (Kostrikov et al., 2021), PEX (Zhang et al., 2023) and APL (Zheng et al., 2023).
Researcher Affiliation Collaboration 1. Shanghai Jiao Tong University, Shanghai, China; 2. Adobe Research, California, United States. Correspondence to: Shuai Li <EMAIL>.
Pseudocode Yes Algorithm 1 Classifier-free guidance sampling in O2O RL. Our additions are highlighted in blue.
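The paper's Algorithm 1 (classifier-free guidance sampling in O2O RL) is not reproduced in this report. As a point of reference, the classifier-free guidance blend that such sampling builds on can be sketched as below; the function name and guidance-weight convention are illustrative assumptions, not the authors' implementation:

```python
def cfg_epsilon(eps_cond, eps_uncond, w):
    """Classifier-free guidance blend of noise predictions.

    eps_cond:   denoiser output conditioned on the data-type label
                (e.g. online vs. offline transitions in CFDG).
    eps_uncond: unconditional denoiser output (label dropped).
    w:          guidance weight; w = 0 recovers the purely
                conditional prediction, larger w sharpens guidance.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


# Example: with w = 2, the blend extrapolates past the
# conditional prediction, away from the unconditional one.
guided = cfg_epsilon([1.0, 1.0], [0.0, 0.0], w=2.0)
```

At each reverse-diffusion step, the sampler would substitute this blended prediction for the raw conditional one before taking the denoising update.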
Open Source Code No We use the PyTorch implementation of IQL and PEX from https://github.com/Haichao-Zhang/PEX, and the implementation of APL from https://github.com/zhan0903/APL0 and primarily followed the authors' recommended parameters. ... We use the implementation at https://github.com/lucidrains/denoising-diffusion-pytorch. The paper refers to third-party implementations of baselines and a diffusion library, but does not provide an explicit statement or link for the authors' own implementation of CFDG.
Open Datasets Yes Datasets Our method is mainly validated on two D4RL (Fu et al., 2020) benchmarks: Locomotion and AntMaze, which are used by IQL (Kostrikov et al., 2021) and PEX (Zhang et al., 2023).
Dataset Splits No The paper mentions D4RL benchmarks, which typically have standard splits, but it does not explicitly state the training/test/validation splits used for the experiments. It describes the types of datasets (Locomotion, Ant Maze) but not how they were partitioned.
Hardware Specification Yes We train CFDG integrated with base algorithms on an NVIDIA RTX 2080Ti, with approximately 23 hours required for 10K fine-tuning on MuJoCo Locomotion tasks in APL, while 16 hours for 100K fine-tuning in IQL & PEX.
Software Dependencies No The paper mentions using PyTorch for baseline implementations and a specific denoising-diffusion-pytorch library, but it does not provide specific version numbers for any of these software components (e.g., PyTorch version, Python version).
Experiment Setup Yes Settings For IQL and PEX, we perform 1M update steps for offline pre-training and then 1M environment steps for online fine-tuning. For APL, we perform 1M pre-training steps and 0.1M fine-tuning steps to ensure consistency with the original paper. For our data augmentation method, the synthetic buffer size is set to 1M. The update frequency of M is 10K in APL and 100K in IQL and PEX. The generated data ratio r is set to 1/3. Therefore, the percentage of online data, offline data and generated data is 1 : 1 : 1. In the generated data, the ratio of generated online data to generated offline data is 8 : 2. The above configuration is kept the same across all tasks, datasets and methods. Some detailed CFDG parameters are included in Appendix A.2. ... The hyperparameters used in our CFDG module are detailed in Table 5: Denoising Network Residual MLP, Denoising Network Depth 6 layers, Denoising Steps 128 steps, Denoising Network Learning Rate 3e-4, Denoising Network Hidden Dimension 1024 units, Denoising Network Batch Size 256 samples, Denoising Network Activation Function ReLU, Denoising Network Optimizer Adam, Learning Rate Schedule Cosine Annealing, Training Epochs 100K epochs, Training Interval Environment Step 10K steps (APL), 100K steps (IQL & PEX).
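The stated ratios (r = 1/3 generated data, hence online : offline : generated = 1 : 1 : 1, with the generated portion split 8 : 2 between generated-online and generated-offline samples) determine how a training batch is composed. A minimal sketch of that arithmetic follows; the helper name and the rounding choices are assumptions for illustration, not taken from the paper:

```python
def batch_composition(batch_size, r_generated=1 / 3, gen_online_frac=0.8):
    """Split a training batch per the paper's stated ratios.

    r_generated:     fraction of the batch drawn from synthetic data
                     (r = 1/3 gives the 1:1:1 overall ratio).
    gen_online_frac: share of the generated portion that imitates
                     online data (8:2 split -> 0.8).
    """
    n_gen = round(batch_size * r_generated)
    n_online = (batch_size - n_gen) // 2          # real online transitions
    n_offline = batch_size - n_gen - n_online     # real offline transitions
    n_gen_online = round(n_gen * gen_online_frac)  # generated "online" data
    n_gen_offline = n_gen - n_gen_online           # generated "offline" data
    return {
        "online": n_online,
        "offline": n_offline,
        "gen_online": n_gen_online,
        "gen_offline": n_gen_offline,
    }


# With a batch of 300, the split is exact: 100 online, 100 offline,
# and 100 generated (80 generated-online, 20 generated-offline).
counts = batch_composition(300)
```

For batch sizes not divisible by 3 (such as the diffusion model's batch size of 256), the rounding leaves the three sources within one sample of each other.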