Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning
Authors: Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Chenjun Xiao, Yang Yu, Zongzhang Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance. (Abstract) and Table 1. Comparison of BDPO and various baseline methods on locomotion-v2 and antmaze-v0 datasets from D4RL. |
| Researcher Affiliation | Academia | 1National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China 2The Chinese University of Hong Kong, Shenzhen, China. Correspondence to: Zongzhang Zhang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Behavior-Regularized Diffusion Policy Optimization (BDPO) |
| Open Source Code | Yes | The code and experiment results of BDPO are available on the project webpage. |
| Open Datasets | Yes | Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance. (Abstract) and For the offline dataset, we choose the -v2 datasets with three levels of qualities provided by D4RL (Fu et al., 2020) (Section A). |
| Dataset Splits | No | The paper uses offline datasets from D4RL (Fu et al., 2020), describing their composition (e.g., 'medium', 'medium-replay', 'medium-expert') but does not specify how these datasets were further split into training, testing, or validation sets for their experiments. Evaluation refers to running the learned policy in an environment. |
| Hardware Specification | Yes | We evaluate BDPO, DAC, and Diffusion-QL with workstations equipped with NVIDIA RTX 4090 cards and the walker2d-medium-replay-v2 dataset. |
| Software Dependencies | No | The paper mentions software like 'PyTorch' and 'JAX' (Figure 12) for implementation and 'ADAM' (Table 3) as an optimizer, but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Table 3. Common hyperparameters across all datasets. and Table 4. Hyper-parameters that vary in different tasks. (Section B.1) explicitly list numerous hyperparameters and training settings. |
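Since the paper does not report a train/validation split for the D4RL datasets (see the Dataset Splits row), a held-out evaluation protocol would have to be defined by the reproducer. The sketch below shows one way this could be done; `split_offline_dataset` and its arguments are hypothetical helpers, not part of BDPO or D4RL.

```python
import random

def split_offline_dataset(transitions, val_fraction=0.1, seed=0):
    """Split a list of offline transitions into train/validation subsets.

    Hypothetical helper: the paper reports no such split, so this only
    illustrates what a reproducer's held-out protocol could look like.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    indices = list(range(len(transitions)))
    rng.shuffle(indices)
    n_val = int(len(transitions) * val_fraction)
    val_idx = indices[:n_val]
    train = [transitions[i] for i in indices[n_val:]]
    val = [transitions[i] for i in val_idx]
    return train, val

# Toy usage with placeholder (state, action, reward, next_state) tuples
# standing in for D4RL transitions.
data = [(float(s), 0.0, 1.0, float(s + 1)) for s in range(100)]
train_set, val_set = split_offline_dataset(data, val_fraction=0.1)
```

With `val_fraction=0.1` on 100 transitions, this yields 90 training and 10 validation tuples; in practice, evaluation in the paper instead means rolling out the learned policy in the environment.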