Energy-Weighted Flow Matching for Offline Reinforcement Learning

Authors: Shiyuan Zhang, Weitong Zhang, Quanquan Gu

ICLR 2025

Reproducibility Variable | Result | LLM Response (supporting excerpt)
Research Type | Experimental | "Empirically, we demonstrate that the proposed QIPO algorithm improves performance in offline RL tasks. Notably, our algorithm is the first energy-guided diffusion model that operates independently of auxiliary models and the first exact energy-guided flow matching model in the literature. ... We evaluate the performance of QIPO with flow matching and diffusion model on the D4RL tasks (Fu et al., 2020) in this subsection." [Followed by Table 2, Table 4, and Figures 4, 5, 6 showing empirical results and comparisons.]
Researcher Affiliation | Academia | ¹Tsinghua University, ²UNC-Chapel Hill, ³University of California, Los Angeles; EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Training Energy-Weighted Diffusion Model; Algorithm 2: Q-weighted iterative policy optimization for offline RL (diffusion); Algorithm 3: Training Energy-Weighted Flow Matching Model; Algorithm 4: Q-weighted iterative policy optimization for offline RL (flow matching); Algorithm 5: Q-weighted iterative policy optimization for offline RL (diffusion)
Open Source Code | No | "To further support research in the community, we will release the model checkpoints following the de-anonymization process."
Open Datasets | Yes | "We evaluate the performance of QIPO with flow matching and diffusion model on the D4RL tasks (Fu et al., 2020) in this subsection."
Dataset Splits | No | "Input: Offline RL dataset D = {(x, a, x', r)}" (from Algorithms 2, 4, and 5). The paper refers to the D4RL tasks but does not specify train/validation/test splits within these datasets.
Hardware Specification | No | "Part of this work is supported by the Google Cloud Research Credits program with the award GCP376319164." (This indicates cloud usage, but no specific hardware such as GPU or CPU models is listed.)
Software Dependencies | No | "We use the DPM-Solver (Lu et al., 2022a) with diffusion-step = 15 to accelerate the generation step" (a solver is named, but no library version numbers are given).
Experiment Setup | Yes | "We finetune the score network s_θ obtained after warm-up (Line 5, Algorithm 5) with learning rate 10^-4 to perform the Q-weighted iterative policy optimization. The schedule of the diffusion is the same as Lu et al. (2023) and we use the DPM-Solver (Lu et al., 2022a) with diffusion-step = 15 to accelerate the generation step... We perform the iterative policy optimization (Line 7, Algorithm 2) with K3 = 100 and evaluate the performance of the agent every 5 epochs. We renew the support set with period K_renew = 10. ... We use the soft update (Lillicrap, 2015) with λ = 0.005 to stabilize the update of the score network... We conduct an ablation study by changing the guidance scale from β = 3 to β = 10... We ablate the size of the support action set, increasing it from M = 16 to M = 32 and M = 64."
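The "Training Energy-Weighted Flow Matching Model" step (Algorithm 3 above) can be sketched in NumPy under a common assumption: the standard conditional flow-matching regression loss, with each sample reweighted by exp(β·Q) normalized over the batch. The function and variable names below are hypothetical and not taken from the paper's code, which has not been released.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_weighted_fm_loss(v_theta, x0, x1, q_values, beta=3.0):
    """One batch of an energy-weighted flow-matching objective (sketch).

    x0: noise samples; x1: dataset actions; q_values: Q-value per action.
    Each sample's conditional flow-matching loss is weighted by
    exp(beta * Q), normalized over the batch for stability.
    """
    n = x0.shape[0]
    t = rng.uniform(size=(n, 1))                    # random interpolation times
    x_t = (1.0 - t) * x0 + t * x1                   # linear interpolation path
    target_v = x1 - x0                              # conditional target velocity
    pred_v = v_theta(x_t, t)                        # model's predicted velocity
    w = np.exp(beta * (q_values - q_values.max()))  # energy weights (max-shifted)
    w = w / w.sum()
    per_sample = np.sum((pred_v - target_v) ** 2, axis=1)
    return float(np.sum(w * per_sample))

# Toy usage with a linear stand-in for the velocity network.
x0 = rng.normal(size=(8, 2))
x1 = rng.normal(size=(8, 2))
q = rng.normal(size=(8,))
loss = energy_weighted_fm_loss(lambda x, t: 0.1 * x, x0, x1, q)
```

The max-shift inside the exponential is a standard numerical-stability trick and does not change the normalized weights.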
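The "soft update (Lillicrap, 2015) with λ = 0.005" quoted in the Experiment Setup row is the standard Polyak-averaged target update from DDPG. A minimal sketch, with hypothetical parameter-list names:

```python
import numpy as np

def soft_update(target_params, online_params, lam=0.005):
    """Polyak averaging: target <- lam * online + (1 - lam) * target.
    With lam = 0.005, the target network drifts slowly toward the
    online network, stabilizing the bootstrapped update."""
    return [(1.0 - lam) * t + lam * o
            for t, o in zip(target_params, online_params)]

target = [np.zeros(3)]
online = [np.ones(3)]
target = soft_update(target, online)
# target[0] is now [0.005, 0.005, 0.005]
```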