Energy-Weighted Flow Matching for Offline Reinforcement Learning

Authors: Shiyuan Zhang, Weitong Zhang, Quanquan Gu

ICLR 2025

Reproducibility Variable | Result | LLM Response (supporting excerpt)
Research Type | Experimental | "Empirically, we demonstrate that the proposed QIPO algorithm improves performance in offline RL tasks. Notably, our algorithm is the first energy-guided diffusion model that operates independently of auxiliary models and the first exact energy-guided flow matching model in the literature. ... We evaluate the performance of QIPO with flow matching and diffusion model on the D4RL tasks (Fu et al., 2020) in this subsection." [Followed by Table 2, Table 4, and Figures 4, 5, 6 showing empirical results and comparisons.]
Researcher Affiliation | Academia | ¹Tsinghua University, ²UNC-Chapel Hill, ³University of California, Los Angeles; EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Training Energy-Weighted Diffusion Model; Algorithm 2: Q-weighted iterative policy optimization for offline RL (diffusion); Algorithm 3: Training Energy-Weighted Flow Matching Model; Algorithm 4: Q-weighted iterative policy optimization for offline RL (flow matching); Algorithm 5: Q-weighted iterative policy optimization for offline RL (diffusion)
Open Source Code | No | "To further support research in the community, we will release the model checkpoints following the de-anonymization process."
Open Datasets | Yes | "We evaluate the performance of QIPO with flow matching and diffusion model on the D4RL tasks (Fu et al., 2020) in this subsection."
Dataset Splits | No | "Input: Offline RL dataset D = {(x, a, x', r)}" (from Algorithms 2, 4, and 5). The paper refers to the D4RL tasks but does not specify train/validation/test splits within these datasets.
Hardware Specification | No | "Part of this work is supported by the Google Cloud Research Credits program with the award GCP376319164." (This indicates cloud usage, but no specific hardware such as GPU or CPU models is listed.)
Software Dependencies | No | "We use the DPM-Solver (Lu et al., 2022a) with diffusion-step = 15 to accelerate the generation step" (a solver is named, but no library version numbers are given).
Experiment Setup | Yes | "We finetune the score network s_θ obtained after warm-up (Line 5, Algorithm 5) with learning rate 10^-4 to perform the Q-weighted iterative policy optimization. The schedule of the diffusion is the same as Lu et al. (2023) and we use the DPM-Solver (Lu et al., 2022a) with diffusion-step = 15 to accelerate the generation step... We perform the iterative policy optimization (Line 7, Algorithm 2) with K3 = 100 and evaluate the performance of the agent every 5 epochs. We renew the support set with period K_renew = 10. ... We use the soft update (Lillicrap, 2015) with λ = 0.005 to stabilize the update of the score network... We conduct an ablation study by changing the guidance scale from β = 3 to β = 10... We ablate the size of the support action set, increasing it from M = 16 to M = 32 and M = 64."
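The "Training Energy-Weighted Flow Matching Model" step (Algorithm 3 above) can be sketched in NumPy under a common assumption: the standard conditional flow-matching regression loss, with each sample reweighted by exp(β·Q) normalized over the batch. The function and variable names below are hypothetical and not taken from the paper's code, which has not been released.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_weighted_fm_loss(v_theta, x0, x1, q_values, beta=3.0):
    """One batch of an energy-weighted flow-matching objective (sketch).

    x0: noise samples; x1: dataset actions; q_values: Q-value per action.
    Each sample's conditional flow-matching loss is weighted by
    exp(beta * Q), normalized over the batch for stability.
    """
    n = x0.shape[0]
    t = rng.uniform(size=(n, 1))                    # random interpolation times
    x_t = (1.0 - t) * x0 + t * x1                   # linear interpolation path
    target_v = x1 - x0                              # conditional target velocity
    pred_v = v_theta(x_t, t)                        # model's predicted velocity
    w = np.exp(beta * (q_values - q_values.max()))  # energy weights (max-shifted)
    w = w / w.sum()
    per_sample = np.sum((pred_v - target_v) ** 2, axis=1)
    return float(np.sum(w * per_sample))

# Toy usage with a linear stand-in for the velocity network.
x0 = rng.normal(size=(8, 2))
x1 = rng.normal(size=(8, 2))
q = rng.normal(size=(8,))
loss = energy_weighted_fm_loss(lambda x, t: 0.1 * x, x0, x1, q)
```

The max-shift inside the exponential is a standard numerical-stability trick and does not change the normalized weights.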
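The "soft update (Lillicrap, 2015) with λ = 0.005" quoted in the Experiment Setup row is the standard Polyak-averaged target update from DDPG. A minimal sketch, with hypothetical parameter-list names:

```python
import numpy as np

def soft_update(target_params, online_params, lam=0.005):
    """Polyak averaging: target <- lam * online + (1 - lam) * target.
    With lam = 0.005, the target network drifts slowly toward the
    online network, stabilizing the bootstrapped update."""
    return [(1.0 - lam) * t + lam * o
            for t, o in zip(target_params, online_params)]

target = [np.zeros(3)]
online = [np.ones(3)]
target = soft_update(target, online)
# target[0] is now [0.005, 0.005, 0.005]
```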