Planning with Consistency Models for Model-Based Offline Reinforcement Learning

Authors: Guanquan Wang, Takuya Hiraoka, Yoshimasa Tsuruoka

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate our method on Gym tasks in the D4RL framework, demonstrating that, when compared to its diffusion model counterparts, our method achieves more than a 12-fold increase in speed without any loss in performance." (Section 5, Experiment)
Researcher Affiliation | Collaboration | Guanquan Wang, Department of Information and Communication Engineering, The University of Tokyo; Takuya Hiraoka, NEC Corporation, Tokyo, Japan; Yoshimasa Tsuruoka, Department of Information and Communication Engineering, The University of Tokyo
Pseudocode | Yes | Algorithm 1: Consistency Distillation with guidance; Algorithm 2: Planning with Consistency Model
Open Source Code | No | The paper neither states that source code for the described methodology is released nor provides a link to a code repository.
Open Datasets | Yes | "We validate our method on Gym tasks in the D4RL framework... We evaluate Consistency Planning on D4RL benchmark tasks (Fu et al., 2020) for offline RL... The diffusion model, inverse dynamics model, and consistency model are trained using publicly available D4RL datasets..."
Dataset Splits | No | The paper uses D4RL datasets but does not explicitly describe how they were split into training, validation, and test sets for the experiments.
Hardware Specification | No | The paper says inference time was measured "on our server" but gives no details about the hardware (e.g., GPU model, CPU, memory).
Software Dependencies | No | The paper mentions using "2nd order Heun as ODE solver" and the "Adam optimizer", but it lists no software dependencies with version numbers (e.g., PyTorch or TensorFlow versions) used for the implementation.
Experiment Setup | Yes | "We train diffusion model using learning rate of 1e-4 and batch size of 512 for 2e5 train steps with Adam optimizer. We choose the probability p of removing the conditioning information to be 0.25. We use N = 2 for consistency inference. We use a planning horizon H of 32, context length C of 8 in all tasks. We use a guidance scale ωmax = 1, ωmin = 0 in guided consistency distillation."
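The reported experiment settings can be collected into a single configuration for reference. This is a sketch based only on the numbers quoted above; the key names are illustrative and do not come from the paper's (unreleased) code:

```python
# Hyperparameters reported in the paper; key names are illustrative.
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,            # diffusion model learning rate
    "batch_size": 512,
    "train_steps": int(2e5),
    "p_drop_conditioning": 0.25,      # probability p of removing conditioning info
    "consistency_inference_steps": 2, # N
    "planning_horizon": 32,           # H
    "context_length": 8,              # C
    "guidance_scale_max": 1.0,        # ω_max in guided consistency distillation
    "guidance_scale_min": 0.0,        # ω_min
}
```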
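The "2nd order Heun as ODE solver" mentioned under Software Dependencies refers to the standard explicit trapezoidal (Heun) method commonly used in diffusion-model sampling. A minimal generic sketch of that solver (not the paper's implementation) is:

```python
def heun_step(f, t, y, dt):
    """One step of the 2nd-order Heun (explicit trapezoidal) method.

    f(t, y) returns dy/dt at time t and state y.
    """
    k1 = f(t, y)                     # slope at the start of the interval
    y_pred = y + dt * k1             # Euler predictor
    k2 = f(t + dt, y_pred)           # slope at the predicted endpoint
    return y + dt * 0.5 * (k1 + k2)  # corrector: average of the two slopes

def integrate(f, t0, y0, t1, n_steps):
    """Integrate y' = f(t, y) from t0 to t1 with n_steps Heun steps."""
    t, y = t0, y0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        y = heun_step(f, t, y, dt)
        t += dt
    return y
```

For example, integrating y' = y from t = 0 to t = 1 with y(0) = 1 approximates e ≈ 2.71828 with second-order accuracy in the step size.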