Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens

Authors: Jihwan Jeong, Xiaoyu Wang, Jingmin Wang, Scott Sanner, Pascal Poupart

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.
Researcher Affiliation | Academia | 1University of Toronto, Toronto; 2University of Waterloo, Waterloo.
Pseudocode | Yes | Algorithm 2 in the appendix summarizes RefPlan. Additionally, following Sikchi et al. (2021), we apply an uncertainty penalty based on the variance of the returns predicted by the learned model ensemble.
Open Source Code | No | Nevertheless, we aimed to closely replicate the original policy performance reported in prior studies. Table 7 compares our reproduced results with those originally reported. Overall, our implementation closely matches the original performances, often exceeding them significantly across various datasets. However, in some cases our reproduced policy checkpoints underperformed the originally reported results, such as CQL on the R datasets, EDAC on the Walker2d M and ME datasets, COMBO on the Hopper R and M datasets, and MAPLE on the Hopper MR dataset. We will make our code publicly available upon acceptance.
Open Datasets | Yes | We evaluate these RQs using the D4RL benchmark (Fu et al., 2020) and its variations, focusing on locomotion tasks in the HalfCheetah, Hopper, and Walker2d environments, each with five configurations: random (R), medium (M), medium-replay (MR), medium-expert (ME), and full-replay (FR).
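Results on these D4RL locomotion tasks are conventionally reported as normalized scores, where 0 corresponds to a random policy's return and 100 to an expert's. A minimal sketch of that convention (the reference scores below are illustrative placeholders; D4RL ships the official per-environment values):

```python
def d4rl_normalized_score(raw_return: float,
                          random_score: float,
                          expert_score: float) -> float:
    """D4RL convention: 0 = random-policy return, 100 = expert-policy return."""
    return 100.0 * (raw_return - random_score) / (expert_score - random_score)

# Illustrative reference values only; use the per-environment scores
# bundled with the D4RL package for real evaluations.
score = d4rl_normalized_score(raw_return=5000.0,
                              random_score=-280.0,
                              expert_score=12135.0)
```

This linear rescaling is what makes scores comparable across environments with very different raw return magnitudes.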
Dataset Splits | No | The paper uses D4RL benchmark datasets (Fu et al., 2020) and mentions training prior policies on these datasets (e.g., the ME and FR datasets) and evaluating on specific scenarios (e.g., OOD states from the R dataset). However, it does not provide explicit training/test/validation splits within a single dataset, exact percentages, or sample counts, nor does it refer to predefined splits from external sources for reproducing the data partitioning.
Hardware Specification | Yes | All experiments were conducted on a single machine equipped with an RTX 3090 GPU.
Software Dependencies | No | The paper mentions tools like W&B (Biewald, 2020) and Bayesian optimization (Snoek et al., 2012) but does not specify version numbers for any key software libraries, programming languages, or frameworks used for implementation.
Experiment Setup | Yes | Table 6: Hyperparameters for Model Architecture and Training. Tables 9-13 outline the hyperparameters used for RefPlan across the five prior policies discussed in Section 4.