Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens
Authors: Jihwan Jeong, Xiaoyu Wang, Jingmin Wang, Scott Sanner, Pascal Poupart
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies. |
| Researcher Affiliation | Academia | 1University of Toronto, Toronto; 2University of Waterloo, Waterloo. |
| Pseudocode | Yes | Algorithm 2 in the appendix summarizes RefPlan. Additionally, following Sikchi et al. (2021), we apply an uncertainty penalty based on the variance of the returns predicted by the learned model ensemble. |
| Open Source Code | No | Nevertheless, we aimed to closely replicate the original policy performance reported in prior studies. Table 7 compares our reproduced results with those originally reported. Overall, our implementation closely matches the original performances, often exceeding them significantly across various datasets. However, in some cases, our reproduced policy checkpoints underperformed compared to the originally reported results, such as CQL on the R datasets, EDAC on Walker2d M and ME datasets, COMBO on the Hopper R and M datasets, and MAPLE on the Hopper MR dataset. We will make our code publicly available upon acceptance. |
| Open Datasets | Yes | We evaluate these RQs using the D4RL benchmark (Fu et al., 2020) and its variations, focusing on locomotion tasks in the HalfCheetah, Hopper, and Walker2d environments, each with five configurations: random (R), medium (M), medium-replay (MR), medium-expert (ME), and full-replay (FR). |
| Dataset Splits | No | The paper uses D4RL benchmark datasets (Fu et al., 2020) and mentions training prior policies on these datasets (e.g., the ME and FR datasets) and evaluating on specific scenarios (e.g., OOD states from the R dataset). However, it does not provide explicit training/validation/test splits within a single dataset, exact percentages, or sample counts, nor does it refer to predefined splits from external sources for reproducing the data partitioning. |
| Hardware Specification | Yes | All experiments were conducted on a single machine equipped with an RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions tools like W&B (Biewald, 2020) and Bayesian optimization (Snoek et al., 2012) but does not specify version numbers for any key software libraries, programming languages, or frameworks used for implementation. |
| Experiment Setup | Yes | Table 6: Hyperparameters for Model Architecture and Training. Tables 9-13 outline the hyperparameters used for RefPlan across the five prior policies discussed in Section 4. |
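The "Pseudocode" row quotes an uncertainty penalty based on the variance of returns predicted by the learned model ensemble (following Sikchi et al., 2021). A minimal NumPy sketch of that idea is shown below; the function name `penalized_return` and the penalty coefficient `beta` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def penalized_return(ensemble_returns, beta=1.0):
    """Penalize predicted returns by ensemble disagreement.

    ensemble_returns: array of shape (num_models, num_trajectories),
        each row holding one ensemble member's predicted returns.
    beta: hypothetical penalty coefficient trading off mean return
        against epistemic uncertainty (ensemble standard deviation).
    """
    mean = ensemble_returns.mean(axis=0)  # average predicted return
    std = ensemble_returns.std(axis=0)    # disagreement across the ensemble
    return mean - beta * std


# Example: two models agree on trajectory scores within +/- 1,
# so each trajectory's score is discounted by its std (here 1.0).
scores = penalized_return(np.array([[1.0, 5.0], [3.0, 7.0]]), beta=1.0)
```

Trajectories (or candidate plans) would then be ranked by `scores`, so plans whose outcomes the ensemble disagrees about are down-weighted.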