Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens

Authors: Jihwan Jeong, Xiaoyu Wang, Jingmin Wang, Scott Sanner, Pascal Poupart

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.
Researcher Affiliation | Academia | 1University of Toronto, Toronto; 2University of Waterloo, Waterloo.
Pseudocode | Yes | Algorithm 2 in the appendix summarizes RefPlan. Additionally, following Sikchi et al. (2021), we apply an uncertainty penalty based on the variance of the returns predicted by the learned model ensemble.
Open Source Code | No | Nevertheless, we aimed to closely replicate the original policy performance reported in prior studies. Table 7 compares our reproduced results with those originally reported. Overall, our implementation closely matches the original performances, often exceeding them significantly across various datasets. However, in some cases our reproduced policy checkpoints underperformed the originally reported results, such as CQL on the R datasets, EDAC on the Walker2d M and ME datasets, COMBO on the Hopper R and M datasets, and MAPLE on the Hopper MR dataset. We will make our code publicly available upon acceptance.
Open Datasets | Yes | We evaluate these RQs using the D4RL benchmark (Fu et al., 2020) and its variations, focusing on locomotion tasks in the HalfCheetah, Hopper, and Walker2d environments, each with five configurations: random (R), medium (M), medium-replay (MR), medium-expert (ME), and full-replay (FR).
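Results on these D4RL locomotion tasks are conventionally reported as normalized scores, where 0 corresponds to a random policy's return and 100 to an expert's. A minimal sketch of that convention (the reference scores below are illustrative placeholders; D4RL ships the official per-environment values):

```python
def d4rl_normalized_score(raw_return: float,
                          random_score: float,
                          expert_score: float) -> float:
    """D4RL convention: 0 = random-policy return, 100 = expert-policy return."""
    return 100.0 * (raw_return - random_score) / (expert_score - random_score)

# Illustrative reference values only; use the per-environment scores
# bundled with the D4RL package for real evaluations.
score = d4rl_normalized_score(raw_return=5000.0,
                              random_score=-280.0,
                              expert_score=12135.0)
```

This linear rescaling is what makes scores comparable across environments with very different raw return magnitudes.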
Dataset Splits | No | The paper uses D4RL benchmark datasets (Fu et al., 2020) and mentions training prior policies on these datasets (e.g., the ME and FR datasets) and evaluating on specific scenarios (e.g., OOD states from the R dataset). However, it does not provide explicit training/test/validation splits within a single dataset, exact percentages, or sample counts, nor does it refer to predefined splits from external sources for reproducing the data partitioning.
Hardware Specification | Yes | All experiments were conducted on a single machine equipped with an RTX 3090 GPU.
Software Dependencies | No | The paper mentions tools like W&B (Biewald, 2020) and Bayesian optimization (Snoek et al., 2012) but does not specify version numbers for any key software libraries, programming languages, or frameworks used for implementation.
Experiment Setup | Yes | Table 6: Hyperparameters for Model Architecture and Training. Tables 9-13 outline the hyperparameters used for RefPlan across the five prior policies discussed in Section 4.