Habitizing Diffusion Planning for Efficient and Effective Decision Making
Authors: Haofei Lu, Yifei Shen, Dongsheng Li, Junliang Xing, Dongqi Han
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further conduct comprehensive evaluations across various tasks, offering empirical insights into efficient and effective decision making. ... We empirically evaluate Habi on a diverse set of tasks from the D4RL dataset (Fu et al., 2020), one of the most widely used benchmarks for offline RL. ... All the results are calculated over 500 episode seeds for each task to provide a reliable evaluation. HI's results are additionally averaged over 5 training seeds to ensure robustness. |
| Researcher Affiliation | Collaboration | The work was conducted during the internship of Haofei Lu (EMAIL) at Microsoft Research Asia 1Department of Computer Science and Technology, Tsinghua University 2Microsoft Research Asia. Correspondence to: Junliang Xing <EMAIL>, Dongqi Han <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes but does not include any clearly labeled pseudocode or algorithm blocks. Figure 3 is a diagram, not pseudocode. |
| Open Source Code | Yes | Our code is anonymously available at https://bayesbrain.github.io/. |
| Open Datasets | Yes | We empirically evaluate Habi on a diverse set of tasks from the D4RL dataset (Fu et al., 2020), one of the most widely used benchmarks for offline RL. |
| Dataset Splits | Yes | We empirically evaluate Habi on a diverse set of tasks from the D4RL dataset (Fu et al., 2020), one of the most widely used benchmarks for offline RL. ... All the results are calculated over 500 episode seeds for each task to provide a reliable evaluation. HI's results are additionally averaged over 5 training seeds to ensure robustness. |
| Hardware Specification | Yes | All runtime measurements were conducted on two different computing hardwares: a laptop CPU (Apple M2 Max) or a server GPU (Nvidia A100). Training was on Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam' in Table 3 and notes that 'CleanDiffuser (Dong et al., 2024b)' was used to reproduce baselines, but it does not provide version numbers for any software libraries or dependencies used in the main methodology. |
| Experiment Setup | Yes | Table 3. Hyperparameters in our experiments: Optimizer Adam; Learning Rate 3e-4; Gradient Steps 1,000,000; Batch Size 256; Latent Dimension Dim(z) 256; MLP Hidden Size (Encoder & Decoder) 256; MLP Hidden Layers (Encoder & Decoder) 2; Habitization Target (Locomotion-related: MuJoCo, AntMaze) DQL (Wang et al., 2023); Habitization Target (Planning-related: Kitchen, Maze2D) DV (Lu et al., 2025); Target KL-divergence D_KL^tar 1.0; Number of Sampling Candidates in Habitization Training 50; Number of Sampling Candidates in Habitual Inference 5. |