Directly Forecasting Belief for Reinforcement Learning with Delays

Authors: Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yixuan Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Jürgen Schmidhuber, Chao Huang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multistep bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines. Empirically, using the D4RL benchmark (Fu et al., 2020), we show that DFBT achieves significantly higher prediction accuracy than other belief methods. On the MuJoCo benchmark (Todorov et al., 2012), across various delay settings, we demonstrate that our DFBT-SAC consistently outperforms SOTA augmentation-based and belief-based methods in both learning efficiency and overall performance.
Researcher Affiliation | Academia | 1 University of Southampton; 2 GenAI, King Abdullah University of Science and Technology; 3 Northwestern University; 4 National Taiwan University; 5 Nanyang Technological University; 6 The Swiss AI Lab IDSIA/USI/SUPSI. Correspondence to: Chao Huang <EMAIL>.
Pseudocode | Yes | The pseudo-code of DFBT-SAC is provided in Algorithm 1.
Open Source Code | Yes | Code is available at https://github.com/QingyuanWuNothing/DFBT.
Open Datasets | Yes | We adopt D4RL (Fu et al., 2020) and MuJoCo (Todorov et al., 2012) as the offline dataset and the benchmark, respectively, to evaluate our DFBT-SAC.
Dataset Splits | No | Although the paper mentions training all methods on a 'mixed dataset including random, medium and expert policy demonstrations' from D4RL, it does not provide explicit training/validation/testing splits (e.g., percentages, sample counts, or references to predefined splits) for its experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The implementation of DFBT and DFBT-SAC is based on CORL (Tarasov et al., 2022) and CleanRL (Huang et al., 2022). While these frameworks are named, the paper does not specify version numbers for them or for other key dependencies such as Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | We detail the hyperparameter settings of DFBT and DFBT-SAC in Table 8 and Table 9, respectively. Table 8 (DFBT): Epochs 1e3; Batch Size 256; Attention Heads 4; Layers 10; Hidden Dim 256; Attention Dropout Rate 0.1; Residual Dropout Rate 0.1; Hidden Dropout Rate 0.1; Learning Rate 1e-4; Optimizer AdamW; Weight Decay 1e-4; Betas (0.9, 0.999). Table 9 (DFBT-SAC): Bootstrapping Steps N 8; Learning Rate (Actor) 3e-4; Learning Rate (Critic) 1e-3; Learning Rate (Entropy) 1e-3; Train Frequency (Actor) 2; Train Frequency (Critic) 1; Soft Update Factor (Critic) 5e-3; Batch Size 256; Neurons [256, 256]; Layers 3; Hidden Dim 256; Activation ReLU; Optimizer Adam.
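For reference, the reported hyperparameters from Tables 8 and 9 can be collected into plain configuration dictionaries. This is only an illustrative sketch: the values are transcribed from the paper's tables, but the key names and dictionary layout are our own and do not come from the authors' released code.

```python
# Hyperparameters transcribed from Table 8 (DFBT belief model) of the paper.
# Key names are illustrative, not from the authors' codebase.
DFBT_CONFIG = {
    "epochs": 1_000,             # 1e3
    "batch_size": 256,
    "attention_heads": 4,
    "layers": 10,
    "hidden_dim": 256,
    "attention_dropout": 0.1,
    "residual_dropout": 0.1,
    "hidden_dropout": 0.1,
    "learning_rate": 1e-4,
    "optimizer": "AdamW",
    "weight_decay": 1e-4,
    "betas": (0.9, 0.999),
}

# Hyperparameters transcribed from Table 9 (DFBT-SAC agent).
DFBT_SAC_CONFIG = {
    "bootstrapping_steps_N": 8,
    "lr_actor": 3e-4,
    "lr_critic": 1e-3,
    "lr_entropy": 1e-3,
    "train_freq_actor": 2,       # actor updated every 2 steps
    "train_freq_critic": 1,      # critic updated every step
    "critic_soft_update_tau": 5e-3,
    "batch_size": 256,
    "neurons": [256, 256],
    "layers": 3,
    "hidden_dim": 256,
    "activation": "ReLU",
    "optimizer": "Adam",
}
```

Collecting the tables this way makes it easy to spot shared settings (e.g., batch size 256 and hidden dimension 256 in both components) when attempting a reproduction.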