Directly Forecasting Belief for Reinforcement Learning with Delays

Authors: Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yixuan Wang, Chung-Wei Lin, Chen Lv, Qi Zhu, Jürgen Schmidhuber, Chao Huang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multistep bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines. Empirically, using the D4RL benchmark (Fu et al., 2020), we show that DFBT achieves significantly higher prediction accuracy than other belief methods. On the MuJoCo benchmark (Todorov et al., 2012), across various delay settings, we demonstrate that our DFBT-SAC consistently outperforms SOTA augmentation-based and belief-based methods in both learning efficiency and overall performance.
Researcher Affiliation | Academia | 1 University of Southampton; 2 GenAI, King Abdullah University of Science and Technology; 3 Northwestern University; 4 National Taiwan University; 5 Nanyang Technological University; 6 The Swiss AI Lab IDSIA/USI/SUPSI. Correspondence to: Chao Huang <EMAIL>.
Pseudocode | Yes | The pseudo-code of DFBT-SAC is provided in Algorithm 1.
Open Source Code | Yes | Code is available at https://github.com/QingyuanWuNothing/DFBT.
Open Datasets | Yes | We adopt D4RL (Fu et al., 2020) and MuJoCo (Todorov et al., 2012) as the offline dataset and the benchmark, respectively, to evaluate our DFBT-SAC.
Dataset Splits | No | Although the paper mentions training all methods on a 'mixed dataset including random, medium and expert policy demonstrations' from D4RL, it does not provide explicit training/validation/testing splits (e.g., percentages, sample counts, or references to predefined splits) for its experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The implementation of DFBT and DFBT-SAC is based on CORL (Tarasov et al., 2022) and CleanRL (Huang et al., 2022). While these frameworks are named, the paper does not specify version numbers for them or for other key dependencies such as Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | We detail the hyperparameter settings of DFBT and DFBT-SAC in Table 8 and Table 9, respectively. Table 8 (DFBT): Epochs 1e3; Batch Size 256; Attention Heads 4; Layers 10; Hidden Dim 256; Attention Dropout Rate 0.1; Residual Dropout Rate 0.1; Hidden Dropout Rate 0.1; Learning Rate 1e-4; Optimizer AdamW; Weight Decay 1e-4; Betas (0.9, 0.999). Table 9 (DFBT-SAC): Bootstrapping Steps N 8; Learning Rate (Actor) 3e-4; Learning Rate (Critic) 1e-3; Learning Rate (Entropy) 1e-3; Train Frequency (Actor) 2; Train Frequency (Critic) 1; Soft Update Factor (Critic) 5e-3; Batch Size 256; Neurons [256, 256]; Layers 3; Hidden Dim 256; Activation ReLU; Optimizer Adam.
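For reference, the reported hyperparameters from Tables 8 and 9 can be collected into plain configuration dictionaries. This is only an illustrative sketch: the values are transcribed from the paper's tables, but the key names and dictionary layout are our own and do not come from the authors' released code.

```python
# Hyperparameters transcribed from Table 8 (DFBT belief model) of the paper.
# Key names are illustrative, not from the authors' codebase.
DFBT_CONFIG = {
    "epochs": 1_000,             # 1e3
    "batch_size": 256,
    "attention_heads": 4,
    "layers": 10,
    "hidden_dim": 256,
    "attention_dropout": 0.1,
    "residual_dropout": 0.1,
    "hidden_dropout": 0.1,
    "learning_rate": 1e-4,
    "optimizer": "AdamW",
    "weight_decay": 1e-4,
    "betas": (0.9, 0.999),
}

# Hyperparameters transcribed from Table 9 (DFBT-SAC agent).
DFBT_SAC_CONFIG = {
    "bootstrapping_steps_N": 8,
    "lr_actor": 3e-4,
    "lr_critic": 1e-3,
    "lr_entropy": 1e-3,
    "train_freq_actor": 2,       # actor updated every 2 steps
    "train_freq_critic": 1,      # critic updated every step
    "critic_soft_update_tau": 5e-3,
    "batch_size": 256,
    "neurons": [256, 256],
    "layers": 3,
    "hidden_dim": 256,
    "activation": "ReLU",
    "optimizer": "Adam",
}
```

Collecting the tables this way makes it easy to spot shared settings (e.g., batch size 256 and hidden dimension 256 in both components) when attempting a reproduction.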