Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL

Authors: Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR.
Researcher Affiliation Academia Nanjing University of Aeronautics and Astronautics, Nanjing, China. Correspondence to: Sheng-Jun Huang <EMAIL>.
Pseudocode Yes The complete training procedure is outlined in the pseudocode provided in Algorithm 1.
Open Source Code Yes The implementation is available at https://github.com/QinwenLuo/SSAR.
Open Datasets Yes Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits No The paper uses the D4RL benchmark datasets. While it describes how 'sub-datasets' are constructed for training (e.g., selecting trajectories with high returns or valuable actions based on Q and V values), it does not explicitly define standard train/test/validation splits in terms of percentages or counts for model evaluation. The text mentions policies 'trained with different regularization on all data in the dataset', implying the entire D4RL dataset serves as the training data for offline RL, with performance metrics reported for the learned policy. No author-specified partition of the D4RL datasets into train/test/validation sets is provided.
Hardware Specification Yes Below is the training time on a 2080 GPU, excluding IQL-style pretraining, showing a slight increase in time cost.
Software Dependencies No The paper mentions using the 'deep offline RL library CORL (Tarasov et al., 2022)' and the 'official code (https://github.com/LeapLabTHU/FamO2O)', but it does not specify exact version numbers for these libraries or for other key software components such as Python or PyTorch. The publication year of the CORL reference is provided, but not a software version number.
Experiment Setup Yes All hyperparameters are kept consistent with the official implementations. For the distribution-aware threshold, in MuJoCo tasks we set nstart to 1 and nend primarily to 3 (1.5 for expert datasets to better mimic high-quality actions). In Antmaze tasks, we set nend to 5 for more efficient exploration.

Table 6. The return thresholds for different tasks.
Dataset | CQL(SA) | TD3+BC(SA)
halfcheetah-medium-v2 | 6000 | 5200
hopper-medium-v2 | 2500 | 1800
walker2d-medium-v2 | 3600 | 2500
halfcheetah-expert-v2 | 11000 | 10500
hopper-expert-v2 | 3500 | 3500
walker2d-expert-v2 | 4800 | 4500

The decay steps Nend is set to 400,000, while the total number of interaction steps in our experiments is limited to 250,000. We set the interval for policy updates to a larger value to achieve stable updates.
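The setup above only reports the endpoint values of the distribution-aware threshold coefficient (nstart, nend) and the decay horizon Nend; the paper excerpt does not state the exact annealing rule. A minimal sketch, assuming a simple linear schedule between the reported endpoints (the function name and the linear form are illustrative assumptions, not the authors' implementation):

```python
def threshold_coeff(step: int,
                    n_start: float = 1.0,
                    n_end: float = 3.0,
                    decay_steps: int = 400_000) -> float:
    """Anneal the distribution-aware threshold coefficient from n_start
    to n_end over decay_steps interaction steps.

    Defaults follow the reported settings: n_start = 1 and n_end = 3 for
    most MuJoCo tasks (n_end = 1.5 for expert datasets, n_end = 5 for
    Antmaze). The linear interpolation itself is an assumption."""
    frac = min(step / decay_steps, 1.0)  # clamp after the decay horizon
    return n_start + frac * (n_end - n_start)
```

Note that with Nend = 400,000 decay steps but only 250,000 total interaction steps, a linear schedule would end training partway through the anneal (here, at a coefficient of 2.25 rather than the full n_end value).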