Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL

Authors: Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR.
Researcher Affiliation Academia Nanjing University of Aeronautics and Astronautics, Nanjing, China. Correspondence to: Sheng-Jun Huang <EMAIL>.
Pseudocode Yes The complete training procedure is outlined in the pseudocode provided in Algorithm 1.
Open Source Code Yes The implementation is available at https://github.com/QinwenLuo/SSAR.
Open Datasets Yes Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits No The paper uses the D4RL benchmark datasets. While it describes how 'sub-datasets' are constructed for training (e.g., selecting trajectories with high returns or valuable actions based on Q and V values), it does not explicitly define standard train/test/validation splits in terms of percentages or counts for model evaluation. The text mentions policies 'trained with different regularization on all data in the dataset', implying the entire D4RL dataset serves as the training data for offline RL, with performance metrics reported for the learned policy. No author-specified partition of the D4RL datasets into train/test/validation sets is provided.
Hardware Specification Yes Below is the training time on a 2080 GPU, excluding IQL-style pretraining, showing a slight increase in time cost.
Software Dependencies No The paper mentions using the 'deep offline RL library CORL (Tarasov et al., 2022)' and the 'official code (https://github.com/LeapLabTHU/FamO2O)', but it does not specify exact version numbers for these libraries or for other key software components such as Python or PyTorch. The publication year of the CORL reference is provided, but not a software version number.
Experiment Setup Yes All hyperparameters are kept consistent with the official implementations. For the distribution-aware threshold, in MuJoCo tasks we set nstart to 1 and nend primarily to 3 (1.5 for expert datasets to better mimic high-quality actions). In Antmaze tasks, we set nend to 5 for more efficient exploration.

Table 6. The return thresholds for different tasks.
Dataset | CQL(SA) | TD3+BC(SA)
halfcheetah-medium-v2 | 6000 | 5200
hopper-medium-v2 | 2500 | 1800
walker2d-medium-v2 | 3600 | 2500
halfcheetah-expert-v2 | 11000 | 10500
hopper-expert-v2 | 3500 | 3500
walker2d-expert-v2 | 4800 | 4500

The decay steps Nend is set to 400,000, while the total number of interaction steps in our experiments is limited to 250,000. We set the interval for policy updates to a larger value to achieve stable updates.
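The setup above only reports the endpoint values of the distribution-aware threshold coefficient (nstart, nend) and the decay horizon Nend; the paper excerpt does not state the exact annealing rule. A minimal sketch, assuming a simple linear schedule between the reported endpoints (the function name and the linear form are illustrative assumptions, not the authors' implementation):

```python
def threshold_coeff(step: int,
                    n_start: float = 1.0,
                    n_end: float = 3.0,
                    decay_steps: int = 400_000) -> float:
    """Anneal the distribution-aware threshold coefficient from n_start
    to n_end over decay_steps interaction steps.

    Defaults follow the reported settings: n_start = 1 and n_end = 3 for
    most MuJoCo tasks (n_end = 1.5 for expert datasets, n_end = 5 for
    Antmaze). The linear interpolation itself is an assumption."""
    frac = min(step / decay_steps, 1.0)  # clamp after the decay horizon
    return n_start + frac * (n_end - n_start)
```

Note that with Nend = 400,000 decay steps but only 250,000 total interaction steps, a linear schedule would end training partway through the anneal (here, at a coefficient of 2.25 rather than the full n_end value).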