SUMO: Search-Based Uncertainty Estimation for Model-Based Offline Reinforcement Learning
Authors: Zhongjian Qiao, Jiafei Lyu, Kechen Jiao, Qi Liu, Xiu Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on D4RL datasets demonstrate that SUMO can provide accurate uncertainty estimation and boost the performance of base algorithms. These indicate that SUMO could be a better uncertainty estimator for model-based offline RL when used in either reward penalty or trajectory truncation. |
| Researcher Affiliation | Academia | 1 Tsinghua Shenzhen International Graduate School, Tsinghua University; 2 Harbin Institute of Technology, Shenzhen; EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | The detailed pseudocode of MOPO+SUMO can be found in Appendix B, and the detailed process can be summarized as: Step 1: Learning the Environmental Dynamics Model: ... We summarize the full pseudocode of MOReL+SUMO in Appendix B. |
| Open Source Code | Yes | Code https://github.com/qzj-debug/SUMO.git |
| Open Datasets | Yes | Empirically, we conduct extensive experiments on the D4RL (Fu et al. 2020) benchmark, and the experimental results indicate that SUMO can significantly enhance the performance of base algorithms. We also show that SUMO can provide more accurate uncertainty estimation than commonly used model ensemble-based methods. Our contributions can be summarized as follows: |
| Dataset Splits | No | The paper mentions using "D4RL MuJoCo datasets" but does not specify custom training, validation, or test splits for these datasets. It describes how synthetic samples are generated from the dataset but not how the original D4RL data itself was split for experimentation. D4RL datasets are commonly used as static datasets for offline RL, typically without further splitting beyond what the benchmark itself provides (if any), but this is not explicitly stated here. |
| Hardware Specification | No | The paper mentions using FAISS as an efficient GPU-based KNN search method, but it does not provide any specific details about the GPU models, CPU models, or other hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions algorithms like SAC and FAISS, but it does not specify any software libraries or dependencies with their version numbers. |
| Experiment Setup | Yes | We run all experiments with 5 different random seeds. We run each algorithm for 1M gradient steps with 5 random seeds. We adopt û to penalize the rewards, r̂ = r − λû, where λ is a hyperparameter that controls the magnitude of the penalty. Typically, we set a sampling coefficient η ∈ [0, 1]. Specifically, in the process of generating synthetic trajectories, we apply Equation (4) to estimate the uncertainty of each generated sample (ŝ, â, ŝ′) and set a truncating threshold ϵ. If the uncertainty of any sample exceeds this threshold, we consider the sample unreliable and stop generating the trajectory, adding the generated trajectory to the synthetic dataset Dmodel. For flexibility, we can also multiply the threshold ϵ by a coefficient α for adjustment. To ensure the generation of OOD samples, we set the rollout horizon to 100. The main hyperparameter in SUMO is the number of nearest neighbors k in KNN search. |
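The quoted setup combines three mechanisms: a KNN-based uncertainty estimate û (the paper uses FAISS for the search), a reward penalty r̂ = r − λû, and trajectory truncation when û exceeds αϵ. The sketch below illustrates that pipeline under stated assumptions: it substitutes a plain NumPy nearest-neighbor search for FAISS, and the function names (`knn_uncertainty`, `penalize_reward`, `should_truncate`) and the choice of mean k-NN distance as the uncertainty score are illustrative, not the paper's exact Equation (4).

```python
import numpy as np

def knn_uncertainty(query, dataset, k=5):
    """Distance-based uncertainty: mean Euclidean distance from `query`
    to its k nearest neighbors in `dataset` (stand-in for a FAISS search).
    `dataset` is an (N, d) array of in-distribution feature vectors."""
    dists = np.linalg.norm(dataset - query, axis=1)
    return float(np.sort(dists)[:k].mean())

def penalize_reward(r, u, lam=1.0):
    """Uncertainty-penalized reward: r_hat = r - lambda * u_hat."""
    return r - lam * u

def should_truncate(u, eps, alpha=1.0):
    """Stop generating the synthetic rollout once uncertainty exceeds
    the (optionally scaled) threshold alpha * eps."""
    return u > alpha * eps
```

During rollout generation, each model-generated sample would be scored with `knn_uncertainty`; the rollout is cut short as soon as `should_truncate` fires, and surviving transitions are stored with their penalized rewards.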