SUMO: Search-Based Uncertainty Estimation for Model-Based Offline Reinforcement Learning
Authors: Zhongjian Qiao, Jiafei Lyu, Kechen Jiao, Qi Liu, Xiu Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on D4RL datasets demonstrate that SUMO can provide accurate uncertainty estimation and boost the performance of base algorithms. These indicate that SUMO could be a better uncertainty estimator for model-based offline RL when used in either reward penalty or trajectory truncation. |
| Researcher Affiliation | Academia | 1 Tsinghua Shenzhen International Graduate School, Tsinghua University; 2 Harbin Institute of Technology, Shenzhen; EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | The detailed pseudocode of MOPO+SUMO can be found in Appendix B, and the detailed process can be summarized as: Step 1: Learning the Environmental Dynamics Model: ... We summarize the full pseudocode of MOReL+SUMO in Appendix B. |
| Open Source Code | Yes | Code https://github.com/qzj-debug/SUMO.git |
| Open Datasets | Yes | Empirically, we conduct extensive experiments on the D4RL (Fu et al. 2020) benchmark, and the experimental results indicate that SUMO can significantly enhance the performance of base algorithms. We also show that SUMO can provide more accurate uncertainty estimation than commonly used model ensemble-based methods. Our contributions can be summarized as follows: |
| Dataset Splits | No | The paper mentions using "D4RL MuJoCo datasets" but does not specify custom training, validation, or test splits for these datasets. It describes how synthetic samples are generated from the dataset but not how the original D4RL data itself was split for experimentation. D4RL datasets are commonly used as static datasets for offline RL, typically without further splitting beyond what the benchmark itself provides (if any), but this is not explicitly stated here. |
| Hardware Specification | No | The paper mentions using FAISS as an efficient GPU-based KNN search method, but it does not provide any specific details about the GPU models, CPU models, or other hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions algorithms like SAC and FAISS, but it does not specify any software libraries or dependencies with their version numbers. |
| Experiment Setup | Yes | We run all experiments with 5 different random seeds. We run each algorithm for 1M gradient steps with 5 random seeds. We adopt û to penalize the rewards, r̂ = r − λû, where λ is a hyperparameter that controls the magnitude of the penalty. Typically, we set a sampling coefficient η ∈ [0, 1]. Specifically, in the process of generating synthetic trajectories, we apply Equation (4) to estimate the uncertainty of each generated sample (ŝ, â, ŝ′) and set a truncating threshold ϵ. If the uncertainty of any sample exceeds this threshold, we consider the sample unreliable and stop generating the trajectory, adding the generated trajectory to the synthetic dataset Dmodel. For flexibility, we can also multiply the threshold ϵ by a coefficient α for adjustment. To ensure the generation of OOD samples, we set the rollout horizon to 100. The main hyperparameter in SUMO is the number of nearest neighbors k in KNN search. |
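The quoted setup combines three mechanisms: a KNN-based uncertainty estimate û (the paper uses FAISS for the search), a reward penalty r̂ = r − λû, and trajectory truncation when û exceeds αϵ. The sketch below illustrates that pipeline under stated assumptions: it substitutes a plain NumPy nearest-neighbor search for FAISS, and the function names (`knn_uncertainty`, `penalize_reward`, `should_truncate`) and the choice of mean k-NN distance as the uncertainty score are illustrative, not the paper's exact Equation (4).

```python
import numpy as np

def knn_uncertainty(query, dataset, k=5):
    """Distance-based uncertainty: mean Euclidean distance from `query`
    to its k nearest neighbors in `dataset` (stand-in for a FAISS search).
    `dataset` is an (N, d) array of in-distribution feature vectors."""
    dists = np.linalg.norm(dataset - query, axis=1)
    return float(np.sort(dists)[:k].mean())

def penalize_reward(r, u, lam=1.0):
    """Uncertainty-penalized reward: r_hat = r - lambda * u_hat."""
    return r - lam * u

def should_truncate(u, eps, alpha=1.0):
    """Stop generating the synthetic rollout once uncertainty exceeds
    the (optionally scaled) threshold alpha * eps."""
    return u > alpha * eps
```

During rollout generation, each model-generated sample would be scored with `knn_uncertainty`; the rollout is cut short as soon as `should_truncate` fires, and surviving transitions are stored with their penalized rewards.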