Uncertainty-aware Reward Design Process

Authors: Yang Yang, Xiaolu Zhou, Bosong Ding, Miao Xin

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive evaluation of URDP across 35 diverse tasks spanning three benchmark environments: Isaac Gym, Bidexterous Manipulation, and ManiSkill2. Our experimental results demonstrate that URDP not only generates higher-quality reward functions but also achieves significant improvements in the efficiency of automated reward design compared to existing approaches.
Researcher Affiliation | Academia | Yang Yang (EMAIL), National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; Xiaolu Zhou (EMAIL), School of Mathematical Sciences, Beijing Normal University; Bosong Ding (EMAIL), Air-Lab, Tilburg University; Miao Xin (EMAIL), National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Pseudocode | Yes | Algorithm 1: Uncertainty-aware Reward Design Process
Open Source Code | Yes | We open-source all code at https://github.com/Yy12136/URDP.
Open Datasets | Yes | Our environments consist of three benchmarks: Isaac, Dexterity, and ManiSkill2. They comprise 35 different tasks. Nine of these tasks are from the original Isaac Gym environment (Nasir et al., 2024) (Isaac), twenty are complex bi-manual tasks (Chen et al., 2022) (Dexterity), and the remaining six are from the ManiSkill2 environment (Gu et al.).
Dataset Splits | Yes | To ensure fair comparison with the baseline methods, the success rates for ManiSkill2 tasks are calculated using the last 50% of test results from each evaluation, while full test results are used for Dexterity tasks.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU types, or memory specifications. It mentions environments like Isaac Gym but not the hardware setup for running experiments within them.
Software Dependencies | Yes | All experiments and comparative analyses presented in this paper utilize DeepSeek-v3-241226 (Liu et al., 2024) as the foundational model unless explicitly stated otherwise (see App. G.2 for more open-source LLM results). For both Isaac and Dexterity environments, we employ the same high-efficiency PPO (Schulman et al., 2017) implementation as used in Eureka... In the ManiSkill2 environment, we utilized both SAC (Haarnoja et al., 2018) and PPO algorithms... The URDP utilizes the BGE-M3 model (Xiao et al., 2024) for the purpose of semantic similarity assessment... In Figure 13, we compare the performance of URDP with DeepSeek-v3-241226 (the results reported in the paper), URDP with Qwen2.5 (qwen-max-0919) (Qwen et al., 2025), and Llama3 (llama-v3-70b-instruct) (Dubey et al., 2024).
Experiment Setup | Yes | (Sec. 5.2 Experimental Setup; App. C Implementation Details; App. C.2 Hyper-parameter Settings) All hyperparameters in URDP are listed in Table 3. The reinforcement learning algorithms employed for validation maintain the default configurations specified for each respective environment, with all hyperparameters comprehensively documented in Tables 4 and 5.
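The Dataset Splits row describes two evaluation conventions: ManiSkill2 success rates are averaged over only the last 50% of test results, while Dexterity uses the full run. A minimal sketch of that computation is below; this is not the authors' code, and the `success_rate` function, its `last_fraction` parameter, and the example episode list are all hypothetical illustrations of the described convention.

```python
# Illustrative sketch only (not from the URDP repository): averaging
# per-episode success flags over the trailing fraction of an evaluation.

def success_rate(results, last_fraction=1.0):
    """Mean success over the trailing `last_fraction` of evaluation results."""
    if not 0.0 < last_fraction <= 1.0:
        raise ValueError("last_fraction must be in (0, 1]")
    # Keep only the tail of the evaluation, e.g. the last 50% for ManiSkill2.
    tail = results[len(results) - int(len(results) * last_fraction):]
    return sum(tail) / len(tail)

episodes = [False, False, True, False, True, True, True, True]  # hypothetical data
dexterity_style = success_rate(episodes)         # full test results -> 0.625
maniskill_style = success_rate(episodes, 0.5)    # last 50% only -> 1.0
```

Using the tail of training-time evaluations rewards policies that have converged, rather than penalizing early exploration episodes.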
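The Software Dependencies row notes that URDP uses the BGE-M3 model for semantic similarity assessment. The comparison step on top of such embeddings is typically a cosine similarity; the sketch below shows only that step on hypothetical embedding vectors (in practice the vectors would come from the BGE-M3 encoder, which is not invoked here).

```python
import math

# Illustrative sketch only: cosine similarity between two embedding
# vectors, the standard scoring step for dense-embedding models such
# as BGE-M3. The example vectors are hypothetical, not real embeddings.

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

emb_a = [0.1, 0.9, 0.2]  # hypothetical embedding of one reward component
emb_b = [0.1, 0.8, 0.3]  # hypothetical embedding of another
similarity = cosine_similarity(emb_a, emb_b)  # close to 1.0 for similar texts
```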