Uncertainty-aware Reward Design Process

Authors: Yang Yang, Xiaolu Zhou, Bosong Ding, Miao Xin

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive evaluation of URDP across 35 diverse tasks spanning three benchmark environments: Isaac Gym, Bidexterous Manipulation, and ManiSkill2. Our experimental results demonstrate that URDP not only generates higher-quality reward functions but also achieves significant improvements in the efficiency of automated reward design compared to existing approaches.
Researcher Affiliation | Academia | Yang Yang (EMAIL), National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; Xiaolu Zhou (EMAIL), School of Mathematical Sciences, Beijing Normal University; Bosong Ding (EMAIL), Air-Lab, Tilburg University; Miao Xin (EMAIL), National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Pseudocode | Yes | Algorithm 1: Uncertainty-aware Reward Design Process
Open Source Code | Yes | We open-source all code at https://github.com/Yy12136/URDP.
Open Datasets | Yes | Our environments consist of three benchmarks: Isaac, Dexterity, and ManiSkill2. They comprise 35 different tasks. Nine of these tasks are from the original Isaac Gym environment (Nasir et al., 2024) (Isaac), twenty are complex bi-manual tasks (Chen et al., 2022) (Dexterity), and the remaining six are from the ManiSkill2 environment (Gu et al.).
Dataset Splits | Yes | To ensure fair comparison with the baseline methods, the success rates for ManiSkill2 tasks are calculated using the last 50% of test results from each evaluation, while full test results are used for Dexterity tasks.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU types, or memory specifications. It mentions environments like Isaac Gym but not the hardware setup for running experiments within them.
Software Dependencies | Yes | All experiments and comparative analyses presented in this paper utilize DeepSeek-v3-241226 (Liu et al., 2024) as the foundational model unless explicitly stated otherwise (see App. G.2 for more open-source LLM results). For both Isaac and Dexterity environments, we employ the same high-efficiency PPO (Schulman et al., 2017) implementation as used in Eureka... In the ManiSkill2 environment, we utilized both SAC (Haarnoja et al., 2018) and PPO algorithms... The URDP utilizes the BGE-M3 model (Xiao et al., 2024) for the purpose of semantic similarity assessment... In Figure 13, we compare the performance of URDP with DeepSeek-v3-241226 (the results reported in the paper), URDP with Qwen2.5 (qwen-max-0919) (Qwen et al., 2025), and Llama3 (llama-v3-70b-instruct) (Dubey et al., 2024).
Experiment Setup | Yes | (Sec. 5.2 Experimental Setup; App. C Implementation Details; App. C.2 Hyper-parameter Settings) All hyperparameters in URDP are listed in Table 3. The reinforcement learning algorithms employed for validation maintain the default configurations specified for each respective environment, with all hyperparameters comprehensively documented in Tables 4 and 5.
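The Dataset Splits row describes two evaluation conventions: ManiSkill2 success rates are averaged over only the last 50% of test results, while Dexterity uses the full run. A minimal sketch of that computation is below; this is not the authors' code, and the `success_rate` function, its `last_fraction` parameter, and the example episode list are all hypothetical illustrations of the described convention.

```python
# Illustrative sketch only (not from the URDP repository): averaging
# per-episode success flags over the trailing fraction of an evaluation.

def success_rate(results, last_fraction=1.0):
    """Mean success over the trailing `last_fraction` of evaluation results."""
    if not 0.0 < last_fraction <= 1.0:
        raise ValueError("last_fraction must be in (0, 1]")
    # Keep only the tail of the evaluation, e.g. the last 50% for ManiSkill2.
    tail = results[len(results) - int(len(results) * last_fraction):]
    return sum(tail) / len(tail)

episodes = [False, False, True, False, True, True, True, True]  # hypothetical data
dexterity_style = success_rate(episodes)         # full test results -> 0.625
maniskill_style = success_rate(episodes, 0.5)    # last 50% only -> 1.0
```

Using the tail of training-time evaluations rewards policies that have converged, rather than penalizing early exploration episodes.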
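The Software Dependencies row notes that URDP uses the BGE-M3 model for semantic similarity assessment. The comparison step on top of such embeddings is typically a cosine similarity; the sketch below shows only that step on hypothetical embedding vectors (in practice the vectors would come from the BGE-M3 encoder, which is not invoked here).

```python
import math

# Illustrative sketch only: cosine similarity between two embedding
# vectors, the standard scoring step for dense-embedding models such
# as BGE-M3. The example vectors are hypothetical, not real embeddings.

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

emb_a = [0.1, 0.9, 0.2]  # hypothetical embedding of one reward component
emb_b = [0.1, 0.8, 0.3]  # hypothetical embedding of another
similarity = cosine_similarity(emb_a, emb_b)  # close to 1.0 for similar texts
```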