Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning
Authors: Haozhe Ma, Zhengding Luo, Thanh Vinh Vo, Kuankuan Sima, Tze-Yun Leong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method, validated on tasks with extremely sparse rewards, improves sample efficiency and convergence stability over relevant baselines. We evaluate SASR in high-dimensional environments, including four MuJoCo tasks (Todorov et al., 2012), four robotic tasks (de Lazcano et al., 2023), five Atari games, including the well-known Montezuma's Revenge (Bellemare et al., 2013), and a physical simulation task (Towers et al., 2023), as shown in Figure 2. All tasks provide extremely sparse rewards, with a reward of 1 granted only upon reaching the final objective within the maximum permitted steps. To ensure robust validation, we run 10 instances per setting with different random seeds and report the average results. We also maintain consistent hyperparameters and network architectures across all tasks, detailed in Appendix A.6. Figure 3 shows the learning performance of SASR compared with the baselines, while Table 1 reports the average episodic returns with standard errors achieved by the final models over 100 episodes. Our findings indicate that SASR surpasses the baselines in terms of sample efficiency, learning stability, and convergence speed. |
| Researcher Affiliation | Academia | Haozhe Ma, School of Computing, National University of Singapore; Zhengding Luo, School of Electrical and Electronic Engineering, Nanyang Technological University; Thanh Vinh Vo, School of Computing, National University of Singapore; Kuankuan Sima, Department of Electrical and Computer Engineering, National University of Singapore; Tze-Yun Leong, School of Computing, National University of Singapore |
| Pseudocode | Yes | Algorithm 1 Self-Adaptive Success Rate based Reward Shaping |
| Open Source Code | Yes | The source code is accessible at: https://github.com/mahaozhe/SASR |
| Open Datasets | Yes | We evaluate SASR in high-dimensional environments, including four MuJoCo tasks (Todorov et al., 2012), four robotic tasks (de Lazcano et al., 2023), five Atari games, including the well-known Montezuma's Revenge (Bellemare et al., 2013), and a physical simulation task (Towers et al., 2023), as shown in Figure 2. |
| Dataset Splits | No | The paper does not provide traditional training/test/validation dataset splits. In reinforcement learning, the agent interacts with an environment to generate experience, rather than operating on static datasets with predefined splits. The paper mentions running '10 instances per setting with different random seeds' for robust validation, which relates to experimental runs, not dataset splits. |
| Hardware Specification | Yes | The experiments in this paper were conducted on a computing cluster, with the detailed hardware configurations listed in Table 13. Table 13: Component Specification ... Central Processing Unit (CPU) 2x Intel Xeon Gold 6326 ... Graphics Processing Unit (GPU) 1x NVIDIA A100 20GB |
| Software Dependencies | No | The paper mentions the operating system (Ubuntu 20.04) and specific environments like MuJoCo, Gymnasium Robotics, and Atari games, but does not provide specific version numbers for software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other solvers. |
| Experiment Setup | Yes | Table 12 shows the set of hyperparameters that we used in all of our experiments. Table 12: The hyperparameters used in the SASR algorithm: reward weight λ (default) 0.6; kernel function bandwidth 0.2; random Fourier features dimension M 1000; retention rate ϕ (default) 0.1; discount factor γ 0.99; replay buffer size \|D\| 1×10⁶; batch size 256; actor module learning rate 3×10⁻⁴; critic module learning rate 1×10⁻³; SAC entropy term factor α learning rate 1×10⁻⁴; policy networks update frequency (steps) 2; target networks update frequency (steps) 1; target networks soft update weight τ 5×10⁻³; burn-in steps 5000 |
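The hyperparameter listing quoted from the paper's Table 12 can be collected into a plain configuration dictionary, which makes the exponents unambiguous. This is a minimal sketch: the key names and the `validate` helper are our own illustrative choices, and only the numeric values come from the report.

```python
# Illustrative config mirroring the SASR hyperparameters quoted from
# Table 12. Key names are hypothetical; values are taken from the report.
SASR_CONFIG = {
    "reward_weight_lambda": 0.6,      # reward weight λ (default)
    "kernel_bandwidth": 0.2,          # kernel function bandwidth
    "rff_dim_M": 1000,                # random Fourier features dimension M
    "retention_rate_phi": 0.1,        # retention rate ϕ (default)
    "discount_gamma": 0.99,           # discount factor γ
    "replay_buffer_size": 1_000_000,  # |D| = 1e6
    "batch_size": 256,
    "actor_lr": 3e-4,                 # actor module learning rate
    "critic_lr": 1e-3,                # critic module learning rate
    "alpha_lr": 1e-4,                 # SAC entropy term factor α learning rate
    "policy_update_freq_steps": 2,    # policy networks update frequency
    "target_update_freq_steps": 1,    # target networks update frequency
    "target_soft_update_tau": 5e-3,   # soft update weight τ
    "burn_in_steps": 5000,
}


def validate(cfg: dict) -> bool:
    """Sanity-check ranges one would expect for these settings."""
    return (
        0.0 < cfg["discount_gamma"] < 1.0
        and 0.0 < cfg["target_soft_update_tau"] < 1.0
        and cfg["replay_buffer_size"] >= cfg["batch_size"]
    )


print(validate(SASR_CONFIG))  # -> True
```

Keeping every run keyed off one such dictionary is consistent with the paper's claim that hyperparameters are held fixed across all tasks.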