Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning
Authors: Haozhe Ma, Zhengding Luo, Thanh Vinh Vo, Kuankuan Sima, Tze-Yun Leong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method, validated on tasks with extremely sparse rewards, improves sample efficiency and convergence stability over relevant baselines. We evaluate SASR in high-dimensional environments, including four MuJoCo tasks (Todorov et al., 2012), four robotic tasks (de Lazcano et al., 2023), five Atari games, including the well-known Montezuma's Revenge (Bellemare et al., 2013), and a physical simulation task (Towers et al., 2023), as shown in Figure 2. All tasks provide extremely sparse rewards, with a reward of 1 granted only upon reaching the final objective within the maximum permitted steps. To ensure robust validation, we run 10 instances per setting with different random seeds and report the average results. We also maintain consistent hyperparameters and network architectures across all tasks, detailed in Appendix A.6. Figure 3 shows the learning performance of SASR compared with the baselines, while Table 1 reports the average episodic returns with standard errors achieved by the final models over 100 episodes. Our findings indicate that SASR surpasses the baselines in terms of sample efficiency, learning stability, and convergence speed. |
| Researcher Affiliation | Academia | Haozhe Ma, School of Computing, National University of Singapore; Zhengding Luo, School of Electrical and Electronic Engineering, Nanyang Technological University; Thanh Vinh Vo, School of Computing, National University of Singapore; Kuankuan Sima, Department of Electrical and Computer Engineering, National University of Singapore; Tze-Yun Leong, School of Computing, National University of Singapore |
| Pseudocode | Yes | Algorithm 1 Self-Adaptive Success Rate based Reward Shaping |
| Open Source Code | Yes | The source code is accessible at: https://github.com/mahaozhe/SASR |
| Open Datasets | Yes | We evaluate SASR in high-dimensional environments, including four MuJoCo tasks (Todorov et al., 2012), four robotic tasks (de Lazcano et al., 2023), five Atari games, including the well-known Montezuma's Revenge (Bellemare et al., 2013), and a physical simulation task (Towers et al., 2023), as shown in Figure 2. |
| Dataset Splits | No | The paper does not provide traditional training/test/validation dataset splits. In reinforcement learning, the agent interacts with an environment to generate experience, rather than operating on static datasets with predefined splits. The paper mentions running '10 instances per setting with different random seeds' for robust validation, which relates to experimental runs, not dataset splits. |
| Hardware Specification | Yes | The experiments in this paper were conducted on a computing cluster, with the detailed hardware configurations listed in Table 13. Table 13: Component Specification ... Central Processing Unit (CPU) 2x Intel Xeon Gold 6326 ... Graphics Processing Unit (GPU) 1x NVIDIA A100 20GB |
| Software Dependencies | No | The paper mentions the operating system (Ubuntu 20.04) and specific environments like MuJoCo, Gymnasium Robotics, and Atari games, but does not provide specific version numbers for software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other solvers. |
| Experiment Setup | Yes | Table 12 shows the set of hyperparameters that we used in all of our experiments. Table 12: The hyperparameters used in the SASR algorithm: reward weight λ (default) 0.6; kernel function bandwidth 0.2; random Fourier features dimension M 1000; retention rate ϕ (default) 0.1; discount factor γ 0.99; replay buffer size \|D\| 1×10⁶; batch size 256; actor module learning rate 3×10⁻⁴; critic module learning rate 1×10⁻³; SAC entropy term factor α learning rate 1×10⁻⁴; policy networks update frequency (steps) 2; target networks update frequency (steps) 1; target networks soft update weight τ 5×10⁻³; burn-in steps 5000 |
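The hyperparameter listing quoted from the paper's Table 12 can be collected into a plain configuration dictionary, which makes the exponents unambiguous. This is a minimal sketch: the key names and the `validate` helper are our own illustrative choices, and only the numeric values come from the report.

```python
# Illustrative config mirroring the SASR hyperparameters quoted from
# Table 12. Key names are hypothetical; values are taken from the report.
SASR_CONFIG = {
    "reward_weight_lambda": 0.6,      # reward weight λ (default)
    "kernel_bandwidth": 0.2,          # kernel function bandwidth
    "rff_dim_M": 1000,                # random Fourier features dimension M
    "retention_rate_phi": 0.1,        # retention rate ϕ (default)
    "discount_gamma": 0.99,           # discount factor γ
    "replay_buffer_size": 1_000_000,  # |D| = 1e6
    "batch_size": 256,
    "actor_lr": 3e-4,                 # actor module learning rate
    "critic_lr": 1e-3,                # critic module learning rate
    "alpha_lr": 1e-4,                 # SAC entropy term factor α learning rate
    "policy_update_freq_steps": 2,    # policy networks update frequency
    "target_update_freq_steps": 1,    # target networks update frequency
    "target_soft_update_tau": 5e-3,   # soft update weight τ
    "burn_in_steps": 5000,
}


def validate(cfg: dict) -> bool:
    """Sanity-check ranges one would expect for these settings."""
    return (
        0.0 < cfg["discount_gamma"] < 1.0
        and 0.0 < cfg["target_soft_update_tau"] < 1.0
        and cfg["replay_buffer_size"] >= cfg["batch_size"]
    )


print(validate(SASR_CONFIG))  # -> True
```

Keeping every run keyed off one such dictionary is consistent with the paper's claim that hyperparameters are held fixed across all tasks.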