Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Action-Dependent Optimality-Preserving Reward Shaping
Authors: Grant Collier Forbes, Jianxun Wang, Leonardo Villalobos-Arias, Arnav Jhala, David Roberts
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test ADOPS, as well as prior optimality-preserving reward shaping methods in the Montezuma's Revenge (Bellemare et al., 2013) Atari Learning Environment (ALE) with RND IM (Burda et al., 2019). We find that PBIM, GRM, and PIES all fail to converge to a policy that outperforms the policy trained on the baseline IM. We tested several versions of ADOPS in this environment and found that all versions tested achieve higher performance than the baseline IM. We provide details of our experiments in Appendix A.1. |
| Researcher Affiliation | Academia | Department of Computer Science, North Carolina State University, Raleigh, USA. Correspondence to: Grant C. Forbes <EMAIL>, Jianxun Wang <EMAIL>, Leonardo Villalobos-Arias <EMAIL>, Arnav Jhala <EMAIL>, David L. Roberts <EMAIL>. |
| Pseudocode | No | The paper describes mathematical conditions and derivations (e.g., Equation 15, 16, and proofs in Appendix B.2), but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We use the implementation in PPO-RND (Kazemipour, 2022) as an initial codebase. PPO-RND does not have an active license. We conducted experiments using the gymnasium API simulated by the Arcade Learning Environment (ALE) (Bellemare et al., 2013) for Montezuma's Revenge. |
| Open Datasets | Yes | We test ADOPS, as well as prior optimality-preserving reward shaping methods in the Montezuma's Revenge (Bellemare et al., 2013) Atari Learning Environment (ALE) with RND IM (Burda et al., 2019). |
| Dataset Splits | No | The paper uses the Montezuma's Revenge Atari Learning Environment but does not specify any training/test/validation splits for a dataset. It focuses on the experimental setup within the environment. |
| Hardware Specification | Yes | We conducted our experiments on two servers with Ubuntu 22.04. One server has 12 Intel(R) Xeon(R) CPU E5-1650, 2 NVIDIA GeForce GTX 1080, and 32 GiB memory. We run two experiments concurrently on it. The other server has 12 AMD EPYC 7401P CPU, 1 NVIDIA TITAN RTX, and 30 GiB memory. Each run takes around 30 hours. |
| Software Dependencies | No | Our base model for all trained agents was the PPO algorithm (Schulman et al., 2017), with additional intrinsic rewards from the RND network (Burda et al., 2019). We use the implementation in PPO-RND (Kazemipour, 2022) as an initial codebase. We conducted experiments using the gymnasium API simulated by the Arcade Learning Environment (ALE) (Bellemare et al., 2013) for Montezuma's Revenge. We use ALE/MontezumaRevenge-v4 in our experiments. Gymnasium API is under the MIT license and ALE is under the GPL-2.0 license. The paper mentions various software components and frameworks but does not provide specific version numbers for these software dependencies (e.g., Python version, specific PPO library version, RND library version). |
| Experiment Setup | Yes | Our base model for all trained agents was the PPO algorithm (Schulman et al., 2017), with additional intrinsic rewards from the RND network (Burda et al., 2019). We use the same hyperparameters as the 32-worker Convolutional Neural Net (CNN) runs in (Burda et al., 2019)... For methods requiring normalization, such as PBIM Norm, we used exponential smoothing with α = 0.05. For all ADOPS variants, we used ϵ = 1e-7. We use the same maximum episode length and intrinsic discount γI values as (Burda et al., 2019), which are 4,500 and 0.99, respectively. We tested PIES with a ζ decay rate 1/C = 1/15000. |
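The experiment-setup row quotes the paper as using exponential smoothing with α = 0.05 for normalization (e.g., in PBIM Norm). A minimal sketch of that update rule is below; the function name and the example reward stream are illustrative, not taken from the paper's codebase.

```python
# Exponential smoothing of a running estimate, with the alpha = 0.05
# value the paper reports for its normalization. Hypothetical interface;
# the paper does not publish its implementation.
ALPHA = 0.05

def smooth(prev_estimate: float, new_value: float, alpha: float = ALPHA) -> float:
    """Return the smoothed estimate s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    return alpha * new_value + (1 - alpha) * prev_estimate

# Example: smoothing a short stream of intrinsic-reward magnitudes.
estimate = 0.0
for reward in [1.0, 1.0, 1.0]:
    estimate = smooth(estimate, reward)
```

With α = 0.05 the estimate moves slowly toward new observations, which matches the use of smoothing as a slowly adapting normalization statistic rather than a per-step average.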