Bootstrapped Reward Shaping
Authors: Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Next, we provide experiments in both tabular and continuous domains with the use of deep neural networks. We find that the use of this simple but dynamical potential function can improve sample complexity, even in complex image-based Atari tasks (Bellemare et al. 2013). We show an experimental advantage in using BSRS in tabular grid-worlds, the Arcade Learning Environment, and locomotion tasks. |
| Researcher Affiliation | Academia | Jacob Adamczyk1,2, Volodymyr Makarenko3, Stas Tiomkin4, Rahul V. Kulkarni1,2 1Department of Physics, University of Massachusetts Boston 2The NSF Institute for Artificial Intelligence and Fundamental Interactions 3Department of Computer Engineering, San José State University 4Department of Computer Science, Whitacre College of Engineering, Texas Tech University EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes theoretical properties, algorithms like DQN and TD3, and mathematical equations (e.g., Bellman equation), but it does not include a clearly labeled pseudocode block or algorithm section with structured steps. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/JacobHA/ShapedRL. |
| Open Datasets | Yes | Next, we provide experiments in both tabular and continuous domains with the use of deep neural networks. We find that the use of this simple but dynamical potential function can improve sample complexity, even in complex image-based Atari tasks (Bellemare et al. 2013). For more complex environments, we use the Arcade Learning Environment suite (Bellemare et al. 2013). To test BSRS in the continuous action setting, we use Pendulum-v1 by extending an implementation of TD3 (Raffin et al. 2021). |
| Dataset Splits | No | The paper evaluates performance on various environments (tabular grid-worlds, the Atari suite, Pendulum-v1) and reports experiments over multiple seeds/random initializations (e.g., "run with five seeds", "averaged over 20 runs"), but it does not specify explicit training/validation/test dataset splits, which do not apply in the traditional sense to these RL environments. |
| Hardware Specification | No | JA would like to acknowledge the use of the supercomputing facilities managed by the Research Computing Department at UMass Boston; the Unity high-performance computing cluster; and funding support from the Alliance Innovation Lab Silicon Valley. This statement refers to general computing facilities and a cluster but lacks specific hardware details like GPU/CPU models or memory configurations. |
| Software Dependencies | No | In our experiments, we consider the self-shaped version of the vanilla value-based algorithm DQN (Mnih et al. 2015; Raffin et al. 2021). To test BSRS in the continuous action setting, we use Pendulum-v1 by extending an implementation of TD3 (Raffin et al. 2021). While the paper mentions and cites algorithms and frameworks like DQN, TD3, and Stable-Baselines3 (Raffin et al. 2021), it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Each environment, for each shape-scale parameter η ∈ {0, 0.5, 1, 2, 3, 5, 10}, is run with five seeds, and the best (in terms of mean score) non-zero η value is chosen. Learning curves for 10M steps in the Atari suite. Results for each η value were averaged over 20 runs (standard error indicated by shaded region). |
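For context on what the rows above assess, here is a minimal sketch of the paper's core idea as described in the excerpts: potential-based reward shaping where the potential is the agent's own current value estimate ("this simple but dynamical potential function"), scaled by the shape-scale parameter η that the experiment-setup row sweeps over. The toy chain environment, hyperparameters, and the exact form Φ(s) = η · maxₐ Q(s, a) are illustrative assumptions, not the paper's implementation (which uses DQN/TD3 via Stable-Baselines3).

```python
import numpy as np

def bsrs_q_learning(n_states=8, n_actions=2, eta=1.0, gamma=0.99,
                    alpha=0.1, epsilon=0.3, episodes=200, seed=0):
    """Tabular Q-learning with bootstrapped self-shaped rewards (sketch).

    Assumption: the dynamical potential is the agent's current value
    estimate, Phi(s) = eta * max_a Q(s, a), plugged into standard
    potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    # Toy chain MDP (illustrative): action 1 moves right, action 0 stays;
    # reward 1 only on reaching the terminal rightmost state.
    for _ in range(episodes):
        s = 0
        for _ in range(50):
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next = min(s + 1, n_states - 1) if a == 1 else s
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Bootstrapped potential from the current value estimate;
            # eta = 0 recovers the unshaped baseline.
            phi_s = eta * Q[s].max()
            phi_next = eta * Q[s_next].max()
            r_shaped = r + gamma * phi_next - phi_s
            Q[s, a] += alpha * (r_shaped + gamma * Q[s_next].max() - Q[s, a])
            if s_next == n_states - 1:
                break  # terminal state: episode ends
            s = s_next
    return Q

Q = bsrs_q_learning(eta=1.0)
```

The sweep in the experiment-setup row then amounts to running this for each η in {0, 0.5, 1, 2, 3, 5, 10} across seeds and keeping the best-scoring non-zero η.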