Bootstrapped Reward Shaping

Authors: Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Next, we provide experiments in both tabular and continuous domains with the use of deep neural networks. We find that the use of this simple but dynamical potential function can improve sample complexity, even in complex image-based Atari tasks (Bellemare et al. 2013). We show an experimental advantage in using BSRS in tabular grid-worlds, the Arcade Learning Environment, and locomotion tasks.
Researcher Affiliation | Academia | Jacob Adamczyk1,2, Volodymyr Makarenko3, Stas Tiomkin4, Rahul V. Kulkarni1,2 1Department of Physics, University of Massachusetts Boston 2The NSF Institute for Artificial Intelligence and Fundamental Interactions 3Department of Computer Engineering, San José State University 4Department of Computer Science, Whitacre College of Engineering, Texas Tech University EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes theoretical properties, algorithms like DQN and TD3, and mathematical equations (e.g., Bellman equation), but it does not include a clearly labeled pseudocode block or algorithm section with structured steps.
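Since the paper provides no pseudocode block, a minimal sketch of the shaping rule it describes may help orient readers. This assumes the standard potential-based form r̃ = r + γΦ(s′) − Φ(s) with a bootstrapped potential Φ(s) = η·V(s) built from the agent's current value estimate; the function and parameter names (`shaped_reward`, `value_fn`, `eta`) are illustrative, not the authors' code.

```python
import numpy as np

def shaped_reward(r, s, s_next, value_fn, eta=1.0, gamma=0.99, done=False):
    """Potential-based shaping with a bootstrapped potential Phi(s) = eta * V(s).

    `value_fn` is the agent's *current* value estimate, so the potential
    changes as learning progresses (a dynamical potential). `eta` is the
    shape-scale parameter the paper sweeps over {0, 0.5, 1, 2, 3, 5, 10}.
    """
    phi = eta * value_fn(s)
    phi_next = 0.0 if done else eta * value_fn(s_next)
    return r + gamma * phi_next - phi

# Tabular example: a small value table standing in for a learned estimate.
V = np.array([0.0, 1.0, 2.0])
r_tilde = shaped_reward(r=1.0, s=0, s_next=2,
                        value_fn=lambda s: V[s], eta=0.5, gamma=0.9)
```

Note that setting `eta=0.0` recovers the unshaped reward, which is why the sweep in the paper's experiment setup includes η = 0 as a baseline.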
Open Source Code | Yes | Our code is publicly available at https://github.com/JacobHA/ShapedRL.
Open Datasets | Yes | Next, we provide experiments in both tabular and continuous domains with the use of deep neural networks. We find that the use of this simple but dynamical potential function can improve sample complexity, even in complex image-based Atari tasks (Bellemare et al. 2013). For more complex environments, we use the Arcade Learning Environment suite (Bellemare et al. 2013). To test BSRS in the continuous action setting, we use Pendulum-v1 by extending an implementation of TD3 (Raffin et al. 2021).
Dataset Splits | No | The paper evaluates performance on various environments (tabular grid-worlds, Atari suite, Pendulum-v1) and mentions running experiments with multiple seeds/random initializations (e.g., "run with five seeds", "averaged over 20 runs"), but it does not specify explicit training, validation, or test dataset splits for these environments or any datasets in the traditional sense.
Hardware Specification | No | JA would like to acknowledge the use of the supercomputing facilities managed by the Research Computing Department at UMass Boston; the Unity high-performance computing cluster; and funding support from the Alliance Innovation Lab Silicon Valley. This statement refers to general computing facilities and a cluster but lacks specific hardware details like GPU/CPU models or memory configurations.
Software Dependencies | No | In our experiments, we consider the self-shaped version of the vanilla value-based algorithm DQN (Mnih et al. 2015; Raffin et al. 2021). To test BSRS in the continuous action setting, we use Pendulum-v1 by extending an implementation of TD3 (Raffin et al. 2021). While the paper mentions and cites algorithms and frameworks like DQN, TD3, and Stable-Baselines3 (Raffin et al. 2021), it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Each environment, for each shape-scale parameter η ∈ {0, 0.5, 1, 2, 3, 5, 10}, is run with five seeds, and the best (in terms of mean score) non-zero η value is chosen. Learning curves for 10M steps in the Atari suite. Results for each η value were averaged over 20 runs (standard error indicated by shaded region).
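The η-selection protocol quoted above (five seeds per η, best non-zero η by mean score) can be sketched as a short sweep loop. This is a toy illustration, not the authors' pipeline: `run_experiment` is a stand-in for a full training run, and its scoring function is fabricated purely so the script executes.

```python
import numpy as np

ETAS = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0]  # shape-scale grid from the paper
N_SEEDS = 5

def run_experiment(eta, seed):
    """Stand-in for training one agent and returning its final mean score.

    Toy surrogate: pretends eta = 2 is optimal, plus small seed noise.
    """
    rng = np.random.default_rng(seed)
    return -abs(eta - 2.0) + 0.1 * rng.standard_normal()

# Mean score over seeds for each eta, then the best *non-zero* eta,
# mirroring the selection rule quoted above (eta = 0 is the unshaped baseline).
mean_scores = {eta: float(np.mean([run_experiment(eta, s) for s in range(N_SEEDS)]))
               for eta in ETAS}
best_eta = max((eta for eta in ETAS if eta != 0.0), key=mean_scores.get)
```

Excluding η = 0 from the argmax matches the report's reading of the setup: the baseline is kept for comparison but never "selected" as a shaped configuration.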