Automatic Reward Shaping from Confounded Offline Data

Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Simulations support the theoretical findings. In this section, we show simulation results verifying that: (1) Q-UCB with our proposed shaping function enjoys better sample efficiency, and (2) the policy learned by our shaping pipeline at convergence is the optimal policy for an interventional agent.
Researcher Affiliation | Academia | 1 Causal AI Lab, Columbia University, New York, USA; 2 Department of Electrical Engineering and Computer Science, Syracuse University, New York, USA. Correspondence to: Mingxuan Li <EMAIL>.
Pseudocode | Yes | Algo. 2 in App. C shows the full pseudo-code for approximating the optimal value upper bound from offline datasets. Details of the algorithm are described in Algo. 1. See also App. F for the pseudo-code of the vanilla Q-UCB.
Open Source Code | No | The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We test those algorithms in a series of customized windy Mini Grid environments (Zhang & Bareinboim, 2024; Chevalier-Boisvert et al., 2018). ... Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for gymnasium, 2018. URL https://github.com/Farama-Foundation/Minigrid.
Dataset Splits | No | The paper describes the collection of 'offline datasets' and the 'data-generating process' for different behavioral policies, but it does not specify any training, test, or validation splits for the experimental evaluation of their methods.
Hardware Specification | Yes | All of our experiment results are obtained from a 2021 Mac Book Pro with M1 chip and 32GB memory.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | Q-UCB (Jin et al., 2018), to leverage the potential function ϕ extrapolated from offline data. Details of the algorithm are described in Algo. 1. Compared with the original Q-UCB, we make a few modifications for Q-UCB to work with PBRS: (1) zero-initializing the Q-values; (2) using a potential-function-dependent UCB bonus and value clipping; and finally, (3) incorporating the shaped reward during learning updates. ... the episode length is set to 15 while the Lava Cross series has a horizon of 20. To compensate for the hard exploration situation, we allow random initial starting states over the whole walkable area of the map. For training steps, we set a total of 100K environment steps for Windy Empty World and 20K for the Lava Cross series. ... There is a step penalty of 0.1, +0.2 for getting a coin, 0 for reaching the goal, and -1 for touching the lava.
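The three modifications quoted above (zero-initialized Q-values, a potential-dependent UCB bonus and clipping, and a shaped reward in the update) can be sketched as a single tabular update step. This is a minimal illustrative sketch, not the paper's Algorithm 1: the bonus constants, the clipping bound `H - phi[s]`, and the learning-rate schedule (taken from the standard Q-UCB of Jin et al., 2018) are assumptions.

```python
import numpy as np

def q_ucb_pbrs_update(Q, N, s, a, r, s_next, phi, H, c=1.0, gamma=1.0):
    """One tabular Q-UCB update with potential-based reward shaping (PBRS).

    Q   : (num_states, num_actions) array, zero-initialized (modification 1)
    N   : visit counts per (state, action) pair
    phi : potential function over states, extrapolated from offline data
    H   : episode horizon; c, gamma : illustrative constants
    """
    N[s, a] += 1
    t = N[s, a]
    alpha = (H + 1) / (H + t)                      # Q-UCB learning rate (Jin et al., 2018)
    bonus = c * np.sqrt(H**3 * np.log(1e4) / t)    # UCB exploration bonus (constants illustrative)
    shaped_r = r + gamma * phi[s_next] - phi[s]    # modification 3: PBRS shaped reward
    target = shaped_r + gamma * np.max(Q[s_next]) + bonus
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    # modification 2: clip with a potential-dependent upper bound (assumed form)
    Q[s, a] = min(Q[s, a], H - phi[s])
    return Q, N
```

With a good potential ϕ, the shaping term leaves the optimal policy unchanged while the tighter clipping bound shrinks the optimistic Q-values, which is the mechanism behind the claimed sample-efficiency gain.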