Automatic Reward Shaping from Confounded Offline Data
Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulations support the theoretical findings. In this section, we show simulation results verifying that: (1) Q-UCB with our proposed shaping function enjoys better sample efficiency, and (2) the policy learned by our shaping pipeline at convergence is the optimal policy for an interventional agent. |
| Researcher Affiliation | Academia | (1) Causal AI Lab, Columbia University, New York, USA; (2) Department of Electrical Engineering and Computer Science, Syracuse University, New York, USA. Correspondence to: Mingxuan Li <EMAIL>. |
| Pseudocode | Yes | Algo. 2 in App. C shows the full pseudo-code for approximating the optimal value upper bound from offline datasets. Details of the algorithm are described in Algo. 1. See also App. F for the pseudo-code of the vanilla Q-UCB. |
| Open Source Code | No | The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We test those algorithms in a series of customized windy Mini Grid environments (Zhang & Bareinboim, 2024; Chevalier-Boisvert et al., 2018). ... Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for gymnasium, 2018. URL https://github.com/Farama-Foundation/Minigrid. |
| Dataset Splits | No | The paper describes the collection of 'offline datasets' and 'data-generating process' for different behavioral policies, but it does not specify any training, test, or validation splits for the experimental evaluation of their methods. |
| Hardware Specification | Yes | All of our experiment results are obtained from a 2021 MacBook Pro with an M1 chip and 32GB memory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Q-UCB (Jin et al., 2018), to leverage the potential function ϕ extrapolated from offline data. Details of the algorithm are described in Algo. 1. Compared with the original Q-UCB, we make a few modifications for Q-UCB to work with PBRS: (1) zero-initializing the Q-values; (2) using a potential-function-dependent UCB bonus and value clipping; and finally, (3) incorporating the shaped reward during learning updates. ... the episode length is set to 15 while the Lava Cross series has a horizon of 20. To compensate for the hard exploration situation, we allow random initial starting states over the whole map walkable area. For training steps, we set a total of 100K environment steps for Windy Empty World and 20K for the Lava Cross series. ... There is a step penalty of 0.1, +0.2 for getting a coin, 0 for reaching the goal, and -1 for touching the lava. |
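To make the quoted setup concrete, below is a minimal sketch of the three quoted modifications: zero-initialized Q-values, a count-based UCB-style bonus, and a potential-based shaped reward (r + γ·ϕ(s′) − ϕ(s)). This is not the paper's Algo. 1: the potential function, toy chain environment, bonus constant, and hyperparameters are all illustrative assumptions, and the paper's value-clipping step is omitted.

```python
import numpy as np

def q_ucb_pbrs(n_states, n_actions, step, phi, horizon,
               episodes=300, gamma=1.0, c=1.0):
    """Toy tabular Q-learning sketch with a UCB-style exploration bonus
    and potential-based reward shaping (PBRS). `phi` maps a state to a
    potential; the shaped reward is r + gamma*phi(s') - phi(s).
    Illustrative only -- not the paper's Algo. 1 (value clipping omitted)."""
    Q = np.zeros((n_states, n_actions))   # (1) zero-initialized Q-values
    N = np.ones((n_states, n_actions))    # visit counts for the bonus
    for _ in range(episodes):
        s = 0
        for _t in range(horizon):
            a = int(np.argmax(Q[s] + c / np.sqrt(N[s])))   # (2) UCB bonus
            s2, r, done = step(s, a)
            shaped = r + gamma * phi(s2) - phi(s)          # (3) shaped reward
            alpha = 1.0 / N[s, a]                          # decaying step size
            target = shaped + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            N[s, a] += 1
            s = s2
            if done:
                break
    return Q

# Hypothetical 5-state chain: action 1 moves right toward the goal (state 4),
# with a 0.1 step penalty, loosely echoing the gridworld reward structure.
def chain_step(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == 4 else -0.1
    return s2, r, s2 == 4

Q = q_ucb_pbrs(n_states=5, n_actions=2, step=chain_step,
               phi=lambda s: float(s), horizon=10)
```

With the distance-like potential ϕ(s) = s, moving toward the goal yields a positive shaped reward immediately, so the greedy policy recovers "always go right" with far fewer visits than unshaped Q-learning would need.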