Automatic Reward Shaping from Confounded Offline Data
Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulations support the theoretical findings. In this section, we show simulation results verifying that: (1) Q-UCB with our proposed shaping function enjoys better sample efficiency, and (2) the policy learned by our shaping pipeline at convergence is the optimal policy for an interventional agent. |
| Researcher Affiliation | Academia | (1) Causal AI Lab, Columbia University, New York, USA; (2) Department of Electrical Engineering and Computer Science, Syracuse University, New York, USA. Correspondence to: Mingxuan Li <EMAIL>. |
| Pseudocode | Yes | Algo. 2 in App. C shows the full pseudo-code for approximating the optimal value upper bound from offline datasets. Details of the algorithm are described in Algo. 1. See also App. F for the pseudo-code of the vanilla Q-UCB. |
| Open Source Code | No | The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We test those algorithms in a series of customized windy Mini Grid environments (Zhang & Bareinboim, 2024; Chevalier-Boisvert et al., 2018). ... Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for gymnasium, 2018. URL https://github.com/Farama-Foundation/Minigrid. |
| Dataset Splits | No | The paper describes the collection of 'offline datasets' and 'data-generating process' for different behavioral policies, but it does not specify any training, test, or validation splits for the experimental evaluation of their methods. |
| Hardware Specification | Yes | All of our experiment results are obtained from a 2021 MacBook Pro with an M1 chip and 32GB memory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Q-UCB (Jin et al., 2018), to leverage the potential function ϕ extrapolated from offline data. Details of the algorithm are described in Algo. 1. Compared with the original Q-UCB, we make a few modifications for Q-UCB to work with PBRS: (1) zero-initializing the Q-values; (2) using a potential-function-dependent UCB bonus and value clipping; and finally, (3) incorporating the shaped reward during learning updates. ... the episode length is set to 15 while the Lava Cross series has a horizon of 20. To compensate for the hard exploration situation, we allow random initial starting states over the whole map walkable area. For training steps, we set a total of 100K environment steps for Windy Empty World and 20K for the Lava Cross series. ... There is a step penalty of 0.1, +0.2 for getting a coin, 0 for reaching the goal, and -1 for touching the lava. |
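To make the quoted setup concrete, below is a minimal sketch of the three quoted modifications: zero-initialized Q-values, a count-based UCB-style bonus, and a potential-based shaped reward (r + γ·ϕ(s′) − ϕ(s)). This is not the paper's Algo. 1: the potential function, toy chain environment, bonus constant, and hyperparameters are all illustrative assumptions, and the paper's value-clipping step is omitted.

```python
import numpy as np

def q_ucb_pbrs(n_states, n_actions, step, phi, horizon,
               episodes=300, gamma=1.0, c=1.0):
    """Toy tabular Q-learning sketch with a UCB-style exploration bonus
    and potential-based reward shaping (PBRS). `phi` maps a state to a
    potential; the shaped reward is r + gamma*phi(s') - phi(s).
    Illustrative only -- not the paper's Algo. 1 (value clipping omitted)."""
    Q = np.zeros((n_states, n_actions))   # (1) zero-initialized Q-values
    N = np.ones((n_states, n_actions))    # visit counts for the bonus
    for _ in range(episodes):
        s = 0
        for _t in range(horizon):
            a = int(np.argmax(Q[s] + c / np.sqrt(N[s])))   # (2) UCB bonus
            s2, r, done = step(s, a)
            shaped = r + gamma * phi(s2) - phi(s)          # (3) shaped reward
            alpha = 1.0 / N[s, a]                          # decaying step size
            target = shaped + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            N[s, a] += 1
            s = s2
            if done:
                break
    return Q

# Hypothetical 5-state chain: action 1 moves right toward the goal (state 4),
# with a 0.1 step penalty, loosely echoing the gridworld reward structure.
def chain_step(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == 4 else -0.1
    return s2, r, s2 == 4

Q = q_ucb_pbrs(n_states=5, n_actions=2, step=chain_step,
               phi=lambda s: float(s), horizon=10)
```

With the distance-like potential ϕ(s) = s, moving toward the goal yields a positive shaped reward immediately, so the greedy policy recovers "always go right" with far fewer visits than unshaped Q-learning would need.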