Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning
Authors: Luofeng Liao, Zuyue Fu, Zhuoran Yang, Yixin Wang, Dingli Ma, Mladen Kolar, Zhaoran Wang
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present numerical experiments on the parametric and nonparametric cases described in Section 4. The goal of this section is to verify that Algorithm 1 successfully identifies the transition model based on sequential observational data and recovers the optimal policy by planning with the estimated transition model. Importantly, we aim to quantify how the strength of the instrument affects the estimation of causal quantities in the sequential setting. All experiments in this section can be reproduced with the code at https://github.com/ChampionRecLuse/ivvi. |
| Researcher Affiliation | Academia | Luofeng Liao, Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA; Zuyue Fu, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA; Zhuoran Yang, Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA; Yixin Wang, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA; Dingli Ma, Department of Information Systems and Operations Management, Michael G. Foster School of Business, University of Washington, Seattle, WA 98195, USA; Mladen Kolar, Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA; Zhaoran Wang, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA |
| Pseudocode | Yes | Algorithm 1 IV-aided Value Iteration (IVVI). 1: Input: reward functions {r_h}_{h=1}^H, feature maps φ and ψ, iterations T, stepsizes {η^θ_t, η^ω_t}_{t=1}^T, initial estimates K_0 and W_0, variance σ², samples {(x_t, a_t, z_t, x'_t)}_{t=0}^{T−1} in Assumption A.2. 2: Phase 1 (estimation of W^sad in Eq. 3.7). 3: for t = 0, 1, ..., T−1 do 4: φ_t ← φ(x_t, a_t), ψ_t ← ψ(z_t). 5: W_{t+1} ← W_t − η^θ_t (−K_t ψ_t φ_t^⊤), K_{t+1} ← K_t + η^ω_t (−K_t ψ_t ψ_t^⊤ + x'_t ψ_t^⊤ − W_t φ_t ψ_t^⊤). 6: end for. 7: Phase 2 (value iteration). 8: V̂_{H+1}(·) ← 0, Ŵ ← W_T. 9: for h = H, H−1, ..., 1 do 10: Q̂_h(·, ·) ← r_h(·, ·) + ∫_S V̂_{h+1}(x') P_Ŵ(dx' \| ·, ·). 11: π̂_h(·) ← argmax_a Q̂_h(·, a), V̂_h(·) ← max_a Q̂_h(·, a). 12: end for. 13: Output: π̂ = {π̂_h}_{h=1}^H. |
| Open Source Code | Yes | All experiments in this section can be reproduced with the code at https://github.com/ChampionRecLuse/ivvi. |
| Open Datasets | Yes | We construct a semi-synthetic dataset based on the MovieLens 1M dataset (Harper and Konstan, 2015) |
| Dataset Splits | No | The paper describes generating data within a semi-synthetic setup (e.g., "We generate 80 episodes", "each episode has a horizon H = 1000", "We generate 200 episodes, each with horizon H = 100") rather than using predefined train/test/validation splits for a fixed dataset. The experiments evaluate the learned policy, but no explicit splits of the generated data for model training vs. testing/validation are mentioned in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions 'computation time' in various tables. |
| Software Dependencies | No | The paper mentions using the SPEDE algorithm (Ren et al., 2022) but does not specify any software libraries, frameworks, or programming language versions used for its implementation. |
| Experiment Setup | Yes | For the 5-dimensional and 10-dimensional cases, the stepsizes at the t-th iteration are η^θ_t = 0.05 + 1/(18+t) and η^ω_t = 1/(18+t). For the 20-dimensional case, the stepsizes are η^θ_t = 0.06 + 1/(18+t) and η^ω_t = 1/(18+t). The estimated transition dynamic can be expressed as W_T φ(x, a), where W_T is the last iterate. In Phase 2, we use the SPEDE algorithm (Ren et al., 2022), a planning algorithm based on the observation that, under Gaussian noise, the linear spectral feature of the corresponding Markov transition operator can be obtained in closed form. Moreover, SPEDE is suitable for continuous state and action spaces. Note that in the original implementation, SPEDE samples transition functions from its posterior distribution at each episode, but in our case we do not need such sampling. We compare our method with a natural baseline: ordinary regression. For a fair comparison, we perform ordinary regression using the feature map φ(x, a). Let J_k = {(x^k_h, a^k_h, x^k_{h+1})}_h be the trajectory containing the samples of the k-th episode. The baseline estimator for the transition model, W_baseline, is defined as W_baseline := argmin_W { Σ_h ‖W φ(x_h, a_h) − x_{h+1}‖²_2 }. |
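Phase 1 of Algorithm 1 is a stochastic primal-dual iteration on the instrument features. The following NumPy sketch is not the authors' implementation: the linear ground-truth model, dimensions, noise levels, instrument strength, and iteration count are all hypothetical, and the update signs are as reconstructed from the pseudocode. It only illustrates that the quoted stepsize schedules (η^θ_t = 0.05 + 1/(18+t), η^ω_t = 1/(18+t)) drive the primal iterate W toward the true transition matrix when the instrument is informative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 5000  # hypothetical dimensions and iteration count

# Toy linear model with an instrument: phi(x_t, a_t) is driven by the
# instrument feature psi(z_t), and x'_t = W* phi_t + noise.
W_star = rng.normal(size=(d, d))

# Stepsize schedules quoted in the report (5-/10-dimensional case).
eta_theta = lambda t: 0.05 + 1.0 / (18 + t)
eta_omega = lambda t: 1.0 / (18 + t)

W = np.zeros((d, d))  # primal iterate: transition-model estimate
K = np.zeros((d, d))  # dual iterate

for t in range(T):
    psi = rng.normal(size=d)                  # psi(z_t)
    phi = psi + 0.1 * rng.normal(size=d)      # phi(x_t, a_t), instrument-driven
    x_next = W_star @ phi + 0.1 * rng.normal(size=d)

    # Phase-1 primal-dual updates (signs reconstructed from the pseudocode):
    # W descends, K ascends, on the saddle objective behind Eq. 3.7.
    W = W - eta_theta(t) * (-np.outer(K @ psi, phi))
    K = K + eta_omega(t) * (-np.outer(K @ psi, psi)
                            + np.outer(x_next, psi)
                            - np.outer(W @ phi, psi))

rel_err = np.linalg.norm(W - W_star) / np.linalg.norm(W_star)
print(round(rel_err, 3))
```

At the stationary point of these updates, K tracks the regression of the residual x' − Wφ on ψ, so W is pushed toward the IV moment condition E[(x' − Wφ)ψ^⊤] = 0 rather than the ordinary least-squares solution.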
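The ordinary-regression baseline quoted above, W_baseline := argmin_W Σ_h ‖W φ(x_h, a_h) − x_{h+1}‖²_2, is a standard least-squares problem with a closed-form solution. A minimal sketch under a hypothetical data-generating process (the features, dimensions, and noise level are invented for illustration; with unobserved confounding this estimator would be biased, which is why the paper treats it only as a baseline):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 500  # hypothetical dimensions and sample count

W_true = rng.normal(size=(d, d))

# Hypothetical trajectory: rows of Phi are phi(x_h, a_h); X_next stacks x_{h+1}.
Phi = rng.normal(size=(n, d))
X_next = Phi @ W_true.T + 0.1 * rng.normal(size=(n, d))

# W_baseline = argmin_W sum_h || W phi(x_h, a_h) - x_{h+1} ||_2^2,
# solved in closed form via least squares on the stacked system.
W_baseline = np.linalg.lstsq(Phi, X_next, rcond=None)[0].T

print(round(np.linalg.norm(W_baseline - W_true), 4))
```

Without confounding, as in this sketch, the least-squares estimate recovers W_true up to noise; the paper's experiments contrast this with the IV-aided estimator when the regression is confounded.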