Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning
Authors: Luofeng Liao, Zuyue Fu, Zhuoran Yang, Yixin Wang, Dingli Ma, Mladen Kolar, Zhaoran Wang
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present numerical experiments on the parametric and nonparametric cases described in Section 4. The goal of this section is to verify that Algorithm 1 successfully identifies the transition model based on sequential observational data and recovers the optimal policy by planning with the estimated transition model. Importantly, we aim to quantify how the strength of the instrument affects the estimation of causal quantities in the sequential setting. All experiments in this section can be reproduced with the code at https://github.com/ChampionRecLuse/ivvi. |
| Researcher Affiliation | Academia | Luofeng Liao, Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA; Zuyue Fu, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA; Zhuoran Yang, Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA; Yixin Wang, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA; Dingli Ma, Department of Information Systems and Operations Management, Michael G. Foster School of Business, University of Washington, Seattle, WA 98195, USA; Mladen Kolar, Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA; Zhaoran Wang, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA |
| Pseudocode | Yes | Algorithm 1 IV-aided Value Iteration (IVVI). 1: Input: reward functions {r_h}_{h=1}^H, feature maps φ and ψ, iterations T, stepsizes {η^θ_t, η^ω_t}_{t=1}^T, initial estimates K_0 and W_0, variance σ², samples {(x_t, a_t, z_t, x'_t)}_{t=0}^{T−1} in Assumption A.2. 2: Phase 1 (estimation of W^sad in Eq. 3.7). 3: for t = 0, 1, ..., T−1 do 4: φ_t ← φ(x_t, a_t), ψ_t ← ψ(z_t). 5: W_{t+1} ← W_t − η^θ_t (−K_t ψ_t φ_t^⊤), K_{t+1} ← K_t + η^ω_t (−K_t ψ_t ψ_t^⊤ + x'_t ψ_t^⊤ − W_t φ_t ψ_t^⊤). 6: end for. 7: Phase 2 (value iteration). 8: V̂_{H+1}(·) ← 0, Ŵ ← W_T. 9: for h = H, H−1, ..., 1 do 10: Q̂_h(·, ·) ← r_h(·, ·) + ∫_S V̂_{h+1}(x') P_Ŵ(dx' \| ·, ·). 11: π̂_h(·) ← argmax_a Q̂_h(·, a), V̂_h(·) ← max_a Q̂_h(·, a). 12: end for. 13: Output: π̂ = {π̂_h}_{h=1}^H. |
| Open Source Code | Yes | All experiments in this section can be reproduced with the code at https://github.com/ChampionRecLuse/ivvi. |
| Open Datasets | Yes | We construct a semi-synthetic dataset based on the MovieLens 1M dataset (Harper and Konstan, 2015) |
| Dataset Splits | No | The paper describes generating data within a semi-synthetic setup (e.g., "We generate 80 episodes", "each episode has a horizon H = 1000", "We generate 200 episodes, each with horizon H = 100") rather than using predefined train/test/validation splits for a fixed dataset. The experiments evaluate the learned policy, but no explicit splits of the generated data for model training vs. testing/validation are mentioned in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions 'computation time' in various tables. |
| Software Dependencies | No | The paper mentions using the SPEDE algorithm (Ren et al., 2022) but does not specify any software libraries, frameworks, or programming language versions used for its implementation. |
| Experiment Setup | Yes | For the 5-dimensional and 10-dimensional cases, the stepsizes at the t-th iteration are η^θ_t = 0.05 + 1/(18+t) and η^ω_t = 1/(18+t). For the 20-dimensional case, the stepsizes are η^θ_t = 0.06 + 1/(18+t) and η^ω_t = 1/(18+t). The estimated transition dynamic can be expressed as W_T φ(x, a), where W_T is the last iterate. In Phase 2, we use the SPEDE algorithm (Ren et al., 2022), a planning algorithm based on the observation that, under Gaussian noise, the linear spectral feature of the corresponding Markov transition operator can be obtained in closed form. Moreover, SPEDE is suitable for continuous state and action spaces. Note that in the original implementation, SPEDE samples transition functions from its posterior distribution at each episode, but in our case we do not need such sampling. We compare our method with a natural baseline: ordinary regression. For a fair comparison, we perform ordinary regression using the feature map φ(x, a). Let J_k = {(x^k_h, a^k_h, x^k_{h+1})}_h be the trajectory containing the samples of the k-th episode. The baseline estimator for the transition model, W_baseline, is defined as W_baseline := argmin_W { Σ_h ‖W φ(x_h, a_h) − x_{h+1}‖²_2 }. |
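Phase 1 of Algorithm 1 is a stochastic primal-dual iteration on the instrument features. The following NumPy sketch is not the authors' implementation: the linear ground-truth model, dimensions, noise levels, instrument strength, and iteration count are all hypothetical, and the update signs are as reconstructed from the pseudocode. It only illustrates that the quoted stepsize schedules (η^θ_t = 0.05 + 1/(18+t), η^ω_t = 1/(18+t)) drive the primal iterate W toward the true transition matrix when the instrument is informative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 5000  # hypothetical dimensions and iteration count

# Toy linear model with an instrument: phi(x_t, a_t) is driven by the
# instrument feature psi(z_t), and x'_t = W* phi_t + noise.
W_star = rng.normal(size=(d, d))

# Stepsize schedules quoted in the report (5-/10-dimensional case).
eta_theta = lambda t: 0.05 + 1.0 / (18 + t)
eta_omega = lambda t: 1.0 / (18 + t)

W = np.zeros((d, d))  # primal iterate: transition-model estimate
K = np.zeros((d, d))  # dual iterate

for t in range(T):
    psi = rng.normal(size=d)                  # psi(z_t)
    phi = psi + 0.1 * rng.normal(size=d)      # phi(x_t, a_t), instrument-driven
    x_next = W_star @ phi + 0.1 * rng.normal(size=d)

    # Phase-1 primal-dual updates (signs reconstructed from the pseudocode):
    # W descends, K ascends, on the saddle objective behind Eq. 3.7.
    W = W - eta_theta(t) * (-np.outer(K @ psi, phi))
    K = K + eta_omega(t) * (-np.outer(K @ psi, psi)
                            + np.outer(x_next, psi)
                            - np.outer(W @ phi, psi))

rel_err = np.linalg.norm(W - W_star) / np.linalg.norm(W_star)
print(round(rel_err, 3))
```

At the stationary point of these updates, K tracks the regression of the residual x' − Wφ on ψ, so W is pushed toward the IV moment condition E[(x' − Wφ)ψ^⊤] = 0 rather than the ordinary least-squares solution.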
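The ordinary-regression baseline quoted above, W_baseline := argmin_W Σ_h ‖W φ(x_h, a_h) − x_{h+1}‖²_2, is a standard least-squares problem with a closed-form solution. A minimal sketch under a hypothetical data-generating process (the features, dimensions, and noise level are invented for illustration; with unobserved confounding this estimator would be biased, which is why the paper treats it only as a baseline):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 500  # hypothetical dimensions and sample count

W_true = rng.normal(size=(d, d))

# Hypothetical trajectory: rows of Phi are phi(x_h, a_h); X_next stacks x_{h+1}.
Phi = rng.normal(size=(n, d))
X_next = Phi @ W_true.T + 0.1 * rng.normal(size=(n, d))

# W_baseline = argmin_W sum_h || W phi(x_h, a_h) - x_{h+1} ||_2^2,
# solved in closed form via least squares on the stacked system.
W_baseline = np.linalg.lstsq(Phi, X_next, rcond=None)[0].T

print(round(np.linalg.norm(W_baseline - W_true), 4))
```

Without confounding, as in this sketch, the least-squares estimate recovers W_true up to noise; the paper's experiments contrast this with the IV-aided estimator when the regression is confounded.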