A General Framework for Off-Policy Learning with Partially-Observed Reward

Authors: Rikiya Takehi, Masahiro Asami, Kosuke Kawakami, Yuta Saito

ICLR 2025

Reproducibility Variable — Result — LLM Response
Research Type: Experimental — "Along with statistical analysis of our proposed methods, empirical evaluations on both synthetic and real-world data show that HyPeR outperforms existing methods in various scenarios. We also conduct comprehensive experiments on both synthetic and real-world datasets, where the HyPeR algorithm outperforms a range of existing methods in terms of optimizing both the target reward objective and the combined objective of the target and secondary rewards."
Researcher Affiliation: Collaboration — Rikiya Takehi (Waseda University, EMAIL); Masahiro Asami (HAKUHODO Technologies Inc., EMAIL); Kosuke Kawakami (HAKUHODO Technologies Inc., EMAIL); Yuta Saito (Cornell University, EMAIL)
Pseudocode: No — The paper describes methods in prose, focusing on mathematical formulations and derivations. There are no explicitly labeled pseudocode blocks or algorithms with structured steps.
Open Source Code: No — The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. The only link provided (https://kuairec.com/) points to a dataset.
Open Datasets: Yes — "To assess the real-world applicability of HyPeR, we now evaluate it on the KuaiRec dataset (Gao et al., 2022). This is a publicly available fully-observed user-item matrix collected on a short-video platform, where 1,411 users have viewed all 3,317 videos and left watch duration as feedback. This unique feature of KuaiRec enables performing an OPL experiment without synthesizing the reward function (few other public datasets retain this desirable feature)."
Dataset Splits: Yes — "The most straightforward method is via splitting the dataset D into training (Dtr) and validation (Dval) sets. Then, we train a policy using Dtr (i.e., πθ(·; γ, Dtr)) and estimate the value using Dval (i.e., V̂(πθ; β, Dval)). However, an issue with this naive procedure is that the policy πθ(·; γ, Dtr) is trained on a smaller dataset Dtr instead of the full dataset D that is used in real training." ... For the KuaiRec dataset: "We randomly choose 988 users (70%) for training and 423 users (30%) for evaluation."
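The naive tuning procedure quoted above can be sketched as follows. This is a hypothetical illustration, not the authors' code: `train_policy` and `estimate_value` are placeholder names, and the data dictionary `D` stands in for the logged dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder logged dataset D: contexts x and logged actions a.
n = 2000
D = {"x": rng.standard_normal((n, 10)), "a": rng.integers(0, 100, size=n)}

# Split D into training (Dtr) and validation (Dval) sets,
# e.g. a 70/30 split as used for KuaiRec users.
idx = rng.permutation(n)
n_tr = int(0.7 * n)
D_tr = {k: v[idx[:n_tr]] for k, v in D.items()}
D_val = {k: v[idx[n_tr:]] for k, v in D.items()}

# Train the policy on the smaller Dtr, then estimate its value on Dval
# (placeholder calls; the estimators themselves are defined in the paper):
# pi_theta = train_policy(D_tr, gamma)
# V_hat = estimate_value(pi_theta, beta, D_val)
```

As the quoted passage notes, the drawback of this sketch is that the tuned policy is fit on the smaller Dtr rather than the full dataset D used in real training.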
Hardware Specification: No — The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU types, or cloud computing instance specifications.
Software Dependencies: No — The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions) that would be needed to replicate the experiment.
Experiment Setup: Yes — "To create synthetic data, we sample 10-dimensional context vectors x from a standard normal distribution, and the sample size is fixed at n = 2000 by default. ... We set this at ϕ = 2.0 by default. In the main text, we set the default to σs = 0.5. ... We use λ = 0.7 as a default setting throughout the synthetic experiment. The target reward is sampled from a normal distribution as r ∼ N(q(x, a, f(x, a)), σ_r²) with default σr = 0.5 (results with other σr values are provided in Appendix C.2). The target reward observation probability is an experimental parameter and is set to p(o|x) = 0.2 for all x by default. The true weight β, which is used to define the combined policy value Vc(π; β), is set to β = 0.3 and it is also one of the experimental parameters." ... For the KuaiRec dataset: "We set the target reward observation probability to p(o|x) = 0.2 for all x, training data size to n = 1000, and weight β = 0.3, as default experimental parameters. The actions are chosen randomly with size |A| = 100, and Appendix C.3 shows results with varying numbers of actions. We define the logging policy as π0(a|x) = softmax(ϕ·(x⊤ M_{X,A} a + x⊤ θ_x + a⊤ θ_a)), with ϕ = 2.0, and run 100 simulations with different train-test splits."
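The synthetic setup quoted above can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' generator: the expected-reward function q and the parameters M_{X,A}, θ_x, θ_a are placeholders drawn randomly, and the number of actions is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Defaults quoted from the text.
n, d, n_actions = 2000, 10, 10       # sample size, context dim (actions: placeholder)
phi, sigma_r, p_obs = 2.0, 0.5, 0.2  # inverse temperature, reward noise, p(o|x)

x = rng.standard_normal((n, d))           # 10-dim contexts from a standard normal
M = rng.standard_normal((d, n_actions))   # placeholder for M_{X,A}
theta_x = rng.standard_normal(d)
theta_a = rng.standard_normal(n_actions)

# Logging policy: pi_0(a|x) = softmax(phi * (x^T M a + x^T theta_x + a^T theta_a)).
logits = phi * (x @ M + (x @ theta_x)[:, None] + theta_a[None, :])
pi_0 = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_0 /= pi_0.sum(axis=1, keepdims=True)

a = np.array([rng.choice(n_actions, p=p) for p in pi_0])  # logged actions

q = (x @ M)[np.arange(n), a]      # placeholder for q(x, a, f(x, a))
r = rng.normal(q, sigma_r)        # target reward r ~ N(q, sigma_r^2)
o = rng.random(n) < p_obs         # target reward observed with prob. p(o|x) = 0.2
r_obs = np.where(o, r, np.nan)    # partially-observed target reward
```

The masking step in the last two lines is what makes the target reward "partially observed": only a p(o|x) fraction of the logged rewards are visible to the learner.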