A General Framework for Off-Policy Learning with Partially-Observed Reward

Authors: Rikiya Takehi, Masahiro Asami, Kosuke Kawakami, Yuta Saito

ICLR 2025

Reproducibility Variable — Result — LLM Response
Research Type: Experimental — "Along with statistical analysis of our proposed methods, empirical evaluations on both synthetic and real-world data show that HyPeR outperforms existing methods in various scenarios. We also conduct comprehensive experiments on both synthetic and real-world datasets, where the HyPeR algorithm outperforms a range of existing methods in terms of optimizing both the target reward objective and the combined objective of the target and secondary rewards."
Researcher Affiliation: Collaboration — Rikiya Takehi (Waseda University, EMAIL); Masahiro Asami (HAKUHODO Technologies Inc., EMAIL); Kosuke Kawakami (HAKUHODO Technologies Inc., EMAIL); Yuta Saito (Cornell University, EMAIL)
Pseudocode: No — The paper describes methods in prose, focusing on mathematical formulations and derivations. There are no explicitly labeled pseudocode blocks or algorithms with structured steps.
Open Source Code: No — The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. The only link provided (https://kuairec.com/) points to a dataset.
Open Datasets: Yes — "To assess the real-world applicability of HyPeR, we now evaluate it on the KuaiRec dataset (Gao et al., 2022). This is a publicly available fully-observed user-item matrix collected on a short-video platform, where 1,411 users have viewed all 3,317 videos and left watch duration as feedback. This unique feature of KuaiRec enables performing an OPL experiment without synthesizing the reward function (few other public datasets retain this desirable feature)."
Dataset Splits: Yes — "The most straightforward method is via splitting the dataset D into training (Dtr) and validation (Dval) sets. Then, we train a policy using Dtr (i.e., πθ(·; γ, Dtr)) and estimate the value using Dval (i.e., V̂(πθ; β, Dval)). However, an issue with this naive procedure is that the policy πθ(·; γ, Dtr) is trained on a smaller dataset Dtr instead of the full dataset D that is used in real training." ... For the KuaiRec dataset: "We randomly choose 988 users (70%) for training and 423 users (30%) for evaluation."
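The naive tuning procedure quoted above can be sketched as follows. This is a hypothetical illustration, not the authors' code: `train_policy` and `estimate_value` are placeholder names, and the data dictionary `D` stands in for the logged dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder logged dataset D: contexts x and logged actions a.
n = 2000
D = {"x": rng.standard_normal((n, 10)), "a": rng.integers(0, 100, size=n)}

# Split D into training (Dtr) and validation (Dval) sets,
# e.g. a 70/30 split as used for KuaiRec users.
idx = rng.permutation(n)
n_tr = int(0.7 * n)
D_tr = {k: v[idx[:n_tr]] for k, v in D.items()}
D_val = {k: v[idx[n_tr:]] for k, v in D.items()}

# Train the policy on the smaller Dtr, then estimate its value on Dval
# (placeholder calls; the estimators themselves are defined in the paper):
# pi_theta = train_policy(D_tr, gamma)
# V_hat = estimate_value(pi_theta, beta, D_val)
```

As the quoted passage notes, the drawback of this sketch is that the tuned policy is fit on the smaller Dtr rather than the full dataset D used in real training.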
Hardware Specification: No — The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU types, or cloud computing instance specifications.
Software Dependencies: No — The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions) that would be needed to replicate the experiment.
Experiment Setup: Yes — "To create synthetic data, we sample 10-dimensional context vectors x from a standard normal distribution, and the sample size is fixed at n = 2000 by default. ... We set this at ϕ = 2.0 by default. In the main text, we set the default to σs = 0.5. ... We use λ = 0.7 as a default setting throughout the synthetic experiment. The target reward is sampled from a normal distribution as r ∼ N(q(x, a, f(x, a)), σ_r²) with default σr = 0.5 (results with other σr values are provided in Appendix C.2). The target reward observation probability is an experimental parameter and is set to p(o|x) = 0.2 for all x by default. The true weight β, which is used to define the combined policy value Vc(π; β), is set to β = 0.3 and it is also one of the experimental parameters." ... For the KuaiRec dataset: "We set the target reward observation probability to p(o|x) = 0.2 for all x, training data size to n = 1000, and weight β = 0.3, as default experimental parameters. The actions are chosen randomly with size |A| = 100, and Appendix C.3 shows results with varying numbers of actions. We define the logging policy as π0(a|x) = softmax(ϕ·(x⊤ M_{X,A} a + x⊤ θ_x + a⊤ θ_a)), with ϕ = 2.0, and run 100 simulations with different train-test splits."
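The synthetic setup quoted above can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' generator: the expected-reward function q and the parameters M_{X,A}, θ_x, θ_a are placeholders drawn randomly, and the number of actions is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Defaults quoted from the text.
n, d, n_actions = 2000, 10, 10       # sample size, context dim (actions: placeholder)
phi, sigma_r, p_obs = 2.0, 0.5, 0.2  # inverse temperature, reward noise, p(o|x)

x = rng.standard_normal((n, d))           # 10-dim contexts from a standard normal
M = rng.standard_normal((d, n_actions))   # placeholder for M_{X,A}
theta_x = rng.standard_normal(d)
theta_a = rng.standard_normal(n_actions)

# Logging policy: pi_0(a|x) = softmax(phi * (x^T M a + x^T theta_x + a^T theta_a)).
logits = phi * (x @ M + (x @ theta_x)[:, None] + theta_a[None, :])
pi_0 = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_0 /= pi_0.sum(axis=1, keepdims=True)

a = np.array([rng.choice(n_actions, p=p) for p in pi_0])  # logged actions

q = (x @ M)[np.arange(n), a]      # placeholder for q(x, a, f(x, a))
r = rng.normal(q, sigma_r)        # target reward r ~ N(q, sigma_r^2)
o = rng.random(n) < p_obs         # target reward observed with prob. p(o|x) = 0.2
r_obs = np.where(o, r, np.nan)    # partially-observed target reward
```

The masking step in the last two lines is what makes the target reward "partially observed": only a p(o|x) fraction of the logged rewards are visible to the learner.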