Cross-Domain Off-Policy Evaluation and Learning for Contextual Bandits

Authors: Yuta Natsubori, Masataka Ushiku, Yuta Saito

ICLR 2025

Reproducibility assessment (variable, result, and supporting excerpt from the LLM response):
Research Type: Experimental
  "4 EMPIRICAL ANALYSIS. This section empirically demonstrates the advantages of COPE and COPE-PG against existing ideas on a real-world public dataset called KuaiRec (Gao et al., 2022), collected from a recommendation system on a video-sharing app. ... The following reports and discusses the MSE, squared bias, and variance of the OPE estimators computed over 200 sets of logged data, each replicated with different seeds."
Researcher Affiliation: Collaboration
  "Yuta Natsubori, Hakuhodo DY Holdings, Inc. (EMAIL); Masataka Ushiku, Hakuhodo DY Holdings, Inc. (EMAIL); Yuta Saito, Cornell University (EMAIL)"
Pseudocode: No
  The paper describes its methods mathematically and in natural language, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No
  "Our implementation in the experiments relies on one of the most standard methods, unconstrained Least-Squares Importance Fitting (uLSIF), to perform density ratio estimation, proposed in (Kanamori et al., 2012; Sugiyama et al., 2012). ... https://github.com/hoxo-m/densratio_py, which we relied on in our experiments, is one of the well-known public implementations of the method."
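The uLSIF step quoted above can be sketched in plain NumPy. This is an illustrative re-implementation under standard assumptions (Gaussian kernel basis, fixed bandwidth and regularizer), not the paper's code; the authors relied on the densratio_py package, and uLSIF normally selects sigma and lambda by cross-validation. All names below (ulsif_fit, phi) are hypothetical.

```python
# Minimal uLSIF sketch (unconstrained Least-Squares Importance Fitting):
# estimate w(x) ~= p_nu(x) / p_de(x) by least-squares fitting of a
# kernel model. Hyperparameters are fixed for brevity (an assumption);
# the real method tunes them by cross-validation.
import numpy as np

def ulsif_fit(x_nu, x_de, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Fit a density-ratio model with Gaussian kernels centered on a
    random subset of numerator samples; returns a callable estimator."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_nu), min(n_centers, len(x_nu)), replace=False)
    centers = x_nu[idx]

    def phi(x):
        # Gaussian kernel design matrix, shape (len(x), len(centers))
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    Phi_de, Phi_nu = phi(x_de), phi(x_nu)
    H = Phi_de.T @ Phi_de / len(x_de)  # empirical E_de[phi phi^T]
    h = Phi_nu.mean(axis=0)            # empirical E_nu[phi]
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # Clip negative outputs, since a density ratio is nonnegative.
    return lambda x: np.maximum(phi(x) @ theta, 0.0)

# Tiny usage check: ratio between two shifted 1-D Gaussians.
rng = np.random.default_rng(1)
x_nu = rng.normal(0.5, 1.0, size=(500, 1))
x_de = rng.normal(0.0, 1.0, size=(500, 1))
w_hat = ulsif_fit(x_nu, x_de)
ratios = w_hat(x_de)
```

The closed-form solve is what makes uLSIF attractive here: the least-squares objective reduces to one regularized linear system per choice of hyperparameters.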
Open Datasets: Yes
  "This section empirically demonstrates the advantages of COPE and COPE-PG against existing ideas on a real-world public dataset called KuaiRec (Gao et al., 2022), collected from a recommendation system on a video-sharing app."
Dataset Splits: No
  "The small matrix of the dataset consists of 1,411 users (denoted as u ∈ U), 3,327 items, and 4,676,570 interactions, with a density of 99.6%, which enables OPE/L experiments without synthetic reward functions. ... Iterating this procedure n_k times in each domain generates D_k with n_k independent copies of (u_k, x_{u_k}, a_k, r_k). ... The following reports and discusses the MSE, squared bias, and variance of the OPE estimators computed over 200 sets of logged data, each replicated with different seeds."
Hardware Specification: No
  The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies: No
  "We use Random Forest (Breiman, 2001) implemented in scikit-learn (Pedregosa et al., 2011) along with 3-fold cross-fitting (Newey & Robins, 2018) to obtain q̂_T(x, a) for DR and DM, and q̂(x, a) for DR-ALL, DM-ALL, and COPE. In addition, for COPE, we use |ϕ(T)| = 4, where we define the target cluster ϕ(T) by the set of domains for which the difference in the empirical average of the rewards, |r̄_k − r̄_T|, is small. ... https://github.com/hoxo-m/densratio_py, which we relied on in our experiments, is one of the well-known public implementations of the method."
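The scikit-learn plus 3-fold cross-fitting recipe quoted above can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's pipeline: the point is only that each sample's q̂(x, a) prediction comes from a forest trained on the other two folds, so the reward model is never evaluated on its own training data. The data-generating line and all variable names are assumptions.

```python
# Sketch of 3-fold cross-fitting for a reward model q_hat(x, a) using
# RandomForestRegressor, on synthetic data (an assumption for illustration).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                    # contexts x
a = rng.integers(0, 3, size=300)                 # discrete action ids
r = X[:, 0] + a + rng.normal(0, 0.1, size=300)   # synthetic rewards

Xa = np.column_stack([X, a])   # regress r on the (x, a) pair jointly
q_hat = np.empty(len(r))
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(Xa):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(Xa[train_idx], r[train_idx])          # fit on other folds
    q_hat[test_idx] = model.predict(Xa[test_idx])   # predict out-of-fold
```

Cross-fitting of this kind is the standard way to keep the DR/DM regression estimates independent of the data they are plugged into.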
Experiment Setup: Yes
  "We randomly select 30 actions that have at least one interaction with all users for our experiments. We use the user features and watch ratio recorded in the original data as the context x_u and expected reward q(x_u, a), respectively. ... where ϵ ∈ [0, 1] controls the quality of π, and we set ϵ = 0.2 as default. We sample the reward r_k from a normal distribution with mean q(x_u, a) and standard deviation σ = 1. ... Note that we set K = 10 for the number of domains and n_k = 100 for the logged data size of each domain as the default experimental parameters. ... We use Random Forest (Breiman, 2001) implemented in scikit-learn (Pedregosa et al., 2011) along with 3-fold cross-fitting (Newey & Robins, 2018) to obtain q̂_T(x, a) for DR and DM, and q̂(x, a) for DR-ALL, DM-ALL, and COPE. In addition, for COPE, we use |ϕ(T)| = 4."
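The evaluation protocol in the excerpt (200 seeded replications, n_k = 100, σ = 1, with MSE decomposed into squared bias and variance) can be sketched as below. The estimator here is a deliberate stand-in, a plain sample mean of noisy rewards, not COPE; the sketch only shows the replication loop and the exact decomposition MSE = bias² + variance over the replications.

```python
# Sketch of the repeated-seed evaluation: run an estimator over many
# seeded replications, then decompose its MSE into squared bias + variance.
# The placeholder estimator and true_value are assumptions for illustration.
import numpy as np

true_value = 1.0                 # stand-in for the ground-truth policy value
n_seeds, n_samples = 200, 100    # 200 replications, n_k = 100 as quoted

estimates = np.empty(n_seeds)
for seed in range(n_seeds):
    rng = np.random.default_rng(seed)
    rewards = rng.normal(true_value, 1.0, size=n_samples)  # sigma = 1
    estimates[seed] = rewards.mean()    # placeholder OPE estimator

mse = np.mean((estimates - true_value) ** 2)
sq_bias = (estimates.mean() - true_value) ** 2
variance = estimates.var()              # population variance (ddof=0)
# With ddof=0 the decomposition is exact: mse == sq_bias + variance.
```

Reporting all three quantities, as the paper does, separates estimators that are noisy from those that are systematically off-target.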