POTEC: Off-Policy Contextual Bandits for Large Action Spaces via Policy Decomposition

Authors: Yuta Saito, Jihan Yao, Thorsten Joachims

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "4 EMPIRICAL EVALUATION: We first evaluate POTEC on synthetic data with ground-truth cluster information to compare the effectiveness of POTEC w/ and w/o true cluster information and w/ and w/o pairwise regression. We then assess the real-world applicability of POTEC on a public recommendation dataset."
Researcher Affiliation | Academia | Yuta Saito (Cornell University), Jihan Yao (University of Washington), Thorsten Joachims (Cornell University)
Pseudocode | Yes | "Algorithm 1: The POTEC Algorithm. Input: logged bandit data D, conventionally trained regression model q̂(x, a). Output: 1st-stage (policy-based) policy π_θ^1st and 2nd-stage (regression-based) policy π_ψ^2nd"
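The decomposition implied by Algorithm 1's outputs can be sketched minimally: the 1st-stage policy chooses a cluster, and the 2nd-stage policy chooses an action within that cluster greedily from a regression model's scores. In this sketch the linear scorers `theta` and `f_hat`, the modulo action-to-cluster map, and all sizes are hypothetical placeholders, not the paper's trained models or clustering.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_clusters, dim = 12, 3, 5

# Hypothetical fixed action-to-cluster map phi(a); POTEC assumes such a
# clustering is given (true clusters in the synthetic runs, learned otherwise).
cluster_of = np.arange(n_actions) % n_clusters

# Hypothetical linear scorers standing in for the learned models:
# theta parameterizes the 1st-stage (policy-based) cluster policy,
# f_hat stands in for the 2nd-stage (regression-based) action scores.
theta = rng.normal(size=(n_clusters, dim))
f_hat = rng.normal(size=(n_actions, dim))

def potec_policy(x):
    """Overall policy: pi(a|x) = sum_c pi1st(c|x) * pi2nd(a|x, c)."""
    # 1st stage: softmax policy over clusters.
    logits = theta @ x
    pi1st = np.exp(logits - logits.max())
    pi1st /= pi1st.sum()
    # 2nd stage: deterministic (greedy) choice within each cluster
    # based on the regression model's scores.
    scores = f_hat @ x
    pi = np.zeros(n_actions)
    for c in range(n_clusters):
        members = np.flatnonzero(cluster_of == c)
        best = members[np.argmax(scores[members])]
        pi[best] += pi1st[c]
    return pi

x = rng.normal(size=dim)
pi = potec_policy(x)
```

Note the resulting distribution places mass on at most one action per cluster, which is what makes the 2nd stage cheap even when the action space is large.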
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described in this paper is publicly available.
Open Datasets | Yes | "We now evaluate it on the KuaiRec dataset (Gao et al., 2022), a publicly available recommendation dataset collected on a short video platform... In addition to synthetic and real-world recommendation data, we performed OPL experiments on two extreme classification datasets provided by Bhatia et al. (2016)."
Dataset Splits | Yes | "The logged data we can use for performing OPL takes the form D := {(x_i, a_i, r_i)}_{i=1}^n, which contains n independent observations drawn from the logging policy π_0..."

Table 4: Dataset Statistics
  Dataset      n_train  n_test  |A|
  EUR-Lex 4K   15,449   3,865   3,956
  Wiki10-31K   14,146   6,616   30,938
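The quoted definition of the logged data D can be made concrete with a small synthetic sketch. The softmax logging policy, its weights `w0`, and the Bernoulli reward model below are hypothetical stand-ins for illustration, not the paper's actual data-generating process.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, n_actions = 1000, 5, 10

# Hypothetical softmax logging policy pi_0 with random weights.
w0 = rng.normal(size=(n_actions, dim))

def pi_0(x):
    logits = w0 @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

# D = {(x_i, a_i, r_i)}_{i=1}^n: contexts drawn i.i.d., actions sampled
# from pi_0(.|x), binary rewards from a toy sigmoid model (illustrative).
D = []
for _ in range(n):
    x = rng.normal(size=dim)
    a = rng.choice(n_actions, p=pi_0(x))
    r = rng.binomial(1, 1.0 / (1.0 + np.exp(-x @ w0[a])))
    D.append((x, a, r))
```

Each tuple records only the reward of the action the logging policy actually took, which is what makes the learning problem off-policy.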
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using 'Adam' as an optimizer and 'scikit-learn' for clustering, but does not provide specific version numbers for these software dependencies or any other libraries.
Experiment Setup | Yes | "We tuned the weight decay hyperparameter, learning rate, batch size, and the number of irrelevant actions for variance reduction for the baseline methods (i.e., IPS-PG and DR-PG) using the test policy value, while we use a fixed set of hyperparameters for POTEC as shown in Table 3... For all methods, we used Adam (Kingma & Ba, 2014) as the optimizer and used neural networks with 3 hidden layers to parameterize the policy."
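The "neural networks with 3 hidden layers" parameterization in the quote can be sketched as a plain softmax policy network; the layer width, initialization scale, and ReLU activations below are assumptions (the paper's actual hyperparameters are in its Table 3), and training with Adam is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, n_actions = 5, 32, 10

# Three hidden layers plus an output layer over actions; the width 32
# is a hypothetical choice, not taken from the paper.
sizes = [dim, hidden, hidden, hidden, n_actions]
params = [(rng.normal(scale=0.1, size=(m, k)), np.zeros(m))
          for k, m in zip(sizes[:-1], sizes[1:])]

def policy(x):
    """Softmax policy pi_theta(a|x) from a 3-hidden-layer MLP."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)  # ReLU hidden layers
    W, b = params[-1]
    logits = W @ h + b
    p = np.exp(logits - logits.max())
    return p / p.sum()

pi = policy(rng.normal(size=dim))
```

In an actual run these parameters would be updated with Adam (Kingma & Ba, 2014) to maximize an off-policy estimate of the policy value, as the quoted setup describes.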