POTEC: Off-Policy Contextual Bandits for Large Action Spaces via Policy Decomposition
Authors: Yuta Saito, Jihan Yao, Thorsten Joachims
ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4, Empirical Evaluation: We first evaluate POTEC on synthetic data with ground-truth cluster information to compare the effectiveness of POTEC w/ and w/o true cluster information and w/ and w/o pairwise regression. We then assess the real-world applicability of POTEC on a public recommendation dataset. |
| Researcher Affiliation | Academia | Yuta Saito (Cornell University), Jihan Yao (University of Washington), Thorsten Joachims (Cornell University) |
| Pseudocode | Yes | Algorithm 1: The POTEC Algorithm. Input: logged bandit data $\mathcal{D}$, conventionally trained regression model $\hat{q}(x, a)$. Output: 1st-stage (policy-based) policy $\pi_{\theta}^{\text{1st}}$ and 2nd-stage (regression-based) policy $\pi_{\psi}^{\text{2nd}}$ |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described in this paper is publicly available. |
| Open Datasets | Yes | We now evaluate it on the KuaiRec dataset (Gao et al., 2022), a publicly available recommendation dataset collected on a short video platform... In addition to synthetic and real-world recommendation data, we performed OPL experiments on two extreme classification datasets provided by Bhatia et al. (2016). |
| Dataset Splits | Yes | The logged data we can use for performing OPL takes the form $\mathcal{D} := \{(x_i, a_i, r_i)\}_{i=1}^{n}$, which contains $n$ independent observations drawn from the logging policy $\pi_0$... Table 4 (Dataset Statistics): EUR-Lex 4K has $n_{\text{train}}$ = 15,449, $n_{\text{test}}$ = 3,865, and $\lvert\mathcal{A}\rvert$ = 3,956; Wiki10-31K has $n_{\text{train}}$ = 14,146, $n_{\text{test}}$ = 6,616, and $\lvert\mathcal{A}\rvert$ = 30,938. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Adam' as an optimizer and 'scikit-learn' for clustering, but does not provide specific version numbers for these software dependencies or any other libraries. |
| Experiment Setup | Yes | We tuned the weight decay hyperparameter, learning rate, batch size, and the number of irrelevant actions for variance reduction for the baseline methods (i.e., IPS-PG and DR-PG) using the test policy value, while we use a fixed set of hyperparameters for POTEC as shown in Table 3... For all methods, we used Adam (Kingma & Ba, 2014) as the optimizer and used neural networks with 3 hidden layers to parameterize the policy. |
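
The rows above describe logged bandit data made of (context, action, reward) tuples drawn under a logging policy, and a two-stage policy whose first stage is policy-based and whose second stage is regression-based. The following is a minimal sketch of both ideas in plain Python. It is not the authors' implementation: the softmax logging policy, the coin-flip reward, the function names `collect_logged_data` and `two_stage_action`, and the rule "sample a cluster, then take the highest-scoring action inside it" are all illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def collect_logged_data(n, n_actions, dim, seed=0):
    """Draw n tuples (x_i, a_i, r_i) under a hypothetical logging policy.

    Assumption: the logging policy is a softmax over random linear scores,
    and the reward is a placeholder coin flip (no real reward model).
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_actions)]
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        scores = [sum(w * v for w, v in zip(w_a, x)) for w_a in weights]
        a = rng.choices(range(n_actions), weights=softmax(scores))[0]
        r = float(rng.random() < 0.5)  # placeholder binary reward
        data.append((x, a, r))
    return data


def two_stage_action(x, cluster_probs_fn, score_fn, clusters, rng):
    """Pick an action in two stages (illustrative assumption).

    Stage 1: sample a cluster index from cluster_probs_fn(x).
    Stage 2: return the action in that cluster with the highest score_fn(x, a).
    """
    probs = cluster_probs_fn(x)
    c = rng.choices(range(len(clusters)), weights=probs)[0]
    a = max(clusters[c], key=lambda act: score_fn(x, act))
    return c, a


# Usage with toy stand-ins for the learned first-stage policy and regression model.
clusters = [[0, 1, 2], [3, 4, 5]]
logged = collect_logged_data(n=100, n_actions=6, dim=3)
rng = random.Random(1)
c, a = two_stage_action(
    logged[0][0],
    lambda x: [0.5, 0.5],          # hypothetical uniform first-stage policy
    lambda x, act: -abs(act - 4),  # hypothetical second-stage scores
    clusters,
    rng,
)
```

In this sketch the first stage only has to choose among the clusters rather than among all actions, which is the motivation for decomposing the policy when the action space is large.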