On Stochastic Contextual Bandits with Knapsacks in Small Budget Regime
Authors: Hengquan Guo, Xin Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theory is further supported by experiments conducted on a large-scale dataset. Experiments: We evaluate AUPD using a real-world learning-to-rank dataset in the small budget regime. Our experimental results demonstrate that AUPD outperforms baseline algorithms and achieves the best performance under various initial budgets. The large-scale learning-to-rank dataset MSLR-WEB30K (Qin & Liu, 2013) contains 31,278 queries. Each query includes various document-query contexts, each with a dimensionality of 136, and there are 20 documents/arms. The reward function r(x, a) is defined as the relevance of each document-context pair as collected in the dataset. We set the time horizon T = 5000 and vary the budgets B ∈ {100, 600, 1000}. Our experimental results are shown in Figure 1. These results are obtained by averaging over 50 trials and are reported with a 95% confidence interval. |
| Researcher Affiliation | Academia | Hengquan Guo and Xin Liu, School of Information Science and Technology, ShanghaiTech University |
| Pseudocode | Yes | Algorithm 1: Adaptive and Universal Primal-Dual Algorithm for CBwK |
| Open Source Code | No | The paper does not contain any explicit statements about releasing code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use the MSLR-WEB30K dataset (Qin & Liu, 2013), which is available at https://www.microsoft.com/en-us/research/project/mslr/. This dataset has 31,278 arrivals, and the contextual dimension is 136. We extract \|A\| = 20 documents (arms) per query, and the reward is defined as the relevance judgments in the dataset, which take 5 values from 0 (irrelevant) to 4 (perfectly relevant). |
| Dataset Splits | No | The paper mentions that it "randomly draw an arrival from the 31,278 data points" but does not specify any training, test, or validation splits for the dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as CPU or GPU models. |
| Software Dependencies | No | The paper states that "All the algorithms utilize gradient-boosted tree regression as the learning oracle for the reward function and the empirical mean for the cost function." However, it does not specify version numbers for the gradient-boosted tree regression library or any other software dependencies. |
| Experiment Setup | Yes | We set the time horizon T = 5000 and vary the budgets B ∈ {100, 600, 1000} to represent the budget regimes {Θ(√T), Θ(T^{3/4}), Θ(T)}, respectively. The interaction terminates once the budget is exhausted, incurring zero reward and cost for the remaining rounds. Further details on the experiments and hyperparameters can be found in Appendix D. Our experimental results are shown in Figure 1. These results are obtained by averaging over 50 trials and are reported with a 95% confidence interval. Table 2 (parameters in the experiment): AUPD: V = 0.1√T; PGD Adaptive (Chzhen et al., 2024): M_T = 4√T, 2√(T log T), δ = 1/√T; SquareCBwK (Han et al., 2023): U = √(T log T), γ = 0.1 |
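The evaluation protocol extracted above (run until the budget is exhausted, collect zero reward afterward, average over 50 trials, report a 95% confidence interval) can be sketched as follows. This is an illustrative harness only: the uniformly random reward/cost draws stand in for the AUPD policy and the dataset, which are not reproduced here, and all function names are our own.

```python
import math
import random

def run_trial(T=5000, B=100, seed=0):
    """One bandits-with-knapsacks trial: play until the budget B is
    exhausted, then earn zero reward/cost for the remaining rounds,
    matching the termination rule described in the setup."""
    rng = random.Random(seed)
    budget, total_reward = float(B), 0.0
    for _ in range(T):
        if budget <= 0:        # budget exhausted: stop accruing reward
            break
        reward = rng.random()  # placeholder for the policy's reward
        cost = rng.random()    # placeholder per-round resource consumption
        total_reward += reward
        budget -= cost
    return total_reward

def mean_ci95(values):
    """Sample mean and 95% normal-approximation confidence half-width."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, 1.96 * math.sqrt(var / n)

# Average over 50 trials, as in the reported experiments.
rewards = [run_trial(B=100, seed=s) for s in range(50)]
m, h = mean_ci95(rewards)
print(f"avg reward over 50 trials: {m:.1f} +/- {h:.1f}")
```

Repeating the loop for B = 600 and B = 1000 reproduces the three budget regimes; only the per-round reward/cost draws would need to be replaced by the actual policy and dataset.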