Triple-Optimistic Learning for Stochastic Contextual Bandits with General Constraints
Authors: Hengquan Guo, Lingkai Zu, Xin Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments using the large-scale learning-to-rank dataset Microsoft MSLR-WEB30k (Qin & Liu, 2013). We adopt a similar general constraint setting as in (Guo & Liu, 2024), where the reward function r(x, a) corresponds to the relevance score (normalized to [0, 1]) assigned to the recommended document and the incoming customer. The cost c(x, a) for each arm is randomly generated from a uniform distribution over [0.5, 1] and remains fixed throughout each trial. We employ gradient-boosted tree regression and the empirical mean to estimate reward and cost functions, respectively. Observations of rewards and costs are perturbed by Gaussian noise N(0, 0.05). The reported experimental results are averaged over 50 trials, with a 95% confidence interval. Figure 1 demonstrates that Optimistic³ outperforms both LOE2D (Guo & Liu, 2024) and LagrangeCBwLC (Slivkins et al., 2023), which verifies our theoretical results. |
| Researcher Affiliation | Academia | 1School of Information Science and Technology, ShanghaiTech University, Shanghai, China. Correspondence to: Xin Liu <EMAIL>. |
| Pseudocode | Yes | Optimistic³ learning and decision framework. Initialization: Q₁ = 0, α ≥ 2, uniform policy π₀, and KL divergence D(·‖·). For t = 1, …, T: Optimistic Learning: observe the context xₜ and construct the UCB estimator for rewards f̂ₜ(xₜ, a) and the LCB estimator for costs ǧₜ(xₜ, a) from the learning oracles {Oᵣ, O_c}. Optimistic Decision: construct the optimistic surrogate function L̂(xₜ, a) = f̂ₜ(xₜ, a) − Qₜ ǧₜ(xₜ, a) for any a ∈ A (Eq. 6); find the optimal policy for probabilistic exploration, πₜ = argmaxπ ⟨π, L̂(xₜ)⟩ − αD(π‖πₜ₋₁) (Eq. 7); sample an action aₜ ∼ πₜ. Observe Feedback: observe the noisy reward rₜ(xₜ, aₜ) and cost cₜ(xₜ, aₜ). Optimistic Dual Update: update the optimistic dual variable as Qₜ₊₁ = (Qₜ − 𝔼ₓ[⟨πₜ₋₁, ǧₜ₋₁(x)⟩ − 2⟨πₜ, ǧₜ(x)⟩])₊ |
| Open Source Code | No | No explicit statement about code release or repository link found. |
| Open Datasets | Yes | In this section, we conduct experiments using the large-scale learning-to-rank dataset Microsoft MSLR-WEB30k (Qin & Liu, 2013). |
| Dataset Splits | No | No specific details on dataset splits (e.g., training, validation, test percentages or counts) are provided in the paper. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) are mentioned for the experimental setup. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are provided in the paper. |
| Experiment Setup | Yes | The cost c(x, a) for each arm is randomly generated from a uniform distribution over [0.5, 1] and remains fixed throughout each trial. We employ gradient-boosted tree regression and the empirical mean to estimate reward and cost functions, respectively. Observations of rewards and costs are perturbed by Gaussian noise N(0, 0.05). The reported experimental results are averaged over 50 trials, with a 95% confidence interval. |
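The pseudocode row above can be illustrated with a minimal single-context (K-armed) sketch of the triple-optimistic primal-dual loop. This is not the authors' implementation: it assumes a scalar budget constraint ⟨π, c⟩ ≤ b, Hoeffding-style confidence bonuses in place of the paper's learning oracles, and uses the closed-form exponentiated-gradient solution of the KL-regularized step (7), i.e. πₜ ∝ πₜ₋₁ · exp(L̂/α). The toy environment, budget `b`, and the value of `alpha` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 2000, 5                       # horizon and number of arms (toy sizes, not from the paper)

# Hypothetical stationary environment: costs drawn from U[0.5, 1] and
# observations perturbed by N(0, 0.05) noise, mirroring the experiment setup.
true_r = rng.uniform(0.0, 1.0, K)
true_c = rng.uniform(0.5, 1.0, K)
noise_sd = 0.05
alpha = 2.0                          # KL-regularization weight (assumed value)
budget = 0.75                        # per-round cost budget b (hypothetical)

counts = np.zeros(K)                 # pull counts per arm
r_hat = np.zeros(K)                  # empirical mean reward per arm
c_hat = np.zeros(K)                  # empirical mean cost per arm
pi_prev = np.full(K, 1.0 / K)        # uniform initial policy pi_0
Q = 0.0                              # optimistic dual variable, Q_1 = 0
g_prev = 0.0                         # previous expected LCB constraint value (assumed init)

for t in range(1, T + 1):
    # Optimistic Learning: UCB reward / LCB cost via Hoeffding-style bonuses.
    bonus = np.sqrt(2.0 * np.log(t + 1) / np.maximum(counts, 1.0))
    f_ucb = np.clip(r_hat + bonus, 0.0, 1.0)
    g_lcb = np.clip(c_hat - bonus, 0.0, 1.0) - budget   # constraint form <pi, g> <= 0

    # Optimistic Decision: surrogate L(a) = f_ucb(a) - Q * g_lcb(a)  (Eq. 6),
    # then the exponentiated-gradient solution of the KL-regularized step (Eq. 7).
    L = f_ucb - Q * g_lcb
    w = pi_prev * np.exp((L - L.max()) / alpha)         # max-shift for numerical stability
    pi = w / w.sum()

    # Observe Feedback: sample an action and receive noisy reward/cost.
    a = rng.choice(K, p=pi)
    counts[a] += 1
    r_hat[a] += (true_r[a] + rng.normal(0, noise_sd) - r_hat[a]) / counts[a]
    c_hat[a] += (true_c[a] + rng.normal(0, noise_sd) - c_hat[a]) / counts[a]

    # Optimistic Dual Update: extrapolated constraint value 2<pi_t, g_t> - <pi_{t-1}, g_{t-1}>.
    g_cur = float(np.dot(pi, g_lcb))
    Q = max(0.0, Q + 2.0 * g_cur - g_prev)
    g_prev = g_cur
    pi_prev = pi
```

The exponentiated-gradient form follows from the KL regularizer: maximizing ⟨π, L̂⟩ − αD(π‖πₜ₋₁) over the simplex yields a multiplicative-weights update, so no generic solver is needed for step (7).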