Triple-Optimistic Learning for Stochastic Contextual Bandits with General Constraints

Authors: Hengquan Guo, Lingkai Zu, Xin Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments using the large-scale learning-to-rank dataset Microsoft MSLR-WEB30k (Qin & Liu, 2013). We adopt a similar general constraint setting as in (Guo & Liu, 2024), where the reward function r(x, a) corresponds to the relevance score (normalized to [0, 1]) assigned to the recommended document and the incoming customer. The cost c(x, a) for each arm is randomly generated from a uniform distribution over [0.5, 1] and remains fixed throughout each trial. We employ gradient-boosted tree regression and the empirical mean to estimate reward and cost functions, respectively. Observations of rewards and costs are perturbed by Gaussian noise N(0, 0.05). The reported experimental results are averaged over 50 trials, with a 95% confidence interval. Figure 1 demonstrates that Optimistic^3 outperforms both LOE2D (Guo & Liu, 2024) and LagrangeCBwLC (Slivkins et al., 2023), which verifies our theoretical results.
Researcher Affiliation | Academia | School of Information Science and Technology, ShanghaiTech University, Shanghai, China. Correspondence to: Xin Liu <EMAIL>.
Pseudocode | Yes | Optimistic^3 learning and decision framework.
Initialization: Q_1 = 0, α ≥ 2, a uniform policy π_0, and the KL divergence D(·||·).
For t = 1, …, T:
  Optimistic Learning: Observe the context x_t; construct the UCB estimator for rewards f̂_t(x_t, a) and the LCB estimator for costs ǧ_t(x_t, a) from the learning oracles O_r, O_c.
  Optimistic Decision: Construct the optimistic surrogate function for any a ∈ A:
    L̂(x_t, a) = f̂_t(x_t, a) − Q_t ǧ_t(x_t, a).  (6)
  Find the optimal policy π_t for probabilistic exploration:
    π_t = argmax_π ⟨π, L̂(x_t)⟩ − α D(π || π_{t−1}).  (7)
  Sample an action a_t ~ π_t.
  Observe Feedback: Observe the noisy reward r_t(x_t, a_t) and cost c_t(x_t, a_t).
  Optimistic Dual Update: Update the optimistic dual variable as
    Q_{t+1} = (Q_t − E_x[⟨π_{t−1}, ǧ_{t−1}(x)⟩ − 2⟨π_t, ǧ_t(x)⟩])_+.
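For a finite action set, one round of the framework above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`omd_kl_step`, `triple_optimistic_round`) and the assumption that eq. (7) reduces to a multiplicative-weights update (the closed form of KL-regularized mirror descent over the simplex) are ours, and the per-action UCB/LCB arrays stand in for whatever learning oracles O_r, O_c produce.

```python
import numpy as np

def omd_kl_step(pi_prev, L_hat, alpha):
    """Solve eq. (7), pi_t = argmax_pi <pi, L_hat> - alpha * KL(pi || pi_prev),
    via its closed-form multiplicative-weights solution over the simplex."""
    w = pi_prev * np.exp(L_hat / alpha)
    return w / w.sum()

def triple_optimistic_round(pi_prev, Q_t, f_ucb, g_lcb, g_lcb_prev, alpha=2.0):
    """One (hypothetical) Optimistic^3 round over a finite action set.

    f_ucb / g_lcb: per-action UCB reward and LCB cost estimates at the
    current context; g_lcb_prev: last round's LCB cost estimates.
    """
    # Eq. (6): optimistic Lagrangian surrogate.
    L_hat = f_ucb - Q_t * g_lcb
    # Eq. (7): KL-regularized policy update.
    pi_t = omd_kl_step(pi_prev, L_hat, alpha)
    # Optimistic dual update: the term 2<pi_t, g_t> - <pi_{t-1}, g_{t-1}>
    # acts as an optimistic prediction of next-round expected cost.
    violation = 2.0 * pi_t @ g_lcb - pi_prev @ g_lcb_prev
    return pi_t, max(Q_t + violation, 0.0)
```

The dual variable Q_t prices constraint violation: a larger Q_t shifts the surrogate in eq. (6) toward low-cost actions, while the projection onto the nonnegative reals keeps the price meaningful.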
Open Source Code | No | No explicit statement about code release or repository link found.
Open Datasets | Yes | In this section, we conduct experiments using the large-scale learning-to-rank dataset Microsoft MSLR-WEB30k (Qin & Liu, 2013).
Dataset Splits | No | No specific details on dataset splits (e.g., training, validation, test percentages or counts) are provided in the paper.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) are mentioned for the experimental setup.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are provided in the paper.
Experiment Setup | Yes | The cost c(x, a) for each arm is randomly generated from a uniform distribution over [0.5, 1] and remains fixed throughout each trial. We employ gradient-boosted tree regression and the empirical mean to estimate reward and cost functions, respectively. Observations of rewards and costs are perturbed by Gaussian noise N(0, 0.05). The reported experimental results are averaged over 50 trials, with a 95% confidence interval.
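The randomized parts of this setup could be reproduced along the following lines. This is a sketch under stated assumptions: the number of arms is illustrative, we read the 0.05 in N(0, 0.05) as the noise standard deviation (the excerpt does not say whether it is the variance), and the 95% interval uses a normal approximation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_arms = 50, 10  # 50 trials per the paper; n_arms is illustrative

# Per-arm costs drawn once per trial from Uniform[0.5, 1], then held fixed.
costs = rng.uniform(0.5, 1.0, size=(n_trials, n_arms))

def noisy(value):
    """Perturb an observed reward/cost with Gaussian noise N(0, 0.05),
    reading 0.05 as the standard deviation (an assumption)."""
    return value + rng.normal(0.0, 0.05)

def ci95_halfwidth(per_trial_results):
    """95% confidence-interval half-width across trials (normal approx.)."""
    x = np.asarray(per_trial_results, dtype=float)
    return 1.96 * x.std(ddof=1) / np.sqrt(len(x))
```

Reported curves would then be the across-trial mean plus or minus `ci95_halfwidth` at each time step.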