Triple-Optimistic Learning for Stochastic Contextual Bandits with General Constraints

Authors: Hengquan Guo, Lingkai Zu, Xin Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments using the large-scale learning-to-rank dataset Microsoft MSLR-WEB30k (Qin & Liu, 2013). We adopt a similar general constraint setting as in (Guo & Liu, 2024), where the reward function r(x, a) corresponds to the relevance score (normalized to [0, 1]) assigned to the recommended document and the incoming customer. The cost c(x, a) for each arm is randomly generated from a uniform distribution over [0.5, 1] and remains fixed throughout each trial. We employ gradient-boosted tree regression and the empirical mean to estimate reward and cost functions, respectively. Observations of rewards and costs are perturbed by Gaussian noise N(0, 0.05). The reported experimental results are averaged over 50 trials, with a 95% confidence interval. Figure 1 demonstrates that Optimistic^3 outperforms both LOE2D (Guo & Liu, 2024) and LagrangeCBwLC (Slivkins et al., 2023), which verifies our theoretical results.
Researcher Affiliation | Academia | School of Information Science and Technology, ShanghaiTech University, Shanghai, China. Correspondence to: Xin Liu <EMAIL>.
Pseudocode | Yes | Optimistic^3 learning and decision framework.
Initialization: Q_1 = 0, α ≥ 2, a uniform policy π_0, and the KL divergence D(·||·).
For t = 1, …, T:
  Optimistic Learning: Observe the context x_t; construct the UCB estimator for rewards f̂_t(x_t, a) and the LCB estimator for costs ǧ_t(x_t, a) from the learning oracles O_r, O_c.
  Optimistic Decision: Construct the optimistic surrogate function for any a ∈ A:
    L̂(x_t, a) = f̂_t(x_t, a) − Q_t ǧ_t(x_t, a).  (6)
  Find the optimal policy π_t for probabilistic exploration:
    π_t = argmax_π ⟨π, L̂(x_t)⟩ − α D(π || π_{t−1}).  (7)
  Sample an action a_t ~ π_t.
  Observe Feedback: Observe the noisy reward r_t(x_t, a_t) and cost c_t(x_t, a_t).
  Optimistic Dual Update: Update the optimistic dual variable as
    Q_{t+1} = (Q_t − E_x[⟨π_{t−1}, ǧ_{t−1}(x)⟩ − 2⟨π_t, ǧ_t(x)⟩])_+.
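For a finite action set, one round of the framework above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`omd_kl_step`, `triple_optimistic_round`) and the assumption that eq. (7) reduces to a multiplicative-weights update (the closed form of KL-regularized mirror descent over the simplex) are ours, and the per-action UCB/LCB arrays stand in for whatever learning oracles O_r, O_c produce.

```python
import numpy as np

def omd_kl_step(pi_prev, L_hat, alpha):
    """Solve eq. (7), pi_t = argmax_pi <pi, L_hat> - alpha * KL(pi || pi_prev),
    via its closed-form multiplicative-weights solution over the simplex."""
    w = pi_prev * np.exp(L_hat / alpha)
    return w / w.sum()

def triple_optimistic_round(pi_prev, Q_t, f_ucb, g_lcb, g_lcb_prev, alpha=2.0):
    """One (hypothetical) Optimistic^3 round over a finite action set.

    f_ucb / g_lcb: per-action UCB reward and LCB cost estimates at the
    current context; g_lcb_prev: last round's LCB cost estimates.
    """
    # Eq. (6): optimistic Lagrangian surrogate.
    L_hat = f_ucb - Q_t * g_lcb
    # Eq. (7): KL-regularized policy update.
    pi_t = omd_kl_step(pi_prev, L_hat, alpha)
    # Optimistic dual update: the term 2<pi_t, g_t> - <pi_{t-1}, g_{t-1}>
    # acts as an optimistic prediction of next-round expected cost.
    violation = 2.0 * pi_t @ g_lcb - pi_prev @ g_lcb_prev
    return pi_t, max(Q_t + violation, 0.0)
```

The dual variable Q_t prices constraint violation: a larger Q_t shifts the surrogate in eq. (6) toward low-cost actions, while the projection onto the nonnegative reals keeps the price meaningful.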
Open Source Code | No | No explicit statement about code release or repository link found.
Open Datasets | Yes | In this section, we conduct experiments using the large-scale learning-to-rank dataset Microsoft MSLR-WEB30k (Qin & Liu, 2013).
Dataset Splits | No | No specific details on dataset splits (e.g., training, validation, test percentages or counts) are provided in the paper.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) are mentioned for the experimental setup.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) are provided in the paper.
Experiment Setup | Yes | The cost c(x, a) for each arm is randomly generated from a uniform distribution over [0.5, 1] and remains fixed throughout each trial. We employ gradient-boosted tree regression and the empirical mean to estimate reward and cost functions, respectively. Observations of rewards and costs are perturbed by Gaussian noise N(0, 0.05). The reported experimental results are averaged over 50 trials, with a 95% confidence interval.
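The randomized parts of this setup could be reproduced along the following lines. This is a sketch under stated assumptions: the number of arms is illustrative, we read the 0.05 in N(0, 0.05) as the noise standard deviation (the excerpt does not say whether it is the variance), and the 95% interval uses a normal approximation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_arms = 50, 10  # 50 trials per the paper; n_arms is illustrative

# Per-arm costs drawn once per trial from Uniform[0.5, 1], then held fixed.
costs = rng.uniform(0.5, 1.0, size=(n_trials, n_arms))

def noisy(value):
    """Perturb an observed reward/cost with Gaussian noise N(0, 0.05),
    reading 0.05 as the standard deviation (an assumption)."""
    return value + rng.normal(0.0, 0.05)

def ci95_halfwidth(per_trial_results):
    """95% confidence-interval half-width across trials (normal approx.)."""
    x = np.asarray(per_trial_results, dtype=float)
    return 1.96 * x.std(ddof=1) / np.sqrt(len(x))
```

Reported curves would then be the across-trial mean plus or minus `ci95_halfwidth` at each time step.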