Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback
Authors: Qiwei Di, Jiafan He, Quanquan Gu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to evaluate our proposed algorithm RCDB against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback. |
| Researcher Affiliation | Academia | Department of Computer Science, University of California, Los Angeles, CA 90095, USA. Correspondence to: Quanquan Gu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Robust Contextual Dueling Bandit (RCDB) Algorithm 2 Robust Contextual Dueling Bandit for Sigmoid link function (RCDB-S) |
| Open Source Code | No | The paper mentions: "We conduct experiments to validate the effectiveness of our algorithm RCDB (See Appendix E)." However, there is no explicit statement about releasing the source code, nor is a link to a code repository provided. |
| Open Datasets | No | Preference Model. We study the effect of adversarial feedback with the preference model determined by (3.1), where σ(x) = 1/(1 + e^{−x}). We randomly generate the underlying parameter in [−0.5, 0.5]^d and normalize it to be a vector with ∥θ*∥_2 = 2. |
| Dataset Splits | No | The paper describes generating synthetic data for a bandit problem, which inherently does not involve traditional training/test/validation splits. It specifies the number of rounds T=2000, but no data partitioning. |
| Hardware Specification | No | The paper states: "In this section, we conduct simulation experiments to verify our theoretical results." However, it does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for these simulations. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or their version numbers (e.g., programming languages, libraries, frameworks, or solvers). |
| Experiment Setup | Yes | E.1 Experiment Setup Preference Model. We study the effect of adversarial feedback with the preference model determined by (3.1), where σ(x) = 1/(1 + e^{−x}). We randomly generate the underlying parameter in [−0.5, 0.5]^d and normalize it to be a vector with ∥θ*∥_2 = 2. Then, we set it to be the underlying parameter and construct the reward utilized in the preference model as r*(x, a) = ⟨θ*, ϕ(x, a)⟩. We set the action set A = {±1/√d}^d. For simplicity, we assume ϕ(x, a) = a. In our experiment, we set the dimension d = 5, with the size of action set \|A\| = 2^d = 32. Experiment Setup. For each experiment instance, we simulate the interaction with the environment for T = 2000 rounds... We report the cumulative regret averaged across 10 random runs. |
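Since the paper releases no code, the synthetic environment described in E.1 can be sketched as follows. This is a hypothetical reconstruction from the quoted setup, not the authors' implementation; the function names (`sigmoid`, `duel`) and the random seed are our own choices.

```python
import numpy as np
from itertools import product

# Hypothetical sketch of the E.1 synthetic preference environment
# (the paper does not release code; details beyond E.1 are assumptions).
rng = np.random.default_rng(0)

d = 5                                       # feature dimension
theta = rng.uniform(-0.5, 0.5, size=d)      # draw theta* in [-0.5, 0.5]^d
theta = 2 * theta / np.linalg.norm(theta)   # normalize so ||theta*||_2 = 2

# Action set A = {+-1/sqrt(d)}^d, so |A| = 2^d = 32; phi(x, a) = a.
actions = np.array(list(product([-1 / np.sqrt(d), 1 / np.sqrt(d)], repeat=d)))

def sigmoid(x):
    """Logistic link sigma(x) = 1 / (1 + e^{-x}) from model (3.1)."""
    return 1.0 / (1.0 + np.exp(-x))

def duel(i, j):
    """Sample binary preference feedback: True iff action i beats action j,
    with probability sigma(r*(x, a_i) - r*(x, a_j))."""
    p = sigmoid(actions[i] @ theta - actions[j] @ theta)
    return bool(rng.random() < p)
```

An adversary corrupting the feedback would then flip the bit returned by `duel` on its chosen rounds, which is the setting RCDB is evaluated against.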