Biased Dueling Bandits with Stochastic Delayed Feedback

Authors: Bongsoo Yi, Yue Kang, Yao Li

TMLR 2024

Each entry below gives a reproducibility variable, its result, and the supporting LLM response.
Research Type: Experimental. "We provide a comprehensive regret analysis for the two proposed algorithms and then evaluate their empirical performance on both synthetic and real datasets. [...] We conduct an empirical evaluation of the performance of RUCB-Delay and MRR-DB-Delay using six synthetic and real-world datasets."
Researcher Affiliation: Academia. Bongsoo Yi (Department of Statistics and Operations Research, University of North Carolina at Chapel Hill); Yue Kang (Department of Statistics, University of California, Davis); Yao Li (Department of Statistics and Operations Research, University of North Carolina at Chapel Hill).
Pseudocode: Yes. "Algorithm 1 RUCB-Delay. Input: time horizon T, α, M, {τ_d}_{d=1}^M, A = {1, 2, ..., K}. Initialization: [...] Algorithm 2 Multi Round-Robin Dueling Bandit with Delayed Feedback (MRR-DB-Delay). Input: time horizon T, {n_m}_{m∈ℕ}. Initialization: γ_1 = 1/2, t = 1, m = 1, A_1 = {1, 2, ..., K}, T_ij(0) = ∅ for all i, j ∈ A_1."
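To make the pseudocode above concrete, here is a minimal, hedged sketch of an RUCB-style dueling-bandit loop with delayed feedback. It uses the standard RUCB optimism rule (empirical win rate plus a sqrt(α·ln t / n) bonus) and buffers each duel's outcome until its geometric delay elapses. All names (`rucb_delay_sketch`, `pending`, `wins`) and the fallback choices are illustrative assumptions, not the paper's exact algorithm.

```python
import math
import random
from collections import defaultdict

def sample_geometric(rng, p):
    """Geometric delay on {1, 2, ...} via inverse-CDF sampling."""
    u = rng.random()
    return max(1, int(math.ceil(math.log1p(-u) / math.log1p(-p))))

def rucb_delay_sketch(pref, T, alpha=1.0, delay_p=0.01, rng=None):
    """Sketch of an RUCB-style loop with delayed feedback.
    pref[i][j] is the probability that arm i beats arm j."""
    rng = rng or random.Random(0)
    K = len(pref)
    wins = [[0.0] * K for _ in range(K)]   # wins[i][j]: observed wins of i over j
    pending = defaultdict(list)            # arrival round -> [(winner, loser), ...]

    def ucb(i, j, t):
        n = wins[i][j] + wins[j][i]
        if n == 0:
            return 1.0                     # fully optimistic when no data yet
        return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)

    for t in range(1, T + 1):
        # 1. Incorporate feedback whose delay has elapsed.
        for winner, loser in pending.pop(t, []):
            wins[winner][loser] += 1
        # 2. Champion: an arm whose UCB against every other arm is >= 1/2.
        champs = [i for i in range(K)
                  if all(ucb(i, j, t) >= 0.5 for j in range(K) if j != i)]
        c = rng.choice(champs) if champs else rng.randrange(K)
        # 3. Opponent: the arm optimistically most likely to beat the champion.
        d = max((j for j in range(K) if j != c), key=lambda j: ucb(j, c, t))
        # 4. Duel; the outcome only arrives after a geometric delay.
        winner, loser = (c, d) if rng.random() < pref[c][d] else (d, c)
        pending[t + sample_geometric(rng, delay_p)].append((winner, loser))
    return wins
```

Feedback still in `pending` when the horizon ends is simply never observed, which mirrors how delayed observations past T contribute nothing to the learner.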
Open Source Code: No. The paper does not provide any links to source code, explicit statements about code release, or mention of code in supplementary materials.
Open Datasets: Yes. Six rankers (K = 6): a preference matrix generated from the six retrieval functions within the full-text search engine of ArXiv.org (Yue & Joachims, 2011). MSLR (K = 5): a 5 × 5 preference matrix introduced by Zoghi et al. (2015a), extracted from a subset of rankers originating from the Microsoft Learning to Rank (MSLR) dataset (Qin & Liu, 2013). Tennis (K = 8): a dataset constructed by Ramamohan et al. (2016), based on the results of tennis matches organized by the Association of Tennis Professionals (ATP) among 8 international tennis players. [...] Car Preference (K = 10): a dataset of car preferences (Abbasnejad et al., 2013) collected from 60 users in the United States. [...] Sushi (K = 16): a dataset derived from the sushi preference dataset (Kamishima, 2003), comprising the preferences of 5,000 Japanese users for 100 different types of sushi; Komiyama et al. (2015; 2016) selected 16 sushi types from the dataset and represented them in a preference matrix.
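Each of these datasets reduces to a K × K preference matrix whose entry P[i][j] is the probability that arm i wins a duel against arm j. A minimal sketch of that data structure and a validity check (the matrix and numbers here are made up for illustration, not taken from any of the datasets above):

```python
# Illustrative 3 x 3 preference matrix: P[i][j] is the probability that
# arm i beats arm j, so P[i][j] + P[j][i] = 1 and P[i][i] = 0.5.
P = [
    [0.50, 0.65, 0.80],
    [0.35, 0.50, 0.70],
    [0.20, 0.30, 0.50],
]

def is_valid_preference_matrix(P, tol=1e-9):
    """Check the skew-symmetry constraint P[i][j] + P[j][i] = 1."""
    K = len(P)
    return all(abs(P[i][j] + P[j][i] - 1.0) < tol
               for i in range(K) for j in range(K))
```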
Dataset Splits: No. The paper describes using synthetic and real-world datasets and conducting 100 runs for regret assessment, but it does not specify training, validation, or test splits; the experimental setup centers on a time horizon for the bandit problem rather than explicit data partitioning.
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies: No. The paper discusses various algorithms and theoretical analyses but does not list any specific software dependencies or their version numbers used for implementation or experimentation.
Experiment Setup: Yes. "For all experiments, we set the time horizon to T = 200,000. [...] Similar to Vernade et al. (2017; 2020), we assume that the delay distribution follows a geometric distribution with p = 0.01, implying a mean E[D] = 100. Also, based on our regret analysis in Theorem 2, we set α = 1.0 for RUCB-Delay. [...] We set the windowing parameter M = 1000."
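The delay model in this setup is easy to reproduce: a geometric distribution with p = 0.01 has mean 1/p = 100, matching E[D] = 100. A minimal sampler (assuming support on {1, 2, ...}; whether the paper counts delays from 0 or 1 is not stated in this excerpt):

```python
import random

def sample_delay(p=0.01, rng=random):
    """Sample a feedback delay D from a geometric distribution on {1, 2, ...}:
    each round, feedback arrives with probability p, so E[D] = 1/p = 100
    when p = 0.01, matching the experiment setup quoted above."""
    d = 1
    while rng.random() >= p:
        d += 1
    return d
```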