Improved Online Confidence Bounds for Multinomial Logistic Bandits

Authors: Joongkyu Lee, Min-Hwan Oh

ICML 2025

Each entry below lists a reproducibility variable, its result, and the LLM response.
Research Type: Experimental. We empirically evaluate the performance of our algorithms, OFU-MNL++ and OFU-M2NL, by measuring cumulative regret over T = 3000 rounds. The algorithms are tested on 20 independent instances, and we report the average performance along with a shaded area representing two standard deviations. In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1). The baselines are practical and state-of-the-art algorithms: the UCB-based algorithm UCB-MNL (Oh & Iyengar, 2019), the Thompson Sampling-based algorithm TS-MNL (Oh & Iyengar, 2019), and the constant-time algorithm OFU-MNL+ (Lee & Oh, 2024). Figure 1 shows that both of our algorithms significantly outperform the baseline algorithms.
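The evaluation protocol described here (20 independent instances, cumulative regret averaged with a shaded two-standard-deviation band) can be sketched as follows. The per-round regret values below are random placeholders standing in for the bandit algorithms' actual output, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 3000, 20  # rounds and independent instances, as in the setup above

# Placeholder per-round regrets for the 20 instances
# (stand-in data; real values would come from running the algorithms).
per_round_regret = rng.uniform(0.0, 1.0, size=(n_runs, T))
cumulative = per_round_regret.cumsum(axis=1)   # cumulative regret, shape (20, 3000)

mean = cumulative.mean(axis=0)                 # average over the 20 instances
band = 2.0 * cumulative.std(axis=0)            # two standard deviations
lower, upper = mean - band, mean + band        # bounds of the shaded region
```

Plotting `mean` with the region between `lower` and `upper` shaded reproduces the style of curve reported in Figure 1.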
Researcher Affiliation: Academia. Seoul National University, Seoul, Korea.
Pseudocode: Yes. Algorithm 1: OFU-MNL++; Algorithm 2: RS-OMD (Restricted Space OMD).
Open Source Code: No. The paper does not contain any explicit statement about open-sourcing the code for the described methodology, nor does it link to a code repository.
Open Datasets: No. In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1). This indicates that the experimental data is generated synthetically, not taken from an existing public dataset for which access information would be provided.
Dataset Splits: No. The paper describes generating synthetic data for its experiments, rather than using pre-existing datasets with defined splits. It states: "In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1)."
Hardware Specification: No. The paper does not specify the hardware used to run the experiments; no GPU or CPU models or other hardware details are mentioned.
Software Dependencies: No. The paper does not list software dependencies with version numbers for reproducing the experiments.
Experiment Setup: Yes. We empirically evaluate the performance of our algorithms, OFU-MNL++ and OFU-M2NL, by measuring cumulative regret over T = 3000 rounds. The algorithms are tested on 20 independent instances, and we report the average performance along with a shaded area representing two standard deviations. In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1). We set the number of items to N = 50 and the maximum assortment size to K = 5. As an additional experiment, Figure G.1 presents results for a larger value of d, specifically d = 10. In our experiments, since the threshold τ_t is too conservative in practice, we empirically tuned the hyperparameter τ_t for OFU-MNL++ by searching over a certain range of values while maintaining its inverse relationship with α (i.e., a higher τ_t corresponds to a lower α).
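The synthetic environment in this setup (parameter and context features drawn uniformly from balls, rewards from Unif(0, 1), choices over assortments of up to K items) can be sketched as below. The function and variable names, the base dimension d = 5, and the standard multinomial-logit choice form with a no-purchase option of utility zero are my assumptions for illustration, not details confirmed in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K, B = 5, 50, 5, 1.0   # d is illustrative; N = 50, K = 5 match the setup above

def sample_in_ball(radius, dim, size, rng):
    """Sample `size` points uniformly from the dim-dimensional ball of given radius."""
    x = rng.normal(size=(size, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)            # uniform direction on the sphere
    r = radius * rng.uniform(size=(size, 1)) ** (1.0 / dim)  # radial inverse-CDF sampling
    return r * x

w_star = sample_in_ball(B, d, 1, rng)[0]    # true parameter w, ||w|| <= B
contexts = sample_in_ball(1.0, d, N, rng)   # item features x_{t,i} in the unit ball B^d(1)
rewards = rng.uniform(0.0, 1.0, size=N)     # r_{t,i} ~ Unif(0, 1)

def mnl_choice_probs(assortment):
    """MNL choice probabilities over an assortment, assuming an outside option of utility 0."""
    utilities = contexts[assortment] @ w_star
    expu = np.exp(utilities)
    return expu / (1.0 + expu.sum())        # the 1 accounts for the no-purchase option

assortment = np.arange(K)                   # an illustrative assortment of K items
probs = mnl_choice_probs(assortment)
```

A bandit algorithm run on this environment would, each round, pick an assortment of at most K items, observe a choice drawn from `mnl_choice_probs`, and accumulate the regret against the expected-revenue-maximizing assortment.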