Improved Online Confidence Bounds for Multinomial Logistic Bandits
Authors: Joongkyu Lee, Min-Hwan Oh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate the performance of our algorithms, OFU-MNL++ and OFU-M2NL, by measuring cumulative regret over T = 3000 rounds. The algorithms are tested on 20 independent instances, and we report the average performance along with a shaded area representing two standard deviations. In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1). The baselines are the practical and state-of-the-art algorithms: the UCB-based algorithm, UCB-MNL (Oh & Iyengar, 2019), the Thompson Sampling-based algorithm, TS-MNL (Oh & Iyengar, 2019), and the constant-time algorithm, OFU-MNL+ (Lee & Oh, 2024). Figure 1 shows that both of our algorithms significantly outperform the baseline algorithms. |
| Researcher Affiliation | Academia | 1Seoul National University, Seoul, Korea. |
| Pseudocode | Yes | Algorithm 1 OFU-MNL++ Algorithm 2 RS-OMD, Restricted Space OMD |
| Open Source Code | No | The paper does not contain any explicit statement about open-sourcing the code for the methodology described, nor does it provide any links to a code repository. |
| Open Datasets | No | In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1). This indicates that data is generated for the experiments, not taken from an existing public dataset for which access information would be provided. |
| Dataset Splits | No | The paper describes generating synthetic data for its experiments, rather than using pre-existing datasets with defined splits. It states: "In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1)." |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. It does not mention any GPU or CPU models, or other hardware specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for reproducing the experiments. |
| Experiment Setup | Yes | We empirically evaluate the performance of our algorithms, OFU-MNL++ and OFU-M2NL, by measuring cumulative regret over T = 3000 rounds. The algorithms are tested on 20 independent instances, and we report the average performance along with a shaded area representing two standard deviations. In each instance, the true underlying parameter w is uniformly sampled from the d-dimensional ball B^d(B) of radius B, and the context features x_{t,i} are drawn from B^d(1). The rewards are sampled from a uniform distribution in each round, i.e., r_{t,i} ~ Unif(0, 1). We set the number of items to N = 50 and the maximum assortment size to K = 5. As an additional experiment, Figure G.1 presents results for a larger value of d, specifically d = 10. In our experiments, since the threshold τ_t is too conservative in practice, we empirically tuned the hyperparameter τ_t for OFU-MNL++ by searching over a certain range of values while maintaining its inverse relationship with α (i.e., a higher τ_t corresponds to a lower α). |
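The synthetic instance generation quoted above (parameter w uniform on the ball B^d(B), contexts on B^d(1), rewards Unif(0, 1)) can be sketched as follows. This is a minimal illustration, not the authors' code: the dimension d = 5 is an assumed placeholder (the report only states d = 10 for the additional experiment), and `sample_ball`/`make_instance` are hypothetical helper names.

```python
import numpy as np

def sample_ball(rng, d, radius):
    """Draw a point uniformly from the d-dimensional ball of the given radius
    (Gaussian direction, radius scaled by U^(1/d))."""
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return radius * rng.uniform() ** (1.0 / d) * v

def make_instance(rng, d=5, N=50, B=1.0):
    """Generate one synthetic MNL bandit instance as described in the report.
    d = 5 and B = 1.0 are illustrative assumptions."""
    w = sample_ball(rng, d, B)                                    # true parameter, w in B^d(B)
    X = np.stack([sample_ball(rng, d, 1.0) for _ in range(N)])    # context features x_{t,i} in B^d(1)
    r = rng.uniform(0.0, 1.0, size=N)                             # rewards r_{t,i} ~ Unif(0, 1)
    return w, X, r

rng = np.random.default_rng(0)
w, X, r = make_instance(rng)
```

With N = 50 items per instance, an assortment of at most K = 5 items would then be chosen from these contexts in each of the T = 3000 rounds.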