Catoni Contextual Bandits are Robust to Heavy-tailed Rewards
Authors: Chenlu Ye, Yujia Jin, Alekh Agarwal, Tong Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range R as well as the number of rounds T. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound. |
| Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign 2OpenAI (Work done during an internship at Google) 3Google Research 4University of Illinois Urbana-Champaign. Correspondence to: Chenlu Ye <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Catoni-OFUL... Algorithm 2 Variance-Agnostic Catoni Bandit |
| Open Source Code | No | Additionally, while we obtain information-theoretic results in this paper, the algorithms are not easy to implement, both because OFUL-style algorithms are always tricky due to the version space structure, and the function-dependent choice of θ in the way we invoke the Catoni estimator makes things even harder. |
| Open Datasets | No | The paper is theoretical and focuses on developing algorithmic approaches and establishing regret bounds for contextual bandits with heavy-tailed rewards. It does not conduct empirical studies using specific datasets. Examples like 'waiting times in wireless communication networks', 'stock prices in financial markets', or 'value returns for online advertising' are mentioned as scenarios where heavy-tailed rewards arise, not as datasets used in experiments. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical evaluations using datasets, therefore, no dataset splits are provided. |
| Hardware Specification | No | The paper is theoretical and focuses on mathematical analysis and algorithm design without conducting empirical experiments, so no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and focuses on mathematical analysis and algorithm design without conducting empirical experiments, so no software dependencies with specific version numbers are mentioned. |
| Experiment Setup | No | The paper is theoretical and focuses on mathematical analysis and algorithm design without conducting empirical experiments, so no experimental setup details or hyperparameters are provided. |
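Since the paper builds on Catoni's estimator but (as noted under Open Source Code) ships no implementation, the following is a minimal illustrative sketch of the classical Catoni M-estimator of the mean, which the paper's algorithms invoke as a building block. This is not the paper's Catoni-OFUL algorithm: the function names, the bisection solver, and the known-variance scale parameter `theta` are our own assumptions for a standalone demo.

```python
import math

def psi(x):
    # Catoni's influence function: psi(x) = log(1 + x + x^2/2) for x >= 0,
    # extended as an odd function for x < 0. It grows only logarithmically,
    # which is what tames heavy-tailed samples.
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def catoni_mean(samples, variance, delta=0.05):
    """Catoni's M-estimator: the root mu of sum_i psi(theta*(x_i - mu)) = 0.

    `variance` is an assumed-known variance bound (the paper's first,
    known-variance setting); `delta` is the failure probability. The scale
    theta below is the standard choice balancing bias against deviation.
    """
    n = len(samples)
    log_term = math.log(1.0 / delta)
    theta = math.sqrt(2.0 * log_term / (n * variance * (1.0 + 2.0 * log_term / n)))

    def score(mu):
        return sum(psi(theta * (x - mu)) for x in samples)

    # score(mu) is strictly decreasing in mu, so bisect on a wide bracket.
    lo, hi = min(samples), max(samples)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

On data with a single gross outlier (e.g. 99 samples at 1.0 and one at 1000.0), the empirical mean is pulled to roughly 11, while the Catoni estimate stays close to 1, illustrating the robustness the paper exploits for heavy-tailed rewards.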