Second Order Bounds for Contextual Bandits with Function Approximation
Authors: Aldo Pacchiano
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Many works have developed no-regret algorithms for contextual bandits with function approximation, where the mean rewards over context-action pairs belong to a function class F. Although there are many approaches to this problem, algorithms based on the principle of optimism, such as optimistic least squares have gained in importance. The regret of optimistic least squares scales as r O a deluderp Fq logp Fq T where deluderp Fq is a statistical measure of the complexity of the function class F known as eluder dimension. Unfortunately, even if the variance of the measurement noise of the rewards at time t equals σ2 t and these are close to zero, the optimistic least squares algorithm s regret scales with ? T. In this work we are the first to develop algorithms that satisfy regret bounds for contextual bandits with function approximation of the form r O σ a logp Fqdeluderp Fq T deluderp Fq logp|F|q when the variances are unknown and satisfy σ2 t σ for all t and r O ˆ deluderp Fq b logp Fq řT t 1 σ2 t deluderp Fq logp|F|q when the variances change at every time-step. These bounds generalize existing techniques for deriving second order bounds in contextual linear problems. |
| Researcher Affiliation | Academia | Aldo Pacchiano Boston University and Broad Institute of MIT and Harvard |
| Pseudocode | Yes | Algorithm 1 Second Order Optimistic Least Squares 1: Input: function class F, variance upper bound σ2. 2: Set the initial confidence set G0 F . 3: for t 1, 2, do 4: Compute regression function for each threshold level τi B 2i for i P t0u Y rqts where qt rlogptqs f τi t argmin f PGt 1 ℓ 1 pfpxℓ, aℓq rℓq21pωpxℓ, aℓ, Gℓq ď τiq Algorithm 2 Unknown-Variance Second Order Optimistic Least Squares 1: Input: probability parameter δ P p0, 1q, function class F. 2: Set the initial confidence set G1 0 F. 3: for t 1, 2, do 4: Compute regression function for each threshold level τi B 2i for i P rqts where qt rlogptqs f pτi,τi 1s t argmin f PG1 t 1 ℓ 1 pfpxℓ, aℓq rℓq21 ωpxℓ, aℓ, G1 ℓq P pτi, τi 1s |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to code repositories. |
| Open Datasets | No | The paper is theoretical, focusing on developing algorithms and proving regret bounds for contextual bandits. It does not describe or use any specific datasets for experiments, nor does it provide any information on dataset availability. |
| Dataset Splits | No | The paper is theoretical and does not involve experiments with datasets, thus there is no mention of dataset splits. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and proofs; therefore, no experimental hardware is mentioned or specified. |
| Software Dependencies | No | The paper is theoretical and does not describe any experimental setup that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on deriving bounds and presenting algorithms, rather than conducting empirical experiments. Consequently, there are no details on experimental setup or hyperparameters. |