Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback
Authors: Tal Lancewicki, Yishay Mansour
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We present the first Policy Optimization algorithms for online MDPs with aggregate bandit feedback. In the known-dynamics case, we establish the first near-optimal regret bound of Θ(H²√(SAK)). In the unknown-dynamics case, we achieve a regret bound of O(H³S√(AK)), significantly improving the best known result by a factor of H²S⁵A². |
| Researcher Affiliation | Collaboration | 1Blavatnik School of Computer Science, Tel Aviv University, Israel 2Google Research, Tel Aviv, Israel. |
| Pseudocode | Yes | Algorithm 1 Policy Optimization with Aggregated Bandit Feedback and Known Transition Function |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, a repository link, or mention of code in supplementary materials. |
| Open Datasets | No | The paper focuses on theoretical analysis and algorithm design for online MDPs. It does not describe experiments performed on specific datasets, nor does it provide access information for any open datasets. |
| Dataset Splits | No | The paper is theoretical and does not involve experiments on datasets; therefore, no dataset splits are discussed. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and regret analysis, not experimental validation. Therefore, it does not mention specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and provides algorithms and proofs, but does not detail any specific software implementations or their versioned dependencies. |
| Experiment Setup | No | The paper presents theoretical algorithms and their regret bounds without describing any experimental setup, hyperparameter values, or specific training configurations. |
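The regret bounds quoted in the Research Type row can be rendered in LaTeX as follows. This is a reconstruction from the flattened text: the square-root placement is inferred from the spacing in the extracted abstract, and the tildes (logarithmic factors) are an assumption based on the paper's "near-optimal" phrasing.

```latex
% Known dynamics: first near-optimal regret bound (tilde assumed)
\tilde{\Theta}\!\left(H^{2}\sqrt{SAK}\right)

% Unknown dynamics: improves the best known result
% by a factor of H^{2} S^{5} A^{2} (tilde assumed)
\tilde{O}\!\left(H^{3} S \sqrt{AK}\right)
```

Here H is the horizon, S the number of states, A the number of actions, and K the number of episodes, following standard online-MDP notation.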