Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

Authors: Tal Lancewicki, Yishay Mansour

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | "We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first optimal regret bound of Θ(H²√(SAK))... In the unknown-dynamics case we establish a regret bound of O(H³S√(AK)), significantly improving the best known result by a factor of H²S⁵A²." We present the first Policy Optimization algorithms for Online MDPs with aggregate bandit feedback. Under known dynamics, we are the first to establish the near-optimal regret bound of Θ(H²√(SAK)). In the unknown-dynamics case, we achieve a regret of O(H³S√(AK)). |
| Researcher Affiliation | Collaboration | ¹Blavatnik School of Computer Science, Tel Aviv University, Israel; ²Google Research, Tel Aviv, Israel. |
| Pseudocode | Yes | Algorithm 1: Policy Optimization with Aggregated Bandit Feedback and Known Transition Function. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, a repository link, or a mention of code in supplementary materials. |
| Open Datasets | No | The paper focuses on theoretical analysis and algorithm design for online MDPs. It does not describe experiments on specific datasets, nor does it provide access information for any open datasets. |
| Dataset Splits | No | The paper is theoretical and does not involve experiments on datasets; therefore, no dataset splits are discussed. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and regret analysis, not experimental validation. Therefore, it does not mention specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and provides algorithms and proofs, but does not detail any specific software implementations or their versioned dependencies. |
| Experiment Setup | No | The paper presents theoretical algorithms and their regret bounds without describing any experimental setup, hyperparameter values, or specific training configurations. |
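For readability, the two regret bounds quoted above can be written out in display form. The square roots below are an assumption: the extracted text shows "H2 SAK" and "H3S AK", and the blank spaces most plausibly mark dropped √ symbols, consistent with the standard √K dependence of online MDP regret bounds.

```latex
\underbrace{\Theta\!\big(H^2\sqrt{SAK}\big)}_{\text{known dynamics}}
\qquad\qquad
\underbrace{O\!\big(H^3 S\sqrt{AK}\big)}_{\text{unknown dynamics}}
```

Under this reading, the unknown-dynamics bound exceeds the known-dynamics one by a factor of H√S, i.e. H³S√(AK) / (H²√(SAK)) = H√S.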