Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

Authors: Tal Lancewicki, Yishay Mansour

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | "We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first optimal regret bound of Θ(H²√(SAK))... In the unknown-dynamics case we establish a regret bound of O(H³S√(AK)), significantly improving the best known result by a factor of H²S⁵A²." We present the first Policy Optimization algorithms for Online MDPs with aggregate bandit feedback. Under known dynamics, we are the first to establish the near-optimal regret bound of Θ(H²√(SAK)). In the unknown-dynamics case, we achieve a regret of O(H³S√(AK)). |
| Researcher Affiliation | Collaboration | ¹Blavatnik School of Computer Science, Tel Aviv University, Israel; ²Google Research, Tel Aviv, Israel. |
| Pseudocode | Yes | Algorithm 1: Policy Optimization with Aggregated Bandit Feedback and Known Transition Function. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code, a repository link, or a mention of code in supplementary materials. |
| Open Datasets | No | The paper focuses on theoretical analysis and algorithm design for online MDPs. It does not describe experiments on specific datasets, nor does it provide access information for any open datasets. |
| Dataset Splits | No | The paper is theoretical and does not involve experiments on datasets; therefore, no dataset splits are discussed. |
| Hardware Specification | No | The paper is theoretical and focuses on algorithm design and regret analysis, not experimental validation. Therefore, it does not mention specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and provides algorithms and proofs, but does not detail any specific software implementations or their versioned dependencies. |
| Experiment Setup | No | The paper presents theoretical algorithms and their regret bounds without describing any experimental setup, hyperparameter values, or specific training configurations. |
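For readability, the two regret bounds quoted above can be written out in display form. The square roots below are an assumption: the extracted text shows "H2 SAK" and "H3S AK", and the blank spaces most plausibly mark dropped √ symbols, consistent with the standard √K dependence of online MDP regret bounds.

```latex
\underbrace{\Theta\!\big(H^2\sqrt{SAK}\big)}_{\text{known dynamics}}
\qquad\qquad
\underbrace{O\!\big(H^3 S\sqrt{AK}\big)}_{\text{unknown dynamics}}
```

Under this reading, the unknown-dynamics bound exceeds the known-dynamics one by a factor of H√S, i.e. H³S√(AK) / (H²√(SAK)) = H√S.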