Convergence of Policy Mirror Descent Beyond Compatible Function Approximation

Authors: Uri Sherman, Tomer Koren, Yishay Mansour

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type | Theoretical | In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a strictly weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.

Researcher Affiliation | Collaboration | 1. Blavatnik School of Computer Science and AI, Tel Aviv University, Tel Aviv, Israel; 2. Google Research, Tel Aviv, Israel.
Pseudocode | Yes | Algorithm 1 Policy Mirror Descent (on-policy):
    Input: learning rate eta > 0, regularizer R: R^A -> R
    Initialize pi_1 in Pi
    for k = 1 to K do
        Set mu_k := mu_{pi_k}; hat{Q}_k := hat{Q}_{pi_k}
        pi_{k+1} = arg min_{pi in Pi} E_{s ~ mu_k}[ hat{Q}_k(s, pi_s) + (1/eta) * B_R(pi_s, pi_k(s)) ]
    end for
Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories in the main text or supplementary sections.

Open Datasets | No | The paper is theoretical in nature and does not conduct experiments using specific datasets. Therefore, it does not mention the availability of any open datasets.

Dataset Splits | No | The paper focuses on theoretical analysis and does not involve experimental evaluation on datasets. As such, there is no mention of training/test/validation dataset splits.

Hardware Specification | No | The paper is a theoretical work focusing on algorithmic convergence and does not describe any experiments that would require specific hardware. No hardware specifications are mentioned.

Software Dependencies | No | The paper presents a theoretical framework and does not mention any specific software or libraries, along with their version numbers, that would be required to reproduce experimental results.

Experiment Setup | No | The paper presents a theoretical framework and does not detail any experimental setup, including hyperparameters or system-level training settings, as it does not conduct experiments.
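For readers who want a concrete picture of the Algorithm 1 update, the sketch below implements one on-policy PMD step in the tabular case. It is not from the paper: it assumes the negative-entropy regularizer (so the Bregman divergence B_R is the KL divergence), under which the per-state argmin has the standard closed-form exponentiated-gradient solution, and it takes the Q-value estimates hat{Q}_k as given.

```python
import numpy as np

def pmd_update(pi_k, q_hat, eta):
    """One tabular on-policy PMD step with R = negative entropy.

    Solves, independently per state s,
        argmin_p  <q_hat(s, .), p> + (1/eta) * KL(p, pi_k(s))
    over the probability simplex, whose closed form is
        pi_{k+1}(s) ∝ pi_k(s) * exp(-eta * q_hat(s, .)).

    pi_k  : (S, A) array, current policy (rows sum to 1)
    q_hat : (S, A) array, estimated Q-values of pi_k
    eta   : learning rate > 0
    """
    logits = np.log(pi_k) - eta * q_hat
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    pi_next = np.exp(logits)
    pi_next /= pi_next.sum(axis=1, keepdims=True)
    return pi_next
```

Note the general algorithm minimizes the expectation over s ~ mu_k within a restricted policy class Pi; in the unrestricted tabular case that expectation decouples into the independent per-state problems solved above.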