Convergence of Policy Mirror Descent Beyond Compatible Function Approximation

Authors: Uri Sherman, Tomer Koren, Yishay Mansour

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type | Theoretical | In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a strictly weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.

Researcher Affiliation | Collaboration | 1. Blavatnik School of Computer Science and AI, Tel Aviv University, Tel Aviv, Israel; 2. Google Research, Tel Aviv, Israel.
Pseudocode | Yes | Algorithm 1 Policy Mirror Descent (on-policy):
    Input: learning rate eta > 0, regularizer R: R^A -> R
    Initialize pi_1 in Pi
    for k = 1 to K do
        Set mu_k := mu_{pi_k}; hat{Q}_k := hat{Q}_{pi_k}
        pi_{k+1} = arg min_{pi in Pi} E_{s ~ mu_k}[ hat{Q}_k(s, pi_s) + (1/eta) * B_R(pi_s, pi_k(s)) ]
    end for
Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories in the main text or supplementary sections.

Open Datasets | No | The paper is theoretical in nature and does not conduct experiments using specific datasets. Therefore, it does not mention the availability of any open datasets.

Dataset Splits | No | The paper focuses on theoretical analysis and does not involve experimental evaluation on datasets. As such, there is no mention of training/test/validation dataset splits.

Hardware Specification | No | The paper is a theoretical work focusing on algorithmic convergence and does not describe any experiments that would require specific hardware. No hardware specifications are mentioned.

Software Dependencies | No | The paper presents a theoretical framework and does not mention any specific software or libraries, along with their version numbers, that would be required to reproduce experimental results.

Experiment Setup | No | The paper presents a theoretical framework and does not detail any experimental setup, including hyperparameters or system-level training settings, as it does not conduct experiments.
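For readers who want a concrete picture of the Algorithm 1 update, the sketch below implements one on-policy PMD step in the tabular case. It is not from the paper: it assumes the negative-entropy regularizer (so the Bregman divergence B_R is the KL divergence), under which the per-state argmin has the standard closed-form exponentiated-gradient solution, and it takes the Q-value estimates hat{Q}_k as given.

```python
import numpy as np

def pmd_update(pi_k, q_hat, eta):
    """One tabular on-policy PMD step with R = negative entropy.

    Solves, independently per state s,
        argmin_p  <q_hat(s, .), p> + (1/eta) * KL(p, pi_k(s))
    over the probability simplex, whose closed form is
        pi_{k+1}(s) ∝ pi_k(s) * exp(-eta * q_hat(s, .)).

    pi_k  : (S, A) array, current policy (rows sum to 1)
    q_hat : (S, A) array, estimated Q-values of pi_k
    eta   : learning rate > 0
    """
    logits = np.log(pi_k) - eta * q_hat
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    pi_next = np.exp(logits)
    pi_next /= pi_next.sum(axis=1, keepdims=True)
    return pi_next
```

Note the general algorithm minimizes the expectation over s ~ mu_k within a restricted policy class Pi; in the unrestricted tabular case that expectation decouples into the independent per-state problems solved above.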