Convergence of Policy Mirror Descent Beyond Compatible Function Approximation
Authors: Uri Sherman, Tomer Koren, Yishay Mansour
Venue: ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a strictly weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space. |
| Researcher Affiliation | Collaboration | (1) Blavatnik School of Computer Science and AI, Tel Aviv University, Tel Aviv, Israel; (2) Google Research, Tel Aviv, Israel. |
| Pseudocode | Yes | Algorithm 1: Policy Mirror Descent (on-policy). Input: learning rate $\eta > 0$, regularizer $R: \mathbb{R}^A \to \mathbb{R}$. Initialize $\pi_1 \in \Pi$. For $k = 1, \dots, K$: set $\mu_k := \mu_{\pi_k}$ and $\hat{Q}_k := \hat{Q}_{\pi_k}$; then update $\pi_{k+1} = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim \mu_k}\left[ H \cdot D \cdot \hat{Q}_k(s, \pi_s) + \frac{1}{\eta} B_R(\pi_s, \pi_k(s)) \right]$. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories in the main text or supplementary sections. |
| Open Datasets | No | The paper is theoretical in nature and does not conduct experiments using specific datasets. Therefore, it does not mention the availability of any open datasets. |
| Dataset Splits | No | The paper focuses on theoretical analysis and does not involve experimental evaluation on datasets. As such, there is no mention of training/test/validation dataset splits. |
| Hardware Specification | No | The paper is a theoretical work focusing on algorithmic convergence and does not describe any experiments that would require specific hardware. No hardware specifications are mentioned. |
| Software Dependencies | No | The paper presents a theoretical framework and does not mention any specific software or libraries, along with their version numbers, that would be required to reproduce experimental results. |
| Experiment Setup | No | The paper presents a theoretical framework and does not detail any experimental setup, including hyperparameters or system-level training settings, as it does not conduct experiments. |
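
The update in Algorithm 1 can be illustrated in the simplest possible setting. The following is a minimal sketch, not the paper's implementation, under several assumptions that are not in the source: a tabular MDP with a known transition tensor `P` and cost matrix `c`, the full tabular policy class (so the minimization decouples across states and the weighting by $\mu_k$ drops out whenever $\mu_k$ has full support), exact Q-values in place of the estimate $\hat{Q}_k$, and the negative-entropy regularizer, for which $B_R$ is the KL divergence and the update has a closed multiplicative-weights form. The constant multiplying the Q-term is folded into a single hypothetical `scale` parameter, and all function names are illustrative.

```python
import numpy as np


def evaluate_q(P, c, pi, gamma):
    """Exact Q-values of policy pi in a tabular discounted MDP with costs c.

    P has shape (S, A, S), c and pi have shape (S, A).
    """
    S, _ = c.shape
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    c_pi = np.sum(pi * c, axis=1)          # expected one-step cost under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return c + gamma * (P @ V)             # Q[s, a]


def pmd_step(pi, Q, eta, scale=1.0):
    """One KL-regularized mirror descent update, solved in closed form per state.

    Assumption: R is negative entropy, so the arg min is
    pi_new(a|s) proportional to pi(a|s) * exp(-eta * scale * Q[s, a]).
    """
    logits = np.log(pi) - eta * scale * Q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


def policy_mirror_descent(P, c, gamma, eta, K, scale=1.0):
    """Run K iterations starting from the uniform policy."""
    S, A = c.shape
    pi = np.full((S, A), 1.0 / A)
    for _ in range(K):
        Q = evaluate_q(P, c, pi, gamma)
        pi = pmd_step(pi, Q, eta, scale)
    return pi


# Toy problem (hypothetical): 2 states, 2 actions, uniform transitions,
# action 0 costs 0 and action 1 costs 1 in every state.
P = np.full((2, 2, 2), 0.5)
c = np.array([[0.0, 1.0], [0.0, 1.0]])
pi = policy_mirror_descent(P, c, gamma=0.9, eta=1.0, K=20)
```

In this toy problem the policy should concentrate on the zero-cost action. With a restricted policy class $\Pi$, which is the setting the paper analyzes, the per-state closed form no longer applies and the arg min would have to be solved over $\Pi$ directly.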