Learning mirror maps in policy mirror descent

Authors: Carlo Alfano, Sebastian Towers, Silvia Sapora, Chris Lu, Patrick Rebeschini

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields sub-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e. Grid-World... We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments.
Researcher Affiliation | Academia | Carlo Alfano (Department of Statistics, University of Oxford); Sebastian Towers (FLAIR, University of Oxford); Silvia Sapora (FLAIR and Department of Statistics, University of Oxford); Chris Lu (FLAIR, University of Oxford); Patrick Rebeschini (Department of Statistics, University of Oxford). Correspondence to EMAIL.
Pseudocode | No | The paper describes its algorithms (PMD and AMPO) through mathematical formulations (equations 2, 5, 6, and 7) and textual explanations, but it does not include any distinct pseudocode blocks or algorithm listings with structured steps.
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: The implementation of our experiments can be found at https://github.com/c-alfano/Learning-mirror-maps.
Open Datasets | Yes | We first focus on a tabular environment, i.e. Grid-World (Oh et al., 2020)... We then consider two non-tabular settings, i.e. the Basic Control Suite and the MinAtar suite... Lastly, we tackle continuous control tasks in MuJoCo (Todorov et al., 2012).
Dataset Splits | Yes | We learn a single mirror map by training PMD on a continuous distribution of Grid-World environments, and test PMD with the learned mirror map on five held-out configurations from previous publications (Oh et al., 2020; Chevalier-Boisvert et al., 2024) and on 256 randomly sampled configurations. Our last result consists of testing each learned mirror map across the other environments we consider.
Hardware Specification | Yes | We run on four A40 GPUs, and the optimization process takes roughly 12 hours. We run on eight GTX 1080Ti GPUs, and the optimization process takes roughly 48 hours for a single environment. We run on eight A40 GPUs, and the optimization process takes roughly 24 hours.
Software Dependencies | No | The training procedure is implemented in JAX, using evosax (Lange, 2022a) for the evolution... The whole training procedure is implemented in JAX, using gymnax environments (Lange, 2022b) and evosax (Lange, 2022a) for the evolution. The paper mentions software such as JAX, evosax, gymnax, and Optuna, but does not provide version numbers for these components.
Experiment Setup | Yes | We perform a simple grid search over the hyperparameters to maximize performance for the negative-entropy and ℓ2-norm mirror maps... We optimize the hyperparameters of AMPO for the negative-entropy mirror map for each suite, using the hyperparameter tuning framework Optuna (Akiba et al., 2019). We report the chosen hyperparameters in Appendix E.
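For context on the Research Type row: the paper's NPG baseline is policy mirror descent with the negative-entropy mirror map, whose tabular update has a closed form. A minimal illustrative sketch (the function and variable names here are ours, not the paper's; shown for a single state):

```python
import numpy as np

def pmd_negative_entropy_step(policy, q_values, step_size):
    """One tabular PMD update under the negative-entropy mirror map.

    The mirror-descent step
        pi_{t+1} = argmax_pi <q, pi> - (1/eta) D_h(pi, pi_t)
    with h(pi) = sum_a pi(a) log pi(a) has the closed form
        pi_{t+1}(a)  proportional to  pi_t(a) * exp(eta * q(a)).
    """
    logits = np.log(policy) + step_size * q_values
    logits -= logits.max()              # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Example: the higher-value action gains probability mass.
pi = np.array([0.5, 0.5])
q = np.array([1.0, 0.0])
pi_next = pmd_negative_entropy_step(pi, q, step_size=1.0)
```

Each update is a multiplicative-weights step, which is why this special case coincides with natural policy gradient.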
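The evosax-based evolution mentioned under Software Dependencies follows the standard ask/evaluate/tell pattern of evolution strategies. Since no library versions are pinned, here is a library-agnostic NumPy sketch of an OpenAI-style ES loop; the toy quadratic fitness stands in for the paper's PMD returns, and all names are illustrative:

```python
import numpy as np

def open_es(fitness_fn, init_params, popsize=50, sigma=0.1,
            lr=0.05, generations=200, seed=0):
    """Minimal OpenAI-style evolution strategy: sample antithetic
    Gaussian perturbations, estimate a search gradient from
    fitness-weighted noise, and take an ascent step on the mean."""
    rng = np.random.default_rng(seed)
    theta = np.array(init_params, dtype=float)
    for _ in range(generations):
        eps = rng.standard_normal((popsize, theta.size))
        eps = np.concatenate([eps, -eps])        # antithetic pairs
        fitness = np.array([fitness_fn(theta + sigma * e) for e in eps])
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        grad = (fitness[:, None] * eps).sum(0) / (eps.shape[0] * sigma)
        theta += lr * grad                       # gradient *ascent*
    return theta

# Toy check: maximize -||x - 3||^2, whose optimum is x = [3, 3].
best = open_es(lambda x: -np.sum((x - 3.0) ** 2), np.zeros(2))
```

Antithetic sampling and fitness standardization are the usual variance-reduction choices; evosax packages the same loop behind a strategy interface.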
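The simple grid search described in the Experiment Setup row can be sketched with the standard library alone. The hyperparameter names and the toy objective below are illustrative placeholders, not the paper's actual search space:

```python
import itertools

def grid_search(evaluate, grid):
    """Exhaustively evaluate every hyperparameter combination and
    return the best-scoring configuration (higher is better)."""
    best_score, best_cfg = float("-inf"), None
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Toy objective standing in for average return under a config.
grid = {"step_size": [0.01, 0.1, 1.0], "entropy_coef": [0.0, 0.01]}
best_cfg, best_score = grid_search(
    lambda c: -abs(c["step_size"] - 0.1) - c["entropy_coef"], grid)
```

For the larger AMPO searches the paper switches from this exhaustive pattern to Optuna's sampler-driven trials, which scale better with the number of hyperparameters.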