Learning mirror maps in policy mirror descent
Authors: Carlo Alfano, Sebastian Towers, Silvia Sapora, Chris Lu, Patrick Rebeschini
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e. Grid-World... We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments. |
| Researcher Affiliation | Academia | Carlo Alfano, Department of Statistics, University of Oxford; Sebastian Towers, FLAIR, University of Oxford; Silvia Sapora, FLAIR and Department of Statistics, University of Oxford; Chris Lu, FLAIR, University of Oxford; Patrick Rebeschini, Department of Statistics, University of Oxford. Correspondence to EMAIL. |
| Pseudocode | No | The paper describes algorithms (PMD and AMPO) using mathematical formulations (equations 2, 5, 6, 7) and textual explanations, but it does not include any distinct pseudocode blocks or algorithm listings with structured steps. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT The implementation of our experiments can be found at https://github.com/c-alfano/Learning-mirror-maps. |
| Open Datasets | Yes | We first focus on a tabular environment, i.e. Grid-World (Oh et al., 2020)... We then consider two non-tabular settings, i.e. the Basic Control Suite and the MinAtar Suite... Lastly, we tackle continuous control tasks in MuJoCo (Todorov et al., 2012) |
| Dataset Splits | Yes | We learn a single mirror map by training PMD on a continuous distribution of Grid-World environments, and test PMD with the learned mirror map on five held-out configurations from previous publications (Oh et al., 2020; Chevalier-Boisvert et al., 2024) and on 256 randomly sampled configurations. Our last result consists in testing each learned mirror map across the other environments we consider. |
| Hardware Specification | Yes | We run on four A40 GPUs, and the optimization process takes roughly 12 hours. We run on 8 GTX 1080Ti GPUs, and the optimization process takes roughly 48 hours for a single environment. We run on eight A40 GPUs, and the optimization process takes roughly 24 hours. (The three quotes refer to different experimental suites.) |
| Software Dependencies | No | The training procedure is implemented in Jax, using evosax (Lange, 2022a) for the evolution... The whole training procedure is implemented in JAX, using gymnax environments (Lange, 2022b) and evosax (Lange, 2022a) for the evolution. The paper mentions software like Jax, evosax, gymnax, and Optuna but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | We perform a simple grid-search over the hyperparameters to maximize the performance for the negative entropy and the ℓ2-norm mirror maps... We report the chosen hyperparameters in Appendix E. We optimize the hyper-parameters of AMPO for the negative entropy mirror map for each suite, using the hyper-parameter tuning framework Optuna (Akiba et al., 2019). We report the chosen hyper-parameters in Appendix E. |
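For context on the baseline the paper improves upon: under the negative-entropy mirror map, the tabular PMD update has the well-known closed form π'(a) ∝ π(a)·exp(η·Q(a)), i.e. the softmax/NPG update. A minimal sketch (function name and toy values are illustrative, not from the paper's codebase):

```python
import numpy as np

def pmd_step_neg_entropy(pi, q, eta):
    """One tabular PMD step under the negative-entropy mirror map.

    With h(pi) = sum_a pi(a) log pi(a), the mirror-descent update
    pi' = argmax_p <q, p> - (1/eta) * B_h(p, pi)
    has the closed form pi'(a) ∝ pi(a) * exp(eta * q(a)).
    """
    logits = np.log(pi) + eta * q
    logits -= logits.max()          # subtract max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()    # renormalize to a distribution

# Toy usage: three actions, uniform initial policy.
pi = np.ones(3) / 3
q = np.array([1.0, 0.0, -1.0])
pi = pmd_step_neg_entropy(pi, q, eta=0.5)
```

The paper's contribution is to replace this fixed mirror map with a parameterized one learned via evolution strategies; the closed form above then no longer applies and the update must be computed from the learned map's gradient.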