Learning mirror maps in policy mirror descent

Authors: Carlo Alfano, Sebastian Towers, Silvia Sapora, Chris Lu, Patrick Rebeschini

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields sub-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e. Grid-World... We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments.
Researcher Affiliation | Academia | Carlo Alfano (Department of Statistics, University of Oxford); Sebastian Towers (FLAIR, University of Oxford); Silvia Sapora (FLAIR and Department of Statistics, University of Oxford); Chris Lu (FLAIR, University of Oxford); Patrick Rebeschini (Department of Statistics, University of Oxford). Correspondence to EMAIL.
Pseudocode | No | The paper describes its algorithms (PMD and AMPO) through mathematical formulations (equations 2, 5, 6, and 7) and textual explanations, but it does not include any distinct pseudocode blocks or algorithm listings with structured steps.
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: The implementation of our experiments can be found at https://github.com/c-alfano/Learning-mirror-maps.
Open Datasets | Yes | We first focus on a tabular environment, i.e. Grid-World (Oh et al., 2020)... We then consider two non-tabular settings, i.e. the Basic Control Suite and the MinAtar suite... Lastly, we tackle continuous control tasks in MuJoCo (Todorov et al., 2012).
Dataset Splits | Yes | We learn a single mirror map by training PMD on a continuous distribution of Grid-World environments, and test PMD with the learned mirror map on five held-out configurations from previous publications (Oh et al., 2020; Chevalier-Boisvert et al., 2024) and on 256 randomly sampled configurations. Our last result consists of testing each learned mirror map across the other environments we consider.
Hardware Specification | Yes | We run on four A40 GPUs, and the optimization process takes roughly 12 hours. We run on eight GTX 1080Ti GPUs, and the optimization process takes roughly 48 hours for a single environment. We run on eight A40 GPUs, and the optimization process takes roughly 24 hours.
Software Dependencies | No | The training procedure is implemented in JAX, using evosax (Lange, 2022a) for the evolution... The whole training procedure is implemented in JAX, using gymnax environments (Lange, 2022b) and evosax (Lange, 2022a) for the evolution. The paper mentions software such as JAX, evosax, gymnax, and Optuna, but does not provide version numbers for these components.
Experiment Setup | Yes | We perform a simple grid search over the hyperparameters to maximize performance for the negative-entropy and ℓ2-norm mirror maps... We optimize the hyperparameters of AMPO for the negative-entropy mirror map for each suite, using the hyperparameter tuning framework Optuna (Akiba et al., 2019). We report the chosen hyperparameters in Appendix E.
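For context on the Research Type row: the paper's NPG baseline is policy mirror descent with the negative-entropy mirror map, whose tabular update has a closed form. A minimal illustrative sketch (the function and variable names here are ours, not the paper's; shown for a single state):

```python
import numpy as np

def pmd_negative_entropy_step(policy, q_values, step_size):
    """One tabular PMD update under the negative-entropy mirror map.

    The mirror-descent step
        pi_{t+1} = argmax_pi <q, pi> - (1/eta) D_h(pi, pi_t)
    with h(pi) = sum_a pi(a) log pi(a) has the closed form
        pi_{t+1}(a)  proportional to  pi_t(a) * exp(eta * q(a)).
    """
    logits = np.log(policy) + step_size * q_values
    logits -= logits.max()              # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Example: the higher-value action gains probability mass.
pi = np.array([0.5, 0.5])
q = np.array([1.0, 0.0])
pi_next = pmd_negative_entropy_step(pi, q, step_size=1.0)
```

Each update is a multiplicative-weights step, which is why this special case coincides with natural policy gradient.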
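The evosax-based evolution mentioned under Software Dependencies follows the standard ask/evaluate/tell pattern of evolution strategies. Since no library versions are pinned, here is a library-agnostic NumPy sketch of an OpenAI-style ES loop; the toy quadratic fitness stands in for the paper's PMD returns, and all names are illustrative:

```python
import numpy as np

def open_es(fitness_fn, init_params, popsize=50, sigma=0.1,
            lr=0.05, generations=200, seed=0):
    """Minimal OpenAI-style evolution strategy: sample antithetic
    Gaussian perturbations, estimate a search gradient from
    fitness-weighted noise, and take an ascent step on the mean."""
    rng = np.random.default_rng(seed)
    theta = np.array(init_params, dtype=float)
    for _ in range(generations):
        eps = rng.standard_normal((popsize, theta.size))
        eps = np.concatenate([eps, -eps])        # antithetic pairs
        fitness = np.array([fitness_fn(theta + sigma * e) for e in eps])
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        grad = (fitness[:, None] * eps).sum(0) / (eps.shape[0] * sigma)
        theta += lr * grad                       # gradient *ascent*
    return theta

# Toy check: maximize -||x - 3||^2, whose optimum is x = [3, 3].
best = open_es(lambda x: -np.sum((x - 3.0) ** 2), np.zeros(2))
```

Antithetic sampling and fitness standardization are the usual variance-reduction choices; evosax packages the same loop behind a strategy interface.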
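The simple grid search described in the Experiment Setup row can be sketched with the standard library alone. The hyperparameter names and the toy objective below are illustrative placeholders, not the paper's actual search space:

```python
import itertools

def grid_search(evaluate, grid):
    """Exhaustively evaluate every hyperparameter combination and
    return the best-scoring configuration (higher is better)."""
    best_score, best_cfg = float("-inf"), None
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Toy objective standing in for average return under a config.
grid = {"step_size": [0.01, 0.1, 1.0], "entropy_coef": [0.0, 0.01]}
best_cfg, best_score = grid_search(
    lambda c: -abs(c["step_size"] - 0.1) - c["entropy_coef"], grid)
```

For the larger AMPO searches the paper switches from this exhaustive pattern to Optuna's sampler-driven trials, which scale better with the number of hyperparameters.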