Learning Task-Aware Abstract Representations for Meta-Reinforcement Learning

Authors: Louk van Remmerden, Zhao Yang, Shujian Yu, Mark Hoogendoorn, Vincent François-Lavet

TMLR 2025

Reproducibility Variables (each entry lists the variable, its assessed result, and the supporting LLM response excerpt from the paper)
Research Type: Experimental
  Excerpt: "We provide theoretical guarantees alongside empirical results, showing strong generalization performance across classical control and robotic meta-RL benchmarks, on par with state-of-the-art meta-RL methods and significantly better than non-meta RL approaches." (Section 5, Experiments): "Our evaluation consists of four complementary experimental sets, each designed to test a distinct aspect of EMERALD. First, we compare EMERALD with strong baselines on out-of-distribution tasks following the evaluation protocol of Lee et al. (2020). Second, we compare EMERALD to another set of baselines on the more recent Meta-World ML1 and ML10 suites (Yu et al., 2020). Third, we evaluate the effect of training on multiple environments by measuring performance on various tasks. Lastly, we perform two ablation studies: (i) we examine how the quality of the ARM affects policy learning, and (ii) we visualize the learned task-aware latent space to assess how effectively the policy exploits it."
Researcher Affiliation: Academia
  Excerpt: "Louk van Remmerden (EMAIL), Department of Computer Science, Vrije Universiteit Amsterdam; Zhao Yang (EMAIL), Department of Computer Science, Vrije Universiteit Amsterdam; Shujian Yu (EMAIL), Department of Computer Science, Vrije Universiteit Amsterdam; Mark Hoogendoorn (EMAIL), Department of Computer Science, Vrije Universiteit Amsterdam; Vincent François-Lavet (EMAIL), Department of Computer Science, Vrije Universiteit Amsterdam"
Pseudocode: Yes
  Excerpt: "Algorithm 1 shows the pseudocode for the ARM training process. Algorithm 2 shows the pseudocode for policy training (full version shown in Appendix D)."
Open Source Code: Yes
  Excerpt: "Code base can be found at https://github.com/ljsmalbil/EMERALD."
Open Datasets: Yes
  Excerpt: "Second, we compare EMERALD to another set of baselines on the more recent Meta-World ML1 and ML10 suites (Yu et al., 2020)."; "...across classical control and robotic meta-RL benchmarks..."
Dataset Splits: Yes
  Excerpt: "The learning algorithms are trained on a set of configurations and evaluated on previously unseen ones. For Cart Pole, for example, we train on pole lengths 1.0, 1.5, and 2.0 and test the out-of-distribution performance on poles of length 0.5 and 2.5. Following Lee et al. (2020), we evaluate all methods on unseen environments under two regimes (see Appendix J for complete specifications): (1) Moderate: unseen test tasks differ only slightly from the training distribution; (2) Extreme: unseen test tasks deviate substantially from the training tasks. ML10 includes ten distinct training tasks designed to assess generalization across five test task distributions."
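The Cart Pole split quoted above can be sketched as a minimal configuration illustration. This is not from the EMERALD code base; the `CartPoleConfig` class and `is_ood` helper are hypothetical names, used only to make the train/test separation concrete: the held-out pole lengths lie outside the training range, so evaluation is out-of-distribution rather than interpolation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CartPoleConfig:
    """One task configuration: a Cart Pole variant with a given pole length."""
    pole_length: float


# Split described in the paper: train on {1.0, 1.5, 2.0},
# evaluate on previously unseen lengths {0.5, 2.5}.
train_configs = [CartPoleConfig(l) for l in (1.0, 1.5, 2.0)]
test_configs = [CartPoleConfig(l) for l in (0.5, 2.5)]


def is_ood(cfg: CartPoleConfig, train: list[CartPoleConfig]) -> bool:
    """True if cfg's pole length falls outside the training range."""
    lengths = [c.pole_length for c in train]
    return cfg.pole_length < min(lengths) or cfg.pole_length > max(lengths)
```

Under this sketch, both held-out lengths (0.5 and 2.5) are out-of-distribution with respect to the training range [1.0, 2.0], matching the "Moderate"/"Extreme" evaluation idea of testing beyond the training task distribution.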
Hardware Specification: Yes
  Excerpt: "Experiments were conducted on an Apple M4 processor with 10 CPU cores and 10 GPU cores."
Software Dependencies: Yes
  Excerpt: "We implemented the EMERALD abstract representation model in PyTorch (Python 3.9). RL agents were implemented using Stable-Baselines3 (SB3) (Raffin et al., 2021) and CleanRL (Huang et al., 2022). While the conceptual design is the same, we modified the implementation to be compatible with Gymnasium v5, which has built-in support for MuJoCo-based environments."
Experiment Setup: Yes
  Excerpt:

  Table 10: Training Model Parameters
    Parameter           | Half Cheetah (Vol) | Half Cheetah (Dir) | Pendulum (Length) | Cart Pole (Length)
    Weight Decay        | 0.00001            | 0.00001            | 0.00001           | 0.00001
    Hidden Dim          | 128                | 128                | 32                | 32
    Latent Dim          | 10                 | 10                 | 2                 | 8
    Model Learning Rate | 0.0001             | 0.0001             | 0.0001            | 0.0001
    Hidden Dim ρ        | 128                | 128                | 8                 | 64
    Output Dim ρ        | 1                  | 1                  | 1                 | 1
    Context Dim         | 4                  | 4                  | 12                | 4
    Environment ID      | Half Cheetah       | Half Cheetah       | Pendulum          | Cart Pole
    # Envs              | 5                  | 2                  | 11                | 5

  Table 11: Training Agent Parameters
    Parameter                       | Half Cheetah (Vol) | Half Cheetah (Dir) | Pendulum (Length) | Cart Pole (Length)
    Policy Network LR               | 0.0001             | 0.0001             | 0.0001            | 0.0003
    Policy Hidden Dim               | 64                 | 64                 | 64                | 64
    Gamma                           | 0.95               | 0.95               | 0.99              | 0.99
    Reward Weight                   | 1                  | 1                  | 1                 | 1
    τ Loss Weight                   | 1                  | 1                  | 1                 | 1
    Regularization Weight           | 0.01               | 0.01               | 0.01              | 0.01
    Train Epochs                    | 300                | 300                | 3000              | 50
    Max Iterations                  | 1000               | 1000               | 1000              | 200
    History Length                  | 100                | 100                | 100               | 100
    Max Episode Steps               | 1000               | 1000               | 200               | 200
    Policy Iterations Learn         | 5,000,000          | 5,000,000          | 2,000,00          | 2,000,00
    Samples per Task (ARM Training) | 100,000            | 100,000            | 4,500             | 10,000
    Batch Size                      | 256                | 256                | 256               | 256
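For readers re-implementing the setup, one column of the tables above (Cart Pole, Length) can be transcribed into plain configuration dictionaries. The key names below are hypothetical, chosen only to mirror the table's parameter names; the "Policy Iterations Learn" entry is left out because its printed value ("2,000,00") appears truncated in the source.

```python
# Hypothetical transcription of the Cart Pole (Length) columns from
# Tables 10 and 11; key names are illustrative, not from the code base.
cartpole_model_cfg = {
    "weight_decay": 1e-5,
    "hidden_dim": 32,
    "latent_dim": 8,
    "model_learning_rate": 1e-4,
    "hidden_dim_rho": 64,
    "output_dim_rho": 1,
    "context_dim": 4,
    "num_envs": 5,
}

cartpole_agent_cfg = {
    "policy_network_lr": 3e-4,
    "policy_hidden_dim": 64,
    "gamma": 0.99,
    "reward_weight": 1,
    "tau_loss_weight": 1,
    "regularization_weight": 0.01,
    "train_epochs": 50,
    "max_iterations": 200,
    "history_length": 100,
    "max_episode_steps": 200,
    "samples_per_task_arm": 10_000,
    "batch_size": 256,
}
```

Keeping the ARM and agent parameters in separate dictionaries mirrors the paper's split between the representation-model training (Table 10) and policy training (Table 11).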