Efficient Cross-Episode Meta-RL

Authors: Gresa Shala, André Biedenkapp, Pierre Krack, Florian Walter, Josif Grabocka

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental — "Experimental results, obtained across various simulated tasks of the MuJoCo, Meta-World and ManiSkill benchmarks, indicate a significant improvement in learning efficiency and adaptability compared to the state of the art. Our approach enhances the agent's ability to generalize from limited data and paves the way for more robust and versatile AI systems."
Researcher Affiliation: Academia — 1 University of Freiburg, Germany; 2 University of Technology Nuremberg
Pseudocode: Yes — "We include the detailed algorithm for meta-training in Algorithm 2 and the algorithm for meta-testing in Algorithm 3. To be consistent in comparison to the baselines, we use the garage library (garage contributors, 2019) implementations of the PPO algorithm, the Gaussian actor, and the Gaussian critic."
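The row above names PPO as the underlying policy-optimization algorithm. As background, PPO's core is the clipped surrogate objective; the following is a minimal NumPy illustration of that standard objective, not the garage implementation the authors actually use:

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al., 2017).

    Returns the mean clipped objective (to be maximized). The ratio of new to
    old action probabilities is clipped to [1 - eps, 1 + eps], and the
    pessimistic minimum of the clipped and unclipped terms is taken.
    """
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()
```

With identical old and new log-probabilities the ratio is 1 and the objective reduces to the mean advantage; when the ratio drifts outside the clip range, the gradient incentive is capped, which is what keeps PPO updates conservative.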
Open Source Code: Yes — "To foster reproducibility, we provide our code at https://github.com/machinelearningnuremberg/ECET.git."
Open Datasets: Yes — "MuJoCo: We evaluate ECET and our baseline methods on MuJoCo (Todorov et al., 2012) locomotion tasks, which are widely used in the meta-RL literature. Specifically, we consider the AntDir and HalfCheetahDir tasks, where the agent is required to move either forwards or backwards. Additionally, we include the HalfCheetahVel task, where the agent must adapt to running at different target velocities. The maximum episode length for these tasks is 200 timesteps. Meta-World: To assess the performance of our cross-episode transformer approach, we use the Meta-World benchmark (Yu et al., 2019). ManiSkill: We additionally use the ManiSkill benchmark (Gu et al., 2023) to assess the performance of our proposed method on tasks where the state representation is an image. PickSingleYCB: the goal is to pick up a random object sampled from the YCB dataset (Calli et al., 2015) and move it to a random goal position."
Dataset Splits: Yes — "ML10 evaluates few-shot adaptation to 5 unseen test tasks after training on 10 tasks. ML45 is similar to, but more complex than, ML10, with 45 training and 5 test tasks. For ML1, the desired goal position is not provided as input, so the meta-RL algorithm needs to determine the location of the goal through trial and error. Similarly, for ML10 and ML45, task IDs are not provided as input, so the meta-RL algorithm needs to identify tasks from experience."
Hardware Specification: Yes — "We conducted all experiments on a compute cluster of NVIDIA A100 GPUs. We trained all methods for 10^7 steps in the MuJoCo environments, 5 x 10^7 steps in Meta-World, and 2.5 x 10^7 steps in ManiSkill."
Software Dependencies: No — "We use the garage library (garage contributors, 2019) implementations of the PPO algorithm, the Gaussian actor, and the Gaussian critic. We use the CNN architecture from Mnih et al. (2015b) to process RGB states for all methods."
Experiment Setup: Yes — Table 1: Hyperparameters across benchmarks (MuJoCo, Meta-World, ManiSkill; a single value is shared across benchmarks, paired values differ by benchmark):
  Number of Transitions (T): 5
  Number of Episodes (E): 2 / 25
  Number of Layers for IET: 2
  Number of Heads for IET: 16 / 4
  Number of Layers for CET: 2
  Number of Heads for CET: 16 / 4
  Minibatch Size: 256 / 32
  Policy Learning Rate: 3e-5 / 5e-5
  Critic Learning Rate: 3e-5 / 5e-5
"We trained all methods for 10^7 steps in the MuJoCo environments, 5 x 10^7 steps in Meta-World, and 2.5 x 10^7 steps in ManiSkill. We repeat each run for 5 different random seeds." Algorithm 2 (Meta-Training with ECET, using subroutines from Algorithm 1). Input: distribution of meta-training tasks T; history H (a FIFO queue) to store past transitions. Hyperparameters: number of episodes E to keep in history, sequence length T of transitions to input to the IET, max_epochs, meta_batch_size, mini_batch_size, and number of episodes k to collect for each task during training.
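Algorithm 2's history H is described only as a FIFO queue holding the most recent E episodes, from which sequences of T transitions are drawn as transformer input. A minimal sketch of such a buffer, under the assumption (hypothetical class and method names) that episodes are appended whole and the IET consumes the most recent T transitions:

```python
from collections import deque

class EpisodeHistory:
    """FIFO history H keeping the most recent E episodes.

    Hypothetical sketch of the buffer described in Algorithm 2; the names
    E and T follow Table 1 (number of episodes, sequence length).
    """

    def __init__(self, num_episodes_E):
        # deque with maxlen evicts the oldest episode automatically
        self.episodes = deque(maxlen=num_episodes_E)

    def add_episode(self, transitions):
        """Store one completed episode (a list of transitions)."""
        self.episodes.append(list(transitions))

    def last_transitions(self, seq_len_T):
        """Return the T most recent transitions across stored episodes,
        i.e. the sequence that would be fed to the intra-episode
        transformer (IET)."""
        flat = [t for ep in self.episodes for t in ep]
        return flat[-seq_len_T:]
```

For example, with E = 2, adding a third episode evicts the first, and `last_transitions(T)` then spans at most the two retained episodes; the actual ECET code may batch and encode transitions differently.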