Meta-learning Population-based Methods for Reinforcement Learning

Authors: Johannes Hog, Raghu Rajan, André Biedenkapp, Noor Awad, Frank Hutter, Vu Nguyen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section begins with details on our experimental setup, followed by the presentation of our research findings that address the following research questions. RQ1 How do our methods compare in performance to standard baselines and each other? ... We evaluated our approaches on two sets of RL environments, employing CARL (Benjamins et al., 2023) to generate several slightly different versions of each.
Researcher Affiliation | Collaboration | Johannes Hog EMAIL, University of Freiburg, Germany; Frank Hutter EMAIL, ELLIS Institute Tübingen, Germany & University of Freiburg, Germany; Vu Nguyen EMAIL, Amazon, Australia
Pseudocode | Yes | The pseudo-code for the PB2 algorithm is provided in Appendix A.2. Algorithm 1: Portfolio Construction; Algorithm 2: PB2; Algorithm 3: RGPE weighting.
Open Source Code | Yes | The code to reproduce our results is publicly available at https://github.com/automl/Meta PB2.
Open Datasets | Yes | We evaluated our approaches on two sets of RL environments, employing CARL (Benjamins et al., 2023) to generate several slightly different versions of each. The first set comprises the classic control environments (Brockman et al., 2016), specifically mountain_car, cart_pole, pendulum, and acrobot. We used these cheaper environments to set our methods' hyper-hyperparameters and to conduct the more compute-intensive experiments. The second set is Brax (Freeman et al., 2021), where CARL allowed us to modify 9 environments (see Appendix B.1).
Dataset Splits | No | The paper uses separate environment sets for hyper-hyperparameter tuning (classic control) and the main experiments (Brax), and repeats experiments over 10 seeds. However, it does not specify explicit training/validation/test splits for a static dataset; as is common in reinforcement learning, data is generated through interaction with the environments.
Hardware Specification | Yes | For classic control experiments, we used machines with Intel Xeon Gold 6242 processors at 2.80 GHz, whereas for Brax, we utilized machines with Intel Xeon E5-2630v4 processors at 2.2 GHz.
Software Dependencies | No | The paper mentions software such as PPO (Schulman et al., 2017), Ray (Liaw et al., 2018), the Autorank (Herbold, 2020) Python package, and CARL (Benjamins et al., 2023), but it does not provide specific version numbers for these components.
Experiment Setup | Yes | We repeat each of our experiments for 10 seeds and ensure that the initial random configurations are the same for each method given the same seed. Our population-based algorithms split the training into 15 equally sized perturbation intervals. At each perturbation interval, we evaluate every agent for 10 episodes during training. ... The batch size is fixed to 20,000 and 25,000 for classic control and Brax, respectively. The search space is shown in Table 1: learning rate (1e-5, 1e-3), lambda (0.9, 0.99), clip parameter (0.1, 0.5). For classic control, we trained the agents on each environment for 600,000 steps with a perturbation interval of 40,000 steps. The agents on the Brax environments were each trained for 3,000,000 steps with a perturbation interval of 200,000 steps.
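The population-based schedule described above (15 equally sized perturbation intervals, a population evaluated at each interval, and the Table 1 search space) can be sketched as a generic exploit/explore loop. This is a minimal illustration, not the paper's implementation: `train_step` and `evaluate` are placeholder callables, and the explore step shown is plain PBT-style random perturbation rather than PB2's GP-bandit suggestion.

```python
import math
import random

# Search space from Table 1; the learning rate is sampled log-uniformly
# (a common convention, assumed here rather than stated in the excerpt).
SEARCH_SPACE = {
    "lr": (1e-5, 1e-3),
    "lambda": (0.9, 0.99),
    "clip": (0.1, 0.5),
}

def sample_config(rng):
    """Draw one random configuration from the Table 1 bounds."""
    lr_lo, lr_hi = SEARCH_SPACE["lr"]
    return {
        "lr": 10 ** rng.uniform(math.log10(lr_lo), math.log10(lr_hi)),
        "lambda": rng.uniform(*SEARCH_SPACE["lambda"]),
        "clip": rng.uniform(*SEARCH_SPACE["clip"]),
    }

def pbt_loop(train_step, evaluate, population_size=4, intervals=15, seed=0):
    """Exploit/explore loop: train each member for one perturbation interval,
    score it (e.g. mean return over 10 evaluation episodes), then copy
    hyperparameters from the top half into the bottom half and perturb."""
    rng = random.Random(seed)
    population = [sample_config(rng) for _ in range(population_size)]
    scores = [float("-inf")] * population_size
    for _ in range(intervals):
        for i, cfg in enumerate(population):
            train_step(i, cfg)       # one perturbation interval of training
            scores[i] = evaluate(i)  # e.g. mean return over 10 episodes
        order = sorted(range(population_size),
                       key=lambda i: scores[i], reverse=True)
        top, bottom = order[: population_size // 2], order[population_size // 2:]
        for loser in bottom:
            winner = rng.choice(top)
            # Exploit: copy the winner's values; explore: perturb each one
            # by a random factor, clipped back into the search-space bounds.
            new_cfg = {}
            for k, v in population[winner].items():
                lo, hi = SEARCH_SPACE[k]
                new_cfg[k] = min(hi, max(lo, v * rng.choice([0.8, 1.25])))
            population[loser] = new_cfg
    return population, scores
```

With the step counts quoted above, the 15 intervals correspond to 40,000 training steps each for classic control (15 × 40,000 = 600,000) and 200,000 each for Brax (15 × 200,000 = 3,000,000).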
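Algorithm 3 above refers to RGPE weighting (the ranking-weighted Gaussian process ensemble of Feurer et al.). A minimal, model-agnostic sketch of the weighting rule follows; the `sample_fns` interface (one callable per base model, returning a joint posterior draw at the observed points) is illustrative and not the paper's code, and this simplified version omits RGPE's leave-one-out treatment of the target task.

```python
import random

def ranking_loss(pred, targets):
    """Count ordered pairs that the prediction ranks differently
    than the observed targets."""
    n = len(targets)
    loss = 0
    for i in range(n):
        for j in range(n):
            if i != j and (pred[i] < pred[j]) != (targets[i] < targets[j]):
                loss += 1
    return loss

def rgpe_weights(sample_fns, targets, n_samples=100, seed=0):
    """Each sample_fn draws one posterior sample at the observed points.
    A model's weight is the fraction of draws in which it achieves the
    smallest ranking loss (ties broken uniformly at random)."""
    rng = random.Random(seed)
    wins = [0] * len(sample_fns)
    for _ in range(n_samples):
        losses = [ranking_loss(fn(rng), targets) for fn in sample_fns]
        best = min(losses)
        winners = [i for i, l in enumerate(losses) if l == best]
        wins[rng.choice(winners)] += 1
    return [w / n_samples for w in wins]
```

Because the loss depends only on pairwise orderings, a base model that ranks configurations correctly keeps a high weight even when its predictions are miscalibrated in scale, which is the property that makes this weighting useful for transfer across related tasks.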