REvolve: Reward Evolution with Large Language Models using Human Feedback
Authors: Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi, Pedro Zuidberg Dos Martires
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines. |
| Researcher Affiliation | Academia | Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi & Pedro Zuidberg Dos Martires, Centre for Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden |
| Pseudocode | Yes | Algorithm 1 REvolve |
| Open Source Code | Yes | Our code is open-sourced at https://github.com/RishiHazra/Revolve/tree/main. |
| Open Datasets | No | The paper uses the high-fidelity AirSim simulator (Shah et al., 2018) and the MuJoCo simulator (Todorov et al., 2012) for experiments. While these are publicly available environments, the paper does not explicitly state that any specific dataset used or generated during the experiments is publicly available, nor does it provide concrete access information for such a dataset. |
| Dataset Splits | No | The paper conducts experiments in interactive simulation environments (AirSim, MuJoCo) using reinforcement learning. In such setups, data is generated through agent-environment interactions rather than drawn from a static dataset. The concept of fixed training/validation/test splits, as typically understood in supervised learning, therefore does not apply, and the paper does not specify any such splits for a pre-existing dataset. |
| Hardware Specification | Yes | Parallel training of a single generation on 16 NVIDIA A100 GPUs (40GB) consumed approximately 50 hours for the AirSim environment and 24 hours for the MuJoCo environments. |
| Software Dependencies | No | The paper mentions using the "GPT-4 Turbo (1106-preview) model" as a reward designer and the "Stable Baselines3 (SB3) library" for the SAC algorithm. However, it does not provide specific version numbers for Stable Baselines3 or for other software libraries, such as NumPy or SciPy, that are implicitly used in the provided code snippets. A fully reproducible dependency specification with version numbers is therefore not available. |
| Experiment Setup | Yes | For the evolutionary search, we set the number of generations to N = 7 and individuals per generation to K = 16. This mimics the training setup of (Ma et al., 2024a). The number of sub-populations was set to I = 13, and the mutation probability was set to pm = 0.5. For each generation, 16 policies were trained (one per reward function). In the AirSim environment, we trained Clipped Double Deep Q-Learning (Fujimoto et al., 2018) agents for 5 × 10^5 training steps per generation per agent. For the MuJoCo environments, we trained Soft Actor-Critic (Haarnoja et al., 2018) agents with 5 × 10^6 steps per generation per agent. Table 4: DDQN hyperparameters used during training for all AirSim environments for the autonomous driving task (learning rate, optimizer, gamma, initial epsilon, minimum epsilon, epsilon decay, batch size, tau, replay buffer size, alpha (PER), update frequency steps, beta, action space, image resolution, conv layers, dense units after flatten, sensor input dimensions). Key hyperparameters for SAC included a learning rate of 3 × 10^-4, a batch size of 256, a discount factor (γ) of 0.99, and an entropy coefficient (α) automatically adjusted during training. |
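The evolutionary-search settings quoted above (N = 7 generations, K = 16 individuals, mutation probability p_m = 0.5) can be illustrated with a minimal generic evolutionary loop. This is a sketch only: the `fitness` and `mutate` stubs are hypothetical placeholders, whereas REvolve itself scores candidates via human feedback on trained policies and uses an LLM to produce mutated reward functions.

```python
import random

N_GENERATIONS = 7   # N = 7 in the reported setup
POP_SIZE = 16       # K = 16 reward-function candidates per generation
P_MUTATE = 0.5      # p_m = 0.5

def fitness(individual):
    # Placeholder score. In REvolve this would be derived from human
    # feedback on the behavior of a policy trained with this candidate.
    return sum(individual)

def mutate(individual):
    # Placeholder variation. In REvolve an LLM edits the reward code.
    child = list(individual)
    i = random.randrange(len(child))
    child[i] += random.choice([-1, 1])
    return child

random.seed(0)
# Toy "reward functions" encoded as integer vectors for illustration.
population = [[random.randint(0, 9) for _ in range(4)] for _ in range(POP_SIZE)]

for gen in range(N_GENERATIONS):
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[: POP_SIZE // 2]          # keep the top half
    offspring = []
    while len(elite) + len(offspring) < POP_SIZE:
        parent = random.choice(elite)
        # Mutate with probability p_m, otherwise copy the parent unchanged.
        child = mutate(parent) if random.random() < P_MUTATE else list(parent)
        offspring.append(child)
    population = elite + offspring

best = max(population, key=fitness)
```

The paper's Algorithm 1 additionally partitions candidates into I = 13 sub-populations (an island-style structure omitted here for brevity).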