REvolve: Reward Evolution with Large Language Models using Human Feedback

Authors: Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi, Pedro Zuidberg Dos Martires

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines.
Researcher Affiliation | Academia | Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi & Pedro Zuidberg Dos Martires, Centre for Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden
Pseudocode | Yes | Algorithm 1: REvolve
Open Source Code | Yes | Our code is open-sourced at https://github.com/RishiHazra/Revolve/tree/main.
Open Datasets | No | The paper uses the high-fidelity AirSim simulator (Shah et al., 2018) and the MuJoCo simulator (Todorov et al., 2012) for experiments. While these are publicly available environments, the paper does not explicitly state that any specific dataset used or generated during the experiments is publicly available, nor does it provide concrete access information for such a dataset.
Dataset Splits | No | The paper conducts experiments in interactive simulation environments (AirSim, MuJoCo) using reinforcement learning. In such setups, data is generated through agent-environment interactions rather than drawn from a static dataset, so fixed training/validation/test splits, as typically understood in supervised learning, do not apply. The paper does not specify any such splits for a pre-existing dataset.
Hardware Specification | Yes | Parallel training of a single generation on 16 NVIDIA A100 GPUs (40 GB) consumed approximately 50 hours for the AirSim environment and 24 hours for the MuJoCo environments.
Software Dependencies | No | The paper mentions using the "GPT-4 Turbo (1106-preview) model" as a reward designer and the "Stable Baselines3 (SB3) library" for the SAC algorithm. However, it does not provide specific version numbers for Stable Baselines3 or for other libraries, such as NumPy or SciPy, that are implicitly used in the provided code snippets. A fully reproducible description with pinned versions is therefore not available.
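The version gap noted above could have been closed by recording installed package versions at experiment time. A minimal sketch using only the Python standard library (the package list here is illustrative, not taken from the paper):

```python
import importlib.metadata

def record_versions(packages):
    """Return a {package: version} mapping for reproducibility logs."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # Record absence explicitly rather than failing.
            versions[name] = "not installed"
    return versions

# Example: log the dependencies the review flags as unversioned.
print(record_versions(["stable-baselines3", "numpy", "scipy"]))
```

Dumping such a mapping alongside results (or committing a `pip freeze` output) would make the software environment reconstructible.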
Experiment Setup | Yes | For the evolutionary search, we set the number of generations to N = 7 and the number of individuals per generation to K = 16, mimicking the training setup of Ma et al. (2024a). The number of sub-populations was set to I = 13, and the mutation probability to p_m = 0.5. For each generation, 16 policies were trained (one per reward function). In the AirSim environment, we trained Clipped Double Deep Q-Learning (Fujimoto et al., 2018) agents for 5 × 10^5 training steps per generation per agent; for the MuJoCo environments, we trained Soft Actor-Critic (Haarnoja et al., 2018) agents for 5 × 10^6 steps per generation per agent. Table 4 lists the DDQN hyperparameters used during training for all AirSim environments in the autonomous driving task (learning rate, optimizer, gamma, epsilon initial/min/decay, batch size, tau, replay buffer size, alpha (PER), update frequency steps, beta, action space, image resolution, conv layers, dense units after flatten, and sensor input dimensions). Key SAC hyperparameters included a learning rate of 3 × 10^-4, a batch size of 256, a discount factor (γ) of 0.99, and an entropy coefficient (α) automatically adjusted during training.
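The evolutionary search the setup describes (N = 7 generations, K = 16 individuals, mutation probability p_m = 0.5) can be sketched as a generic loop. This is a hedged illustration, not the authors' implementation: `evaluate_fitness` and `mutate_reward` are hypothetical stand-ins for REvolve's human-feedback fitness scoring and LLM-driven reward editing, and the sub-population mechanism (I = 13) is omitted for brevity.

```python
import random

N_GENERATIONS = 7      # N = 7
POPULATION_SIZE = 16   # K = 16
P_MUTATION = 0.5       # p_m = 0.5

def evaluate_fitness(reward_fn):
    # Placeholder: in REvolve this comes from training a policy on the
    # candidate reward function and aggregating human feedback.
    return random.random()

def mutate_reward(reward_fn):
    # Placeholder: in REvolve an LLM rewrites the reward-function code.
    return reward_fn + 1

def evolve(initial_population):
    population = list(initial_population)
    for _ in range(N_GENERATIONS):
        # Keep the fitter half, then refill the population with
        # (possibly mutated) copies of the survivors.
        scored = sorted(population, key=evaluate_fitness, reverse=True)
        survivors = scored[: POPULATION_SIZE // 2]
        children = []
        while len(survivors) + len(children) < POPULATION_SIZE:
            parent = random.choice(survivors)
            if random.random() < P_MUTATION:
                children.append(mutate_reward(parent))
            else:
                children.append(parent)
        population = survivors + children
    return max(population, key=evaluate_fitness)

best = evolve(range(POPULATION_SIZE))
```

Per the setup above, each of the K reward functions in a generation would be evaluated by training a full RL agent (DDQN on AirSim, SAC on MuJoCo), which is what makes a generation cost tens of GPU-hours.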