Continuously evolving rewards in an open-ended environment

Authors: Richard M. Bailey

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulation experiments described here assess a candidate algorithm for dynamic updating of the reward function: RULE (Reward Updating through Learning and Expectation). The approach is tested in a simplified ecosystem-like setting in which experiments challenge the entities' survival, calling for significant behavioural change. The population of entities successfully demonstrates abandonment of an initially rewarded but ultimately detrimental behaviour, amplification of beneficial behaviour, and appropriate responses to novel items added to the environment.
Researcher Affiliation | Academia | Richard M. Bailey (EMAIL), Oxford University Centre for the Environment, University of Oxford, OX1 3QJ, UK
Pseudocode | Yes | Algorithm 1, "RULE: Reward Updating through Learning and Expectation"
Open Source Code | No | The paper mentions implementation "through the ML-Agents Toolkit within Unity (Juliani et al., 2020)" and states that "A Python API provides a low-level Python interface for communication between the simulation and relevant PyTorch libraries." However, it does not state that the authors' own code for the methodology described in the paper is publicly available, nor does it provide a link to it.
Open Datasets | No | The paper describes a "simplified ecosystem-like setting" and an "abstract ecosystem-like simulated environment" used for testing. It does not use or provide access to a pre-existing or newly generated public dataset in the traditional sense; experiments are instead run within the simulated environment.
Dataset Splits | No | The paper describes initial conditions for its simulations (e.g., "initial Ent population size and initial PP population size were set to 100") and parameters varied during experiments (e.g., a "linear increase in coin supply"), but it does not specify explicit training/validation/test splits of a collected dataset, as would be typical for data-driven experiments.
Hardware Specification | No | The paper states that "the simulation was built using the Unity 3D game engine, with physics updates implemented using Nvidia PhysX." While this names the software used for physics, the paper does not provide specific hardware details such as CPU models, GPU models, or other specifications of the machines used to run the simulations.
Software Dependencies | No | The paper mentions "Proximal policy optimisation, PPO (Schulman et al., 2017)", the "ML-Agents Toolkit within Unity (Juliani et al., 2020)", and "PyTorch libraries." However, it does not provide version numbers for these software components.
Experiment Setup | Yes | The following hyper-parameters were used during model training: hidden-layers=2x256; gamma=0.99 (discount factor for future rewards); batch-size=32 (number of experiences in each iteration of gradient descent); buffer-size=320000 (number of experiences collected before updating the policy); time-horizon=32 (number of steps of experience to collect per agent before adding to the experience buffer); learning-rate=0.000025 (gradient-descent rate, held constant); beta=0.0035 (strength of the entropy regularisation, which encourages exploration by adding randomness to the policy; held constant throughout learning); epsilon=0.15 (a control on the evolution rate of the policy during training, limiting the divergence between current and updated policies; held constant during training); lambda=0.99 (a regularisation parameter weighting the contributions from current and updated policies).
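The reported hyper-parameters can be collected into a single configuration structure for reference. The sketch below is a plain Python dict whose grouping and key names loosely follow ML-Agents trainer-configuration conventions; the grouping is illustrative and not taken from the paper, but the values are those reported above.

```python
# Sketch of the reported PPO training hyper-parameters.
# Key names and grouping are illustrative (loosely modelled on an
# ML-Agents trainer config); values are as reported in the paper.
ppo_config = {
    "hyperparameters": {
        "batch_size": 32,         # experiences per gradient-descent iteration
        "buffer_size": 320_000,   # experiences collected before a policy update
        "learning_rate": 2.5e-5,  # gradient-descent rate, held constant
        "beta": 0.0035,           # entropy-regularisation strength (exploration)
        "epsilon": 0.15,          # limits current-vs-updated policy divergence
        "lambd": 0.99,            # weighting of current vs updated policy
    },
    "network_settings": {
        "num_layers": 2,          # hidden layers
        "hidden_units": 256,      # units per hidden layer
    },
    "reward_signals": {
        "extrinsic": {"gamma": 0.99},  # discount factor for future rewards
    },
    "time_horizon": 32,           # steps per agent before adding to the buffer
}

# Sanity checks: the buffer should hold a whole number of batches, and the
# discount factor should lie in (0, 1].
hp = ppo_config["hyperparameters"]
assert hp["buffer_size"] % hp["batch_size"] == 0
assert 0 < ppo_config["reward_signals"]["extrinsic"]["gamma"] <= 1
```

Note that buffer-size is 10,000x the batch-size here, meaning each policy update draws on a large pool of collected experience relative to the per-iteration batch.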