Continuously evolving rewards in an open-ended environment
Authors: Richard M. Bailey
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation experiments described here assess a candidate algorithm for the dynamic updating of the reward function, RULE: Reward Updating through Learning and Expectation. The approach is tested in a simplified ecosystem-like setting where experiments challenge the entities' survival, calling for significant behavioural change. The population of entities successfully demonstrates the abandonment of an initially rewarded but ultimately detrimental behaviour, the amplification of beneficial behaviour, and appropriate responses to novel items added to the environment. |
| Researcher Affiliation | Academia | Richard M. Bailey (EMAIL), Oxford University Centre for the Environment, University of Oxford, OX1 3QJ, UK |
| Pseudocode | Yes | Algorithm 1 RULE: Reward Updating through Learning and Expectation |
| Open Source Code | No | The paper mentions "implemented through the ML-Agents Toolkit within Unity (Juliani et al., 2020)" and "A Python API provides a low-level Python interface for communication between the simulation and relevant PyTorch libraries." However, it does not explicitly state that the authors' own code for the methodology described in this paper is publicly available, nor does it provide a link to it. |
| Open Datasets | No | The paper describes a "simplified ecosystem-like setting" and an "abstract ecosystem-like simulated environment" used for testing. It does not use or provide access to a pre-existing or newly generated public dataset in the traditional sense; all experiments are run within the simulated environment. |
| Dataset Splits | No | The paper describes initial conditions for its simulations (e.g., "initial Ent population size and initial PP population size were set to 100") and varying parameters during experiments (e.g., "linear increase in coin supply"), but it does not specify explicit training/test/validation splits for a collected dataset as would be typical for data-driven experiments. |
| Hardware Specification | No | The paper states that "the simulation was built using the Unity 3D game engine, with physics updates implemented using NVIDIA PhysX." While this names the software used for physics, the paper does not provide specific hardware details such as CPU models, GPU models, or other specifications of the machines used to run the simulations. |
| Software Dependencies | No | The paper mentions "Proximal policy optimisation, PPO (Schulman et al., 2017)", the "ML-Agents Toolkit within Unity (Juliani et al., 2020)", and "PyTorch libraries." However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The following hyper-parameters were used during model training: hidden-layers=2x256, gamma=0.99 (discount factor for future rewards), batch-size=32 (number of experiences in each iteration of gradient descent), buffer-size=320000 (the number of experiences collected before updating the policy), time-horizon=32 (the number of steps of experience to collect per agent before adding to the experience buffer), learning-rate=0.000025 (gradient descent rate, held constant), beta=0.0035 (strength of the entropy regularisation, which ensures exploration by adding randomness to the policy; held constant throughout learning), epsilon=0.15 (a control on the evolution rate of the policy during training, limiting the divergence between current and updated policies; held constant during training), lambda=0.99 (a regularisation parameter, weighting the contributions from current and updated policies). |
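Since the paper reports training through the ML-Agents Toolkit, the hyper-parameters listed above map naturally onto an ML-Agents trainer configuration file. The sketch below is a reconstruction, not the authors' published configuration: the behavior name `Ent` is hypothetical, and the schema assumed is the standard ML-Agents release-format YAML (where `lambd` is the GAE lambda and `gamma` sits under the extrinsic reward signal).

```yaml
# Hypothetical ML-Agents trainer config reconstructed from the
# hyper-parameters reported in the paper (behavior name is assumed).
behaviors:
  Ent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 32            # experiences per gradient-descent iteration
      buffer_size: 320000       # experiences collected before a policy update
      learning_rate: 0.000025   # held constant
      learning_rate_schedule: constant
      beta: 0.0035              # entropy-regularisation strength, held constant
      epsilon: 0.15             # PPO clip range, held constant
      lambd: 0.99               # regularisation weighting current vs. updated policy
    network_settings:
      num_layers: 2             # hidden-layers = 2 x 256
      hidden_units: 256
    reward_signals:
      extrinsic:
        gamma: 0.99             # discount factor for future rewards
        strength: 1.0
    time_horizon: 32            # steps collected per agent before buffering
```

Note the large buffer-to-batch ratio (320000 / 32): under this reading, each policy update draws on a very large pool of collected experience relative to the mini-batch size.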