Continuously evolving rewards in an open-ended environment

Authors: Richard M. Bailey

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulation experiments described here assess a candidate algorithm for dynamic updating of the reward function: RULE (Reward Updating through Learning and Expectation). The approach is tested in a simplified ecosystem-like setting in which experiments challenge the entities' survival, calling for significant behavioural change. The population of entities successfully demonstrates abandonment of an initially rewarded but ultimately detrimental behaviour, amplification of beneficial behaviour, and appropriate responses to novel items added to the environment.
Researcher Affiliation | Academia | Richard M. Bailey (EMAIL), Oxford University Centre for the Environment, University of Oxford, OX1 3QJ, UK
Pseudocode | Yes | Algorithm 1, "RULE: Reward Updating through Learning and Expectation"
Open Source Code | No | The paper mentions implementation "through the ML-Agents Toolkit within Unity (Juliani et al., 2020)" and states that "A Python API provides a low-level Python interface for communication between the simulation and relevant PyTorch libraries." However, it does not state that the authors' own code for the methodology described in the paper is publicly available, nor does it provide a link to it.
Open Datasets | No | The paper describes a "simplified ecosystem-like setting" and an "abstract ecosystem-like simulated environment" used for testing. It does not use or provide access to a pre-existing or newly generated public dataset in the traditional sense; experiments are instead run within the simulated environment.
Dataset Splits | No | The paper describes initial conditions for its simulations (e.g., "initial Ent population size and initial PP population size were set to 100") and parameters varied during experiments (e.g., a "linear increase in coin supply"), but it does not specify explicit training/validation/test splits of a collected dataset, as would be typical for data-driven experiments.
Hardware Specification | No | The paper states that "the simulation was built using the Unity 3D game engine, with physics updates implemented using Nvidia PhysX." While this names the software used for physics, the paper does not provide specific hardware details such as CPU models, GPU models, or other specifications of the machines used to run the simulations.
Software Dependencies | No | The paper mentions "Proximal policy optimisation, PPO (Schulman et al., 2017)", the "ML-Agents Toolkit within Unity (Juliani et al., 2020)", and "PyTorch libraries." However, it does not provide version numbers for these software components.
Experiment Setup | Yes | The following hyper-parameters were used during model training: hidden-layers=2x256; gamma=0.99 (discount factor for future rewards); batch-size=32 (number of experiences in each iteration of gradient descent); buffer-size=320000 (number of experiences collected before updating the policy); time-horizon=32 (number of steps of experience to collect per agent before adding to the experience buffer); learning-rate=0.000025 (gradient-descent rate, held constant); beta=0.0035 (strength of the entropy regularisation, which encourages exploration by adding randomness to the policy; held constant throughout learning); epsilon=0.15 (a control on the evolution rate of the policy during training, limiting the divergence between current and updated policies; held constant during training); lambda=0.99 (a regularisation parameter weighting the contributions from current and updated policies).
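The reported hyper-parameters can be collected into a single configuration structure for reference. The sketch below is a plain Python dict whose grouping and key names loosely follow ML-Agents trainer-configuration conventions; the grouping is illustrative and not taken from the paper, but the values are those reported above.

```python
# Sketch of the reported PPO training hyper-parameters.
# Key names and grouping are illustrative (loosely modelled on an
# ML-Agents trainer config); values are as reported in the paper.
ppo_config = {
    "hyperparameters": {
        "batch_size": 32,         # experiences per gradient-descent iteration
        "buffer_size": 320_000,   # experiences collected before a policy update
        "learning_rate": 2.5e-5,  # gradient-descent rate, held constant
        "beta": 0.0035,           # entropy-regularisation strength (exploration)
        "epsilon": 0.15,          # limits current-vs-updated policy divergence
        "lambd": 0.99,            # weighting of current vs updated policy
    },
    "network_settings": {
        "num_layers": 2,          # hidden layers
        "hidden_units": 256,      # units per hidden layer
    },
    "reward_signals": {
        "extrinsic": {"gamma": 0.99},  # discount factor for future rewards
    },
    "time_horizon": 32,           # steps per agent before adding to the buffer
}

# Sanity checks: the buffer should hold a whole number of batches, and the
# discount factor should lie in (0, 1].
hp = ppo_config["hyperparameters"]
assert hp["buffer_size"] % hp["batch_size"] == 0
assert 0 < ppo_config["reward_signals"]["extrinsic"]["gamma"] <= 1
```

Note that buffer-size is 10,000x the batch-size here, meaning each policy update draws on a large pool of collected experience relative to the per-iteration batch.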