Towards shutdownable agents via stochastic choice

Authors: Elliott Thornley, Alexander Roman, Christos Ziakas, Louis Thomson, Leyton Ho

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose evaluation metrics for usefulness and neutrality. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be useful and neutral. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be useful and neutral.
Researcher Affiliation | Academia | Elliott Thornley (Massachusetts Institute of Technology, EMAIL); Alexander Roman (New College of Florida, EMAIL); Christos Ziakas (Imperial College London, EMAIL); Leyton Ho (Brown University); Louis Thomson (Independent)
Pseudocode | No | The paper describes the DReST reward function and references the REINFORCE algorithm, but it does not present any formal pseudocode blocks or structured algorithms in the text or appendices.
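Since the paper presents no pseudocode, the referenced algorithm can only be sketched from its citation. Below is a minimal, hedged illustration of tabular REINFORCE (Williams, 1992) in general form. This is not the authors' implementation: no code is released, the environment and reward values here are stand-ins for the gridworld/DReST setup, and the softmax preference table is an assumption made for differentiability (the paper reports using an ϵ-greedy policy for exploration).

```python
import math

def softmax(prefs):
    """Convert a list of action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_episode(logits, episode, alpha=0.1, gamma=0.95):
    """One tabular REINFORCE update from a list of (state, action, reward) steps.

    `logits` maps state -> list of action preferences (the tabular policy).
    Works backwards through the episode, accumulating the discounted return G
    and nudging each visited (state, action) preference along the log-policy
    gradient, scaled by G.
    """
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        probs = softmax(logits[state])
        # Gradient of log softmax w.r.t. each preference:
        # 1[a == action] - pi(a | state).
        for a in range(len(probs)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[state][a] += alpha * G * grad
    return logits
```

After an episode in which action 1 was rewarded in state 0, the update raises that action's preference relative to the alternative, which is the qualitative behavior the algorithm is meant to produce.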
Open Source Code | No | The paper does not explicitly state that source code is released, nor does it link to a code repository or mention code in supplementary materials.
Open Datasets | No | The paper uses custom-designed "gridworlds containing coins and a shutdown-delay button" as environments for its experiments. Figure 2 shows an "Example gridworld," and Appendix E shows "a varied collection of eight gridworlds." These are custom environments/datasets, but no public access information (link, DOI, citation) is provided for them.
Dataset Splits | No | The paper describes the training process for its reinforcement-learning agents: "We train our DReST agents with 64 mini-episodes in each of 2,048 meta-episodes, for a total of 131,072 mini-episodes," and "We train 10 agents using the default reward function (blue) and 10 agents using the DReST reward function (orange)." These figures describe episode and agent counts, not train/validation/test splits of a static dataset; such splits are standard in supervised learning but do not apply in this form to reinforcement-learning environments.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for the experiments.
Software Dependencies | No | The paper mentions using "a tabular version of the REINFORCE algorithm (Williams, 1992)" but does not specify any other software libraries, frameworks, or version numbers needed for replication.
Experiment Setup | Yes | We train our DReST agents with 64 mini-episodes in each of 2,048 meta-episodes, for a total of 131,072 mini-episodes. We choose λ = 0.9 for the base of the DReST discount factor, and γ = 0.95 for the temporal discount factor. We exponentially decay the learning rate from 0.25 to 0.01 over the course of 65,536 mini-episodes. We use an ϵ-greedy policy to avoid entropy collapse, and exponentially decay ϵ from 0.5 to 0.001 over the course of 65,536 mini-episodes.
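The decay schedules quoted above can be made concrete with a small sketch. The paper states only the endpoints and the horizon ("exponentially decay ... from 0.25 to 0.01" and "from 0.5 to 0.001," each over 65,536 mini-episodes), so the geometric interpolation below, and the `exp_decay` helper name, are assumptions about the functional form rather than the authors' actual code.

```python
DECAY_STEPS = 65_536  # horizon stated in the paper's setup

def exp_decay(start: float, end: float, step: int, total: int = DECAY_STEPS) -> float:
    """Geometric (exponential) interpolation from `start` to `end` over `total` steps.

    At step 0 the value is `start`; at `step >= total` it is held at `end`.
    """
    frac = min(step, total) / total
    return start * (end / start) ** frac

# Learning rate schedule: 0.25 -> 0.01 over 65,536 mini-episodes.
lr_start = exp_decay(0.25, 0.01, 0)
lr_end = exp_decay(0.25, 0.01, DECAY_STEPS)

# Epsilon schedule: 0.5 -> 0.001 over 65,536 mini-episodes.
eps_start = exp_decay(0.5, 0.001, 0)
eps_end = exp_decay(0.5, 0.001, DECAY_STEPS)
```

One property of a geometric schedule worth noting: the value halves at regular step intervals, so early mini-episodes see much faster absolute decay than later ones, which matches the usual motivation for exponential (rather than linear) annealing of exploration.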