MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Authors: Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study MONA empirically in three settings that model different misalignment failure modes, including 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering. |
| Researcher Affiliation | Industry | Google DeepMind. Core contributor; senior author. Correspondence to: EMAIL, EMAIL. |
| Pseudocode | No | The paper describes the MONA method using mathematical formulas (Eq 1, 2, 3) and descriptive text for its implementation details, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | We provide open source code for the Camera Dropbox experiments at https://github.com/googledeepmind/mona, which allows replication of these experiments and can serve as a reference implementation of MONA in tabular RL and policy gradient algorithms. |
| Open Datasets | Yes | The dataset is based on Mostly Basic Python Problems (Austin et al., 2021) but heavily cleaned and modified to remove ambiguity as described in Appendix D.1. |
| Dataset Splits | Yes | The resulting dataset contains 1149 samples of (problem, test list, solution) which we split into a train set of 649 and an evaluation set of 500. |
| Hardware Specification | No | The paper mentions using specific LLM models like 'gemini-1.5-flash-002' and 'gemini-1.5-pro-001' and states that the agent is available via 'Google Cloud Vertex API', but it does not specify any particular GPU models, CPU types, or other hardware configurations used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'stable baselines3 library (Raffin et al., 2021)' for PPO implementation, but does not provide a specific version number for this library. While it names LLM models with specific versions (e.g., 'gemini-1.5-flash-002'), these are models rather than general software dependencies with version numbers. |
| Experiment Setup | Yes | We train with a learning rate of 1e-6 for the policy LLM and 1e-5 for a value estimator LLM. We use a batch size of 16 and a non-prioritized replay buffer of 8 trajectories. |
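Since the paper describes MONA with formulas rather than pseudocode, the following is a minimal sketch of the core idea in tabular RL: the agent optimizes only an immediate reward that combines the environment reward with a non-myopic approval signal, using no bootstrapped future-value term (effectively gamma = 0). The toy chain environment, the `approval` shaping function, and all hyperparameters here are assumptions for illustration, not the paper's settings.

```python
import numpy as np

# Toy sketch of MONA-style myopic optimization (hypothetical environment).
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    # Toy dynamics: action 1 moves right along a chain, action 0 stays put.
    s2 = min(s + a, n_states - 1)
    env_reward = 1.0 if s2 == n_states - 1 else 0.0
    return s2, env_reward

def approval(s, a):
    # Stand-in for the overseer's non-myopic approval, folded into the
    # immediate reward; this shaping term is an assumption, not the paper's.
    return 0.1 * a

alpha, epsilon = 0.5, 0.2
for episode in range(500):
    s = 0
    for _ in range(10):
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r_env = step(s, a)
        r = r_env + approval(s, a)
        # Myopic update: no gamma * max Q[s2] term, so the agent never
        # optimizes multi-step returns on its own.
        Q[s, a] += alpha * (r - Q[s, a])
        s = s2

print(Q.argmax(axis=1))  # greedy action per state
```

The key design point is the update rule: because the target is the immediate approved reward only, any foresight must come through the approval signal rather than the agent's own multi-step optimization, which is the mechanism MONA uses to mitigate multi-step reward hacking.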
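The reported dataset split (1149 cleaned MBPP-style samples into 649 train and 500 eval) can be sketched as follows. The sample contents and the shuffle seed are placeholders; the paper does not specify a seed.

```python
import random

# Placeholder samples standing in for the cleaned (problem, test list,
# solution) triples described in the paper.
samples = [{"problem": f"p{i}", "tests": [], "solution": ""} for i in range(1149)]

rng = random.Random(0)  # seed is an assumption for reproducibility
rng.shuffle(samples)

train, eval_set = samples[:649], samples[649:]
print(len(train), len(eval_set))
```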