MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Authors: Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study MONA empirically in three settings that model different misalignment failure modes, including 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering. |
| Researcher Affiliation | Industry | Google DeepMind. Core contributor; senior author. Correspondence to: EMAIL, EMAIL. |
| Pseudocode | No | The paper describes the MONA method using mathematical formulas (Eq 1, 2, 3) and descriptive text for its implementation details, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | We provide open source code for the Camera Dropbox experiments at https://github.com/googledeepmind/mona, which allows replication of these experiments and can serve as a reference implementation of MONA in tabular RL and policy gradient algorithms. |
| Open Datasets | Yes | The dataset is based on Mostly Basic Python Problems (Austin et al., 2021) but heavily cleaned and modified to remove ambiguity as described in Appendix D.1. |
| Dataset Splits | Yes | The resulting dataset contains 1149 samples of (problem, test list, solution) which we split into a train set of 649 and an evaluation set of 500. |
| Hardware Specification | No | The paper mentions using specific LLM models like 'gemini-1.5-flash-002' and 'gemini-1.5-pro-001' and states that the agent is available via 'Google Cloud Vertex API', but it does not specify any particular GPU models, CPU types, or other hardware configurations used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'stable baselines3 library (Raffin et al., 2021)' for PPO implementation, but does not provide a specific version number for this library. While it names LLM models with specific versions (e.g., 'gemini-1.5-flash-002'), these are models rather than general software dependencies with version numbers. |
| Experiment Setup | Yes | We train with a learning rate of 1e-6 for the policy LLM and 1e-5 for a value estimator LLM. We use a batch size of 16 and a non-prioritized replay buffer of 8 trajectories. |
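Since the paper describes MONA with formulas rather than pseudocode, the following is a minimal sketch of the core idea in tabular RL: the agent optimizes only an immediate reward that combines the environment reward with a non-myopic approval signal, using no bootstrapped future-value term (effectively gamma = 0). The toy chain environment, the `approval` shaping function, and all hyperparameters here are assumptions for illustration, not the paper's settings.

```python
import numpy as np

# Toy sketch of MONA-style myopic optimization (hypothetical environment).
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    # Toy dynamics: action 1 moves right along a chain, action 0 stays put.
    s2 = min(s + a, n_states - 1)
    env_reward = 1.0 if s2 == n_states - 1 else 0.0
    return s2, env_reward

def approval(s, a):
    # Stand-in for the overseer's non-myopic approval, folded into the
    # immediate reward; this shaping term is an assumption, not the paper's.
    return 0.1 * a

alpha, epsilon = 0.5, 0.2
for episode in range(500):
    s = 0
    for _ in range(10):
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r_env = step(s, a)
        r = r_env + approval(s, a)
        # Myopic update: no gamma * max Q[s2] term, so the agent never
        # optimizes multi-step returns on its own.
        Q[s, a] += alpha * (r - Q[s, a])
        s = s2

print(Q.argmax(axis=1))  # greedy action per state
```

The key design point is the update rule: because the target is the immediate approved reward only, any foresight must come through the approval signal rather than the agent's own multi-step optimization, which is the mechanism MONA uses to mitigate multi-step reward hacking.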
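The reported dataset split (1149 cleaned MBPP-style samples into 649 train and 500 eval) can be sketched as follows. The sample contents and the shuffle seed are placeholders; the paper does not specify a seed.

```python
import random

# Placeholder samples standing in for the cleaned (problem, test list,
# solution) triples described in the paper.
samples = [{"problem": f"p{i}", "tests": [], "solution": ""} for i in range(1149)]

rng = random.Random(0)  # seed is an assumption for reproducibility
rng.shuffle(samples)

train, eval_set = samples[:649], samples[649:]
print(len(train), len(eval_set))
```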