Moral Alignment for LLM Agents

Authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

ICLR 2025

Reproducibility checklist (variable, result, and supporting excerpt):

Research Type: Experimental
  We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments.

Researcher Affiliation: Academia
  Elizaveta Tennant (University College London, University of Bologna); Stephen Hailes (University College London); Mirco Musolesi (University College London, University of Bologna).

Pseudocode: No
  The paper describes the fine-tuning methodology and reward functions in text and provides figures illustrating prompts, but it does not include explicitly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
  Code (fine-tuning and analysis): https://github.com/liza-tennant/LLM_morality

Open Datasets: Yes
  We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment.

Dataset Splits: No
  The paper describes training over T episodes and evaluating over 10 episodes, but it does not provide traditional dataset splits (e.g., train/validation/test percentages or counts) for a fixed dataset, since the data is generated dynamically through game play.

Hardware Specification: Yes
  All training was performed on a single A100 or V100 GPU with up to 40GB VRAM.

Software Dependencies: Yes
  We used the following versions of the key Python packages: peft 0.11.1, transformers 4.42.3.

Experiment Setup: Yes
  We run PPO training for T = 1000 episodes for each fine-tuning variation, using batch sizes of N = 3 and N = 5 for LLM vs. LLM and LLM vs. TFT training, respectively... We use 4-bit quantization LoRA with rank 64... and gradient accumulation with 4 steps... In terms of reward parameters, we set ξ = 3 and R_illegal = 6.
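The Deontological and Utilitarian reward schemes described above can be sketched for the IPD. The payoff values (T=5, R=3, P=1, S=0) and the exact penalty for defecting against a cooperator are assumptions based on standard formulations, not values taken from the paper:

```python
# Sketch of moral rewards on the Iterated Prisoner's Dilemma (IPD).
# Standard IPD payoff matrix (T=5, R=3, P=1, S=0) -- assumed values;
# the paper may parameterize the game differently.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def game_reward(my_action: str, opp_action: str) -> int:
    """Selfish (game) payoff for the focal agent."""
    return PAYOFFS[(my_action, opp_action)][0]

def deontological_reward(my_action: str, opp_prev_action: str,
                         penalty: int = -3) -> int:
    """Action-based rule: penalize defecting against an opponent who
    cooperated on the previous move (penalty value is hypothetical)."""
    return penalty if (my_action == "D" and opp_prev_action == "C") else 0

def utilitarian_reward(my_action: str, opp_action: str) -> int:
    """Consequence-based reward: collective payoff of both players."""
    p_me, p_opp = PAYOFFS[(my_action, opp_action)]
    return p_me + p_opp
```

The action-based rule only inspects the agent's own move given the opponent's history, while the utilitarian reward depends on the joint outcome, mirroring the action/consequence distinction drawn in the paper.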
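The reward parameters ξ = 3 and R_illegal = 6 from the setup row suggest a shaped PPO reward. A minimal sketch, assuming ξ scales the moral component additively and R_illegal is applied as a flat penalty when the model emits an invalid (non-action) token; both of these combination choices are assumptions for illustration:

```python
def shaped_reward(game_r: float, moral_r: float, legal: bool,
                  xi: float = 3.0, r_illegal: float = 6.0) -> float:
    """Combine game payoff and moral reward for PPO fine-tuning.

    legal: whether the LLM's output parsed as a valid game action.
    xi: weight on the moral component (paper uses xi = 3).
    r_illegal: penalty magnitude for invalid outputs (paper uses 6).
    How the components are combined is an assumption, not the
    paper's exact formula.
    """
    if not legal:
        return -r_illegal
    return game_r + xi * moral_r
```

With these assumed semantics, an illegal output always costs -6, while a legal move earns its game payoff plus three times its moral reward.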