Moral Alignment for LLM Agents

Authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

ICLR 2025

Reproducibility checklist (variable, result, and supporting excerpt):

Research Type: Experimental
  We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments.

Researcher Affiliation: Academia
  Elizaveta Tennant (University College London, University of Bologna); Stephen Hailes (University College London); Mirco Musolesi (University College London, University of Bologna).

Pseudocode: No
  The paper describes the fine-tuning methodology and reward functions in text and provides figures illustrating prompts, but it does not include explicitly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
  Code (fine-tuning and analysis): https://github.com/liza-tennant/LLM_morality

Open Datasets: Yes
  We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment.

Dataset Splits: No
  The paper describes training over T episodes and evaluating over 10 episodes, but it does not provide traditional dataset splits (e.g., train/validation/test percentages or counts) for a fixed dataset, since the data is generated dynamically through game play.

Hardware Specification: Yes
  All training was performed on a single A100 or V100 GPU with up to 40GB VRAM.

Software Dependencies: Yes
  We used the following versions of the key Python packages: peft 0.11.1, transformers 4.42.3.

Experiment Setup: Yes
  We run PPO training for T = 1000 episodes for each fine-tuning variation, using batch sizes of N = 3 and N = 5 for LLM vs. LLM and LLM vs. TFT training, respectively... We use 4-bit quantization LoRA with rank 64... and gradient accumulation with 4 steps... In terms of reward parameters, we set ξ = 3 and R_illegal = 6.
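The Deontological and Utilitarian reward schemes described above can be sketched for the IPD. The payoff values (T=5, R=3, P=1, S=0) and the exact penalty for defecting against a cooperator are assumptions based on standard formulations, not values taken from the paper:

```python
# Sketch of moral rewards on the Iterated Prisoner's Dilemma (IPD).
# Standard IPD payoff matrix (T=5, R=3, P=1, S=0) -- assumed values;
# the paper may parameterize the game differently.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def game_reward(my_action: str, opp_action: str) -> int:
    """Selfish (game) payoff for the focal agent."""
    return PAYOFFS[(my_action, opp_action)][0]

def deontological_reward(my_action: str, opp_prev_action: str,
                         penalty: int = -3) -> int:
    """Action-based rule: penalize defecting against an opponent who
    cooperated on the previous move (penalty value is hypothetical)."""
    return penalty if (my_action == "D" and opp_prev_action == "C") else 0

def utilitarian_reward(my_action: str, opp_action: str) -> int:
    """Consequence-based reward: collective payoff of both players."""
    p_me, p_opp = PAYOFFS[(my_action, opp_action)]
    return p_me + p_opp
```

The action-based rule only inspects the agent's own move given the opponent's history, while the utilitarian reward depends on the joint outcome, mirroring the action/consequence distinction drawn in the paper.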
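The reward parameters ξ = 3 and R_illegal = 6 from the setup row suggest a shaped PPO reward. A minimal sketch, assuming ξ scales the moral component additively and R_illegal is applied as a flat penalty when the model emits an invalid (non-action) token; both of these combination choices are assumptions for illustration:

```python
def shaped_reward(game_r: float, moral_r: float, legal: bool,
                  xi: float = 3.0, r_illegal: float = 6.0) -> float:
    """Combine game payoff and moral reward for PPO fine-tuning.

    legal: whether the LLM's output parsed as a valid game action.
    xi: weight on the moral component (paper uses xi = 3).
    r_illegal: penalty magnitude for invalid outputs (paper uses 6).
    How the components are combined is an assumption, not the
    paper's exact formula.
    """
    if not legal:
        return -r_illegal
    return game_r + xi * moral_r
```

With these assumed semantics, an illegal output always costs -6, while a legal move earns its game payoff plus three times its moral reward.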