Moral Alignment for LLM Agents
Authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. |
| Researcher Affiliation | Academia | Elizaveta Tennant University College London University of Bologna EMAIL Stephen Hailes University College London EMAIL Mirco Musolesi University College London University of Bologna EMAIL |
| Pseudocode | No | The paper describes the fine-tuning methodology and reward functions in text and provides figures illustrating prompts, but it does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code (fine-tuning and analysis): https://github.com/liza-tennant/LLM_morality. |
| Open Datasets | Yes | We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner's Dilemma (IPD) environment. |
| Dataset Splits | No | The paper describes training in T episodes and evaluating over 10 episodes, but does not provide traditional dataset splits (e.g., percentages or counts for train/test/validation) for a fixed dataset, as the data is generated dynamically through game play. |
| Hardware Specification | Yes | All training was performed on a single A100 or V100 GPU with up to 40GB VRAM. |
| Software Dependencies | Yes | We used the following versions of the key Python packages: peft 0.11.1, transformers 4.42.3 |
| Experiment Setup | Yes | We run PPO training for T = 1000 episodes for each fine-tuning variation, using batch sizes of N = 3 and N = 5 for LLM vs LLM and LLM vs TFT training, respectively... We use 4-bit quantization LoRA with rank 64... and gradient accumulation with 4 steps... In terms of reward parameters, we set ξ = 3 and R_illegal = 6. |
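The action-based (Deontological) and consequence-based (Utilitarian) moral rewards on the IPD, quoted in the table above, can be sketched as follows. This is a minimal illustration only: the payoff values and the exact reward definitions are assumptions based on the report's description, not the authors' implementation.

```python
# Illustrative sketch of moral rewards on the Iterated Prisoner's Dilemma.
# Payoff matrix values are a common textbook choice (an assumption here);
# C = cooperate, D = defect.
PAYOFFS = {  # (my_move, opp_move) -> (my_payoff, opp_payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

XI = 3  # penalty magnitude, matching the reported xi = 3


def deontological_reward(my_move: str, opp_prev_move: str) -> int:
    """Action-based reward: penalize defecting against an opponent
    who cooperated on the previous turn (assumed formulation)."""
    return -XI if (my_move == "D" and opp_prev_move == "C") else 0


def utilitarian_reward(my_move: str, opp_move: str) -> int:
    """Consequence-based reward: the collective payoff of both players
    in the current round (assumed formulation)."""
    mine, theirs = PAYOFFS[(my_move, opp_move)]
    return mine + theirs
```

For example, defecting against a cooperator earns the highest individual payoff (5) but a deontological penalty of -3 and a lower collective payoff (5) than mutual cooperation (6).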
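The fine-tuning setup described in the last row (4-bit quantization, LoRA rank 64, gradient accumulation over 4 steps, peft 0.11.1 / transformers 4.42.3) could be expressed with standard peft/transformers configuration objects roughly as below. Only the quantization flag, the LoRA rank, and the accumulation steps come from the report; `lora_alpha`, `lora_dropout`, and the learning rate are illustrative assumptions.

```python
# Config-fragment sketch of the reported fine-tuning setup
# (not the authors' code; see https://github.com/liza-tennant/LLM_morality).
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization, as reported.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

lora_config = LoraConfig(
    r=64,                  # LoRA rank, as reported
    lora_alpha=16,         # assumption: not stated in the excerpt
    lora_dropout=0.05,     # assumption
    task_type="CAUSAL_LM",
)

# Gradient accumulation with 4 steps, as reported; the PPO batch sizes
# quoted above were N = 3 (LLM vs LLM) and N = 5 (LLM vs TFT).
GRADIENT_ACCUMULATION_STEPS = 4
```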