Quantifying the Self-Interest Level of Markov Social Dilemmas
Authors: Richard Willis, Yali Du, Joel Z. Leibo, Michael Luck
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our method on three environments from the Melting Pot suite... Our results illustrate how reward exchange can enable agents to transition from selfish to collective equilibria... This paper presents a novel method for empirically estimating the self-interest level of Markov game representations of social dilemmas using multi-agent reinforcement learning (MARL). Our primary contributions are twofold: we present a novel quantitative method for determining the self-interest level... and we provide more comprehensive experimental results on three environments featuring larger numbers of agents from the Melting Pot suite [Leibo et al., 2021]. |
| Researcher Affiliation | Collaboration | 1King's College London 2Google DeepMind 3University of Sussex EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methods and procedures in narrative form within Section 4 ('Method') and its subsections, without using structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | See https://github.com/willis-richard/meltingpot/tree/markov_sd for further details. |
| Open Datasets | Yes | We evaluate our approach using three environments from the Melting Pot suite [Leibo et al., 2021]: Commons Harvest, Clean Up, and Externality Mushrooms1. |
| Dataset Splits | No | The paper describes using specific environments from the Melting Pot suite for experiments and specifies episode length and total training steps (e.g., 'episode length to 2000 timesteps', 'train for 9000 episodes (18 million environment steps)'), but does not provide explicit training/validation/test dataset splits with percentages, sample counts, or specific files as typically defined for static datasets. |
| Hardware Specification | No | The paper mentions 'Compute resources were provided by King’s College London [King’s College London e-Research team, 2024]' in the acknowledgments, but does not specify any particular GPU models, CPU models, memory configurations, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Proximal Policy Optimisation (PPO)' as the learning algorithm, but it does not provide specific version numbers for PPO or any other software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used in the implementation. |
| Experiment Setup | Yes | For all environments, we fix the episode length to 2000 timesteps, and we modify the observation space by compressing each grid cell from 8x8 pixels to a single pixel... For our experiments, we use five random seeds and train for 9000 episodes (18 million environment steps) at each stage of the curriculum. We use a range of self-interest values... The ratios we use are [20:1, 10:1, 5:1, 3:1, 5:2, 2:1, 5:3, 4:3, 1:1]. We use a p-value threshold of 0.1 for the Dunnett's test. |