Quantifying the Self-Interest Level of Markov Social Dilemmas

Authors: Richard Willis, Yali Du, Joel Z. Leibo, Michael Luck

IJCAI 2025

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — "We demonstrate our method on three environments from the Melting Pot suite... Our results illustrate how reward exchange can enable agents to transition from selfish to collective equilibria... This paper presents a novel method for empirically estimating the self-interest level of Markov game representations of social dilemmas using multi-agent reinforcement learning (MARL). Our primary contributions are twofold: we present a novel quantitative method for determining the self-interest level... and we provide more comprehensive experimental results on three environments featuring larger numbers of agents from the Melting Pot suite [Leibo et al., 2021]."
Researcher Affiliation: Collaboration — 1King's College London, 2Google DeepMind, 3University of Sussex. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode: No — The paper describes its methods and procedures in narrative form within Section 4 ('Method') and its subsections, without structured pseudocode or algorithm blocks.
Open Source Code: Yes — "See https://github.com/willis-richard/meltingpot/tree/markov_sd for further details."
Open Datasets: Yes — "We evaluate our approach using three environments from the Melting Pot suite [Leibo et al., 2021]: Commons Harvest, Clean Up, and Externality Mushrooms."
Dataset Splits: No — The paper specifies episode length and training duration for the Melting Pot environments (e.g., 'episode length to 2000 timesteps', 'train for 9000 episodes (18 million environment steps)'), but does not provide explicit training/validation/test splits with percentages, sample counts, or specific files, as typically defined for static datasets.
Hardware Specification: No — The acknowledgments state that 'Compute resources were provided by King's College London [King's College London e-Research team, 2024]', but the paper does not specify GPU models, CPU models, memory configurations, or other hardware details used for the experiments.
Software Dependencies: No — The paper names Proximal Policy Optimisation (PPO) as the learning algorithm, but provides no version numbers for its implementation or for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used.
Experiment Setup: Yes — "For all environments, we fix the episode length to 2000 timesteps, and we modify the observation space by compressing each grid cell from 8x8 pixels to a single pixel... For our experiments, we use five random seeds and train for 9000 episodes (18 million environment steps) at each stage of the curriculum. We use a range of self-interest values... The ratios we use are [20:1, 10:1, 5:1, 3:1, 5:2, 2:1, 5:3, 4:3, 1:1]. We use a p-value threshold of 0.1 for Dunnett's test."
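To make the role of the listed ratios concrete, the following is a minimal sketch of one plausible reading of the reward-exchange scheme the paper describes: a ratio a:b is assumed (hypothetically; the paper's exact formulation may differ) to mean that each agent weights its own reward by a and each other agent's reward by b, with weights normalized to sum to one.

```python
def exchanged_rewards(rewards, own, other):
    """Mix each agent's reward with the others' under an own:other ratio.

    Hypothetical reading of the paper's reward exchange: with ratio
    own:other, agent i's mixed reward is
        (own * r_i + other * sum_{j != i} r_j) / (own + other * (n - 1)),
    so a 1:1 ratio gives every agent the mean reward (fully collective),
    while a large ratio such as 20:1 leaves rewards mostly selfish.
    """
    n = len(rewards)
    weight_total = own + other * (n - 1)
    grand_total = sum(rewards)
    return [
        (own * r + other * (grand_total - r)) / weight_total
        for r in rewards
    ]


# Example with three agents, where only agent 0 earned reward:
selfish = exchanged_rewards([4.0, 0.0, 0.0], 20, 1)   # near-selfish ratio
shared = exchanged_rewards([4.0, 0.0, 0.0], 1, 1)     # fully collective
```

Note that this normalization conserves the total reward across agents, so the mixing only redistributes incentives rather than changing the scale of returns.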