Self-Play $Q$-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma
Authors: Quentin Bertrand, Juan Agustin Duque, Emilio Calvano, Gauthier Gidel
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms. (Section 5: Experiments) |
| Researcher Affiliation | Academia | 1Université Jean Monnet Saint-Etienne, CNRS, Institut d'Optique Graduate School, Inria, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Étienne, France; 2Mila, Université de Montréal; 3Università LUISS (Rome), Toulouse School of Economics, EIEF and CEPR; 4Canada AI CIFAR Chair. |
| Pseudocode | Yes | Algorithm 1: Multi-agent Self-Play Q-learning; Algorithm 2: ϵ-greedy |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | For simplicity, in the experiments (Section 5), we consider simplified rewards, which are parameterized by a single scalar g, 1 < g < 2 (see Table 3 in Appendix E, as in Banchio & Mantegazza 2022). |
| Dataset Splits | No | The paper describes experiments in a simulated environment (iterated prisoner's dilemma) and does not use or specify any training/test/validation dataset splits for external datasets. |
| Hardware Specification | Yes | 5 runs are displayed in Figure 5; each run takes 3 hours to train on a single RTX8000 GPU. |
| Software Dependencies | No | The paper mentions training a deep Q-network agent (Mnih et al., 2015) but does not provide specific version numbers for any software libraries or dependencies used in its implementation. |
| Experiment Setup | Yes | Experimental Setup. In all the experiments we consider a prisoner's dilemma with a fixed incentive to cooperate g and a fixed discount factor γ: g = 1.8 and γ = 0.6. In Figures 1 to 3 the stepsize α is fixed to α = 0.1. Table 4 lists the hyperparameters used in the deep Q-learning experiment (Figure 5): tau = 0.01, seed = 8, gamma = 0.8, buffer_capacity = 1000000, decay_eps = true, eps_decay_steps = 600, eps_start = 0.5, eps_end = 0.01, loss_type = Huber Loss, optimizer_type = SGD, hidden_size = 32, num_actions = 2, num_iters = 10000, batch_size = 16384, do_self_play = true, pretrain_iters = 600, pretrain_vs_random = true. |
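The setup above (tabular self-play Q-learning with ϵ-greedy exploration on an iterated prisoner's dilemma with incentive g = 1.8, discount γ = 0.6, and stepsize α = 0.1) can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: the exact reward table of the paper's Table 3 is not reproduced here, so a generic g-parameterized prisoner's-dilemma payoff (mutual cooperation 1, temptation g, sucker 1 − g, mutual defection 0) is assumed, and state is taken to be the pair of last actions.

```python
import numpy as np

def reward(a_self, a_other, g=1.8):
    """Hypothetical simplified PD rewards parameterized by g, 1 < g < 2.
    Action 0 = cooperate, 1 = defect. Satisfies g > 1 > 0 > 1 - g."""
    table = {(0, 0): 1.0, (1, 0): g, (0, 1): 1.0 - g, (1, 1): 0.0}
    return table[(a_self, a_other)]

def self_play_q_learning(g=1.8, gamma=0.6, alpha=0.1,
                         eps=0.1, iters=20000, seed=0):
    """Tabular self-play Q-learning sketch (in the spirit of Algorithms 1-2):
    both players act eps-greedily from one shared Q-table indexed by
    (own last action, opponent's last action, next action)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((2, 2, 2))          # Q[own_last, opp_last, action]
    s = (0, 0)                       # joint state: (player 1 last, player 2 last)
    for _ in range(iters):
        # eps-greedy action selection for both players from the shared Q
        a1 = rng.integers(2) if rng.random() < eps else int(Q[s[0], s[1]].argmax())
        a2 = rng.integers(2) if rng.random() < eps else int(Q[s[1], s[0]].argmax())
        r1, r2 = reward(a1, a2, g), reward(a2, a1, g)
        s1_next, s2_next = (a1, a2), (a2, a1)  # each player's next state view
        # standard Q-learning (TD) updates with stepsize alpha
        Q[s[0], s[1], a1] += alpha * (r1 + gamma * Q[s1_next].max()
                                      - Q[s[0], s[1], a1])
        Q[s[1], s[0], a2] += alpha * (r2 + gamma * Q[s2_next].max()
                                      - Q[s[1], s[0], a2])
        s = s1_next
    return Q
```

The deep Q-learning variant in Figure 5 replaces the table with a small network (hidden_size = 32, Huber loss, SGD) and a replay buffer, per the hyperparameters listed above.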