Self-Play $Q$-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma

Authors: Quentin Bertrand, Juan Agustin Duque, Emilio Calvano, Gauthier Gidel

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms. (Section 5, Experiments) |
| Researcher Affiliation | Academia | 1. Université Jean Monnet Saint-Étienne, CNRS, Institut d'Optique Graduate School, Inria, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Étienne, France; 2. Mila, Université de Montréal; 3. Università LUISS (Rome), Toulouse School of Economics, EIEF and CEPR; 4. Canada CIFAR AI Chair. |
| Pseudocode | Yes | Algorithm 1 (Multi-agent Self-Play Q-learning); Algorithm 2 (ε-greedy). |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | For simplicity, in the experiments (Section 5), we consider simplified rewards, which are parameterized by a single scalar g, 1 < g < 2 (see Table 3 in Appendix E, as in Banchio & Mantegazza, 2022). |
| Dataset Splits | No | The paper describes experiments in a simulated environment (the iterated prisoner's dilemma) and does not use or specify any training/validation/test splits for external datasets. |
| Hardware Specification | Yes | 5 runs are displayed in Figure 5; each run takes 3 hours to train on a single RTX 8000 GPU. |
| Software Dependencies | No | The paper mentions training a deep Q-network agent (Mnih et al., 2015) but does not provide version numbers for any software libraries or dependencies used in its implementation. |
| Experiment Setup | Yes | In all the experiments we consider a prisoner's dilemma with a fixed incentive to cooperate g and a fixed discount factor γ: g = 1.8 and γ = 0.6. In Figures 1 to 3 the stepsize α is fixed to α = 0.1. Table 4 lists the hyperparameters used in the deep Q-learning experiment (Figure 5): tau = 0.01, seed = 8, gamma = 0.8, buffer_capacity = 1000000, decay_eps = true, eps_decay_steps = 600, eps_start = 0.5, eps_end = 0.01, loss_type = Huber Loss, optimizer_type = SGD, hidden_size = 32, num_actions = 2, num_iters = 10000, batch_size = 16384, do_self_play = true, pretrain_iters = 600, pretrain_vs_random = true. |
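To make the reported setup concrete, here is a minimal sketch of what Algorithm 1 (multi-agent self-play Q-learning) with Algorithm 2 (ε-greedy) could look like on the simplified prisoner's dilemma, using the paper's reported values g = 1.8, γ = 0.6, α = 0.1. The payoff table (only its parameterization by a single scalar g is stated in the report) and the state encoding as the previous joint action are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def payoff(a_self, a_other, g=1.8):
    # Assumed simplified prisoner's dilemma payoffs, parameterized by g
    # (1 < g < 2). Action 0 = cooperate, 1 = defect.
    table = {(0, 0): 1.0, (1, 0): g, (0, 1): 0.0, (1, 1): g - 1.0}
    return table[(a_self, a_other)]

def eps_greedy(q_row, eps, rng):
    # Epsilon-greedy action selection (Algorithm 2): explore with
    # probability eps, otherwise pick the greedy action.
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(np.argmax(q_row))

def self_play_q_learning(g=1.8, gamma=0.6, alpha=0.1, eps=0.1,
                         num_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    # One shared Q-table for both players (self-play). State = previous
    # joint action from the agent's own perspective (4 states), plus an
    # assumed initial state 4 before any action has been played.
    Q = np.zeros((5, 2))
    s1 = s2 = 4
    for _ in range(num_iters):
        a1 = eps_greedy(Q[s1], eps, rng)
        a2 = eps_greedy(Q[s2], eps, rng)
        n1, n2 = 2 * a1 + a2, 2 * a2 + a1  # next state, own perspective
        # Standard Q-learning update for each side of the shared table.
        Q[s1, a1] += alpha * (payoff(a1, a2, g) + gamma * Q[n1].max() - Q[s1, a1])
        Q[s2, a2] += alpha * (payoff(a2, a1, g) + gamma * Q[n2].max() - Q[s2, a2])
        s1, s2 = n1, n2
    return Q

Q = self_play_q_learning()
```

With this state encoding, inspecting `Q` after training shows which previous joint actions make cooperation (action 0) the greedy choice; the paper's claim is that for suitable g and γ, such dynamics can settle on cooperative (collusive) play.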
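The Table 4 hyperparameters include an epsilon-decay schedule (decay_eps = true, eps_start = 0.5, eps_end = 0.01, eps_decay_steps = 600). The report does not state the decay shape, so the linear interpolation below is an assumption; exponential decay would be an equally plausible reading.

```python
def epsilon_at(step, eps_start=0.5, eps_end=0.01, decay_steps=600):
    # Assumed linear decay from eps_start to eps_end over decay_steps
    # iterations, then held constant at eps_end.
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Under this reading, exploration starts at 50%, reaches 1% by iteration 600 (matching pretrain_iters), and stays there for the remaining iterations.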