Self-Play $Q$-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma

Authors: Quentin Bertrand, Juan Agustin Duque, Emilio Calvano, Gauthier Gidel

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms. (Section 5, Experiments) |
| Researcher Affiliation | Academia | 1. Université Jean Monnet Saint-Étienne, CNRS, Institut d'Optique Graduate School, Inria, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Étienne, France; 2. Mila, Université de Montréal; 3. Università LUISS (Rome), Toulouse School of Economics, EIEF and CEPR; 4. Canada CIFAR AI Chair. |
| Pseudocode | Yes | Algorithm 1 (Multi-agent Self-Play Q-learning); Algorithm 2 (ε-greedy). |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | For simplicity, in the experiments (Section 5), we consider simplified rewards, which are parameterized by a single scalar g, 1 < g < 2 (see Table 3 in Appendix E, as in Banchio & Mantegazza, 2022). |
| Dataset Splits | No | The paper describes experiments in a simulated environment (the iterated prisoner's dilemma) and does not use or specify any training/validation/test splits for external datasets. |
| Hardware Specification | Yes | 5 runs are displayed in Figure 5; each run takes 3 hours to train on a single RTX 8000 GPU. |
| Software Dependencies | No | The paper mentions training a deep Q-network agent (Mnih et al., 2015) but does not provide version numbers for any software libraries or dependencies used in its implementation. |
| Experiment Setup | Yes | In all the experiments we consider a prisoner's dilemma with a fixed incentive to cooperate g and a fixed discount factor γ: g = 1.8 and γ = 0.6. In Figures 1 to 3 the stepsize α is fixed to α = 0.1. Table 4 lists the hyperparameters used in the deep Q-learning experiment (Figure 5): tau = 0.01, seed = 8, gamma = 0.8, buffer_capacity = 1000000, decay_eps = true, eps_decay_steps = 600, eps_start = 0.5, eps_end = 0.01, loss_type = Huber Loss, optimizer_type = SGD, hidden_size = 32, num_actions = 2, num_iters = 10000, batch_size = 16384, do_self_play = true, pretrain_iters = 600, pretrain_vs_random = true. |
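To make the reported setup concrete, here is a minimal sketch of what Algorithm 1 (multi-agent self-play Q-learning) with Algorithm 2 (ε-greedy) could look like on the simplified prisoner's dilemma, using the paper's reported values g = 1.8, γ = 0.6, α = 0.1. The payoff table (only its parameterization by a single scalar g is stated in the report) and the state encoding as the previous joint action are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def payoff(a_self, a_other, g=1.8):
    # Assumed simplified prisoner's dilemma payoffs, parameterized by g
    # (1 < g < 2). Action 0 = cooperate, 1 = defect.
    table = {(0, 0): 1.0, (1, 0): g, (0, 1): 0.0, (1, 1): g - 1.0}
    return table[(a_self, a_other)]

def eps_greedy(q_row, eps, rng):
    # Epsilon-greedy action selection (Algorithm 2): explore with
    # probability eps, otherwise pick the greedy action.
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(np.argmax(q_row))

def self_play_q_learning(g=1.8, gamma=0.6, alpha=0.1, eps=0.1,
                         num_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    # One shared Q-table for both players (self-play). State = previous
    # joint action from the agent's own perspective (4 states), plus an
    # assumed initial state 4 before any action has been played.
    Q = np.zeros((5, 2))
    s1 = s2 = 4
    for _ in range(num_iters):
        a1 = eps_greedy(Q[s1], eps, rng)
        a2 = eps_greedy(Q[s2], eps, rng)
        n1, n2 = 2 * a1 + a2, 2 * a2 + a1  # next state, own perspective
        # Standard Q-learning update for each side of the shared table.
        Q[s1, a1] += alpha * (payoff(a1, a2, g) + gamma * Q[n1].max() - Q[s1, a1])
        Q[s2, a2] += alpha * (payoff(a2, a1, g) + gamma * Q[n2].max() - Q[s2, a2])
        s1, s2 = n1, n2
    return Q

Q = self_play_q_learning()
```

With this state encoding, inspecting `Q` after training shows which previous joint actions make cooperation (action 0) the greedy choice; the paper's claim is that for suitable g and γ, such dynamics can settle on cooperative (collusive) play.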
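The Table 4 hyperparameters include an epsilon-decay schedule (decay_eps = true, eps_start = 0.5, eps_end = 0.01, eps_decay_steps = 600). The report does not state the decay shape, so the linear interpolation below is an assumption; exponential decay would be an equally plausible reading.

```python
def epsilon_at(step, eps_start=0.5, eps_end=0.01, decay_steps=600):
    # Assumed linear decay from eps_start to eps_end over decay_steps
    # iterations, then held constant at eps_end.
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Under this reading, exploration starts at 50%, reaches 1% by iteration 600 (matching pretrain_iters), and stays there for the remaining iterations.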