Randomised Optimism via Competitive Co-Evolution for Matrix Games with Bandit Feedback

Authors: Shishen Lin

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on diverse matrix game benchmarks demonstrate that COEBL not only achieves sublinear regret but also consistently outperforms classical bandit algorithms, including EXP3, the variant EXP3-IX, and UCB. These results highlight the potential of evolutionary bandit learning, particularly the efficacy of randomised optimism via evolutionary algorithms in game-theoretic settings. In this section, we present empirical results comparing the discussed algorithms. We conducted extensive experiments on three matrix games: the RPS game, the DIAGONAL game, and the BIGGERNUMBER game.
Researcher Affiliation | Academia | Shishen Lin, School of Computer Science, University of Birmingham, Birmingham, United Kingdom. EMAIL
Pseudocode | Yes | Algorithm 1: COEBL for matrix games
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We consider the classic matrix game benchmark: Rock-Paper-Scissors games [Littman, 1994; O'Donoghue et al., 2021], and its payoff matrix is defined as follows. DIAGONAL is a pseudo-Boolean maximin benchmark on which Lehre and Lin [2024b] conducted runtime analysis of coevolutionary algorithms. BIGGERNUMBER is another challenging two-player zero-sum game proposed and analysed by Zhang and Sandholm [2024].
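The quoted passage says the Rock-Paper-Scissors payoff matrix "is defined as follows" but the matrix itself was not extracted. A minimal sketch of the standard zero-sum RPS payoff matrix for the row player is shown below; the sign convention and action ordering are assumptions, since the paper's exact matrix is not reproduced here.

```python
import numpy as np

# Standard Rock-Paper-Scissors payoff matrix for the row player,
# with rows/columns ordered (Rock, Paper, Scissors). The paper's
# exact convention may differ; this is an illustrative assumption.
RPS = np.array([
    [ 0, -1,  1],   # Rock:     ties Rock, loses to Paper, beats Scissors
    [ 1,  0, -1],   # Paper:    beats Rock, ties Paper, loses to Scissors
    [-1,  1,  0],   # Scissors: loses to Rock, beats Paper, ties Scissors
])

# Zero-sum sanity check: the column player's payoff is the negation,
# i.e. the matrix is antisymmetric.
assert (RPS == -RPS.T).all()
```

Because the matrix is antisymmetric, the game is symmetric and its unique Nash equilibrium is the uniform mixed strategy (1/3, 1/3, 1/3), which is why RPS is a natural benchmark for bandit algorithms converging to equilibrium play.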
Dataset Splits | No | The paper describes experiments on matrix games, which are not typically partitioned into training/validation/test splits in the traditional machine learning sense. The experiments involve iterations of game play and simulations, but no explicit data-splitting strategy is mentioned.
Hardware Specification | No | The computations were performed using the University of Birmingham's BlueBEAR high-performance computing (HPC) service. This mentions a service but lacks specific hardware details such as GPU/CPU models.
Software Dependencies | No | The paper does not explicitly mention any software dependencies with specific version numbers.
Experiment Setup | Yes | For EXP3, we use the exploration rate γ_t = min{√(K log K / t), 1} and the learning rate η_t = √(2 log K / (tK)), as suggested in [O'Donoghue et al., 2021]. For the variant EXP3-IX, we use the settings η_t = t^{-k_η}, β_t = t^{-k_β}, ε_t = t^{-k_ε}, where k_η = 5/8, k_β = 3/8, k_ε = 1/8, as suggested in [Cai et al., 2023]. For COEBL, we set the mutation rate c = 2 for the RPS game and c = 8 for the remaining games. No hyper-parameter is needed for UCB. We compute the empirical mean of the regrets and the KL-divergence (or total variation distance), and present 95% confidence intervals in the plots. We run 50 independent simulations (up to 3000 iterations) for each configuration, over 50 seeds.
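The hyper-parameter schedules above can be sketched as small helper functions. This is a hedged reconstruction: the square-root forms for EXP3 and the negative-exponent reading of the EXP3-IX schedules (η_t = t^{-k_η}, etc.) are inferred from the garbled extraction, not copied verbatim from the paper.

```python
import math

def exp3_rates(t: int, K: int) -> tuple[float, float]:
    """EXP3 schedules at round t for K arms, as reconstructed from the
    report: gamma_t = min(sqrt(K log K / t), 1), eta_t = sqrt(2 log K / (t K))."""
    gamma_t = min(math.sqrt(K * math.log(K) / t), 1.0)
    eta_t = math.sqrt(2 * math.log(K) / (t * K))
    return gamma_t, eta_t

def exp3_ix_rates(t: int,
                  k_eta: float = 5 / 8,
                  k_beta: float = 3 / 8,
                  k_eps: float = 1 / 8) -> tuple[float, float, float]:
    """EXP3-IX polynomially decaying schedules; reading the source as
    eta_t = t^{-k_eta}, beta_t = t^{-k_beta}, eps_t = t^{-k_eps} is an
    assumption about the garbled notation."""
    return t ** -k_eta, t ** -k_beta, t ** -k_eps
```

For the 3-action RPS game, for example, the EXP3 exploration rate starts saturated at 1 (since √(3 log 3) > 1) and decays toward 0 as t grows, while all three EXP3-IX rates equal 1 at t = 1 and decay at their respective polynomial speeds.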