Online Double Oracle

Authors: Le Cong Dinh, Stephen Marcus McAleer, Zheng Tian, Nicolas Perez-Nieves, Oliver Slumbers, David Henry Mguni, Jun Wang, Haitham Bou Ammar, Yaodong Yang

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our algorithm on tens of games including random matrix games, real-world matrix games (Czarnecki et al., 2020), and Kuhn and Leduc Poker. Results show that in almost all games, ODO outperforms both DO and PSRO variants (Lanctot et al., 2017; McAleer et al., 2020), and the online learning baseline MWU (Freund & Schapire, 1999) in terms of exploitability (i.e., distance to an NE) and average payoffs against different types of strategic adversaries.
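Exploitability, the metric quoted above, measures how far a strategy profile is from a Nash equilibrium: the sum of each player's best-response gain over the current profile. A minimal sketch for a two-player zero-sum matrix game (the function name and setup are illustrative, not from the paper's code):

```python
import numpy as np

def exploitability(A, x, y):
    """Exploitability of the profile (x, y) in a zero-sum matrix game
    with payoff matrix A (row player maximises x^T A y).
    It is 0 exactly at a Nash equilibrium."""
    value = x @ A @ y
    row_gain = np.max(A @ y) - value   # row player's best deviation gain
    col_gain = value - np.min(x @ A)   # column player's best deviation gain
    return row_gain + col_gain

# Matching Pennies: the uniform profile is the unique Nash equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
print(exploitability(A, uniform, uniform))  # → 0.0
```

An ODO-style learner drives this quantity toward zero over time, which is why the report uses it as the headline comparison metric.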
Researcher Affiliation | Collaboration | Le Cong Dinh (University of Southampton); Stephen McAleer (University of California, Irvine); Zheng Tian (ShanghaiTech University); Nicolas Perez-Nieves (Imperial College London); Oliver Slumbers (University College London); David Henry Mguni (Huawei R&D UK); Jun Wang (Huawei R&D UK); Haitham Bou Ammar (Huawei R&D UK); Yaodong Yang (Institute for AI, Peking University)
Pseudocode | Yes | Algorithm 1: Online Single Oracle Algorithm
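The Online Single Oracle algorithm named above combines no-regret learning with best-response oracles: a no-regret learner (MWU) runs over a small, growing set of strategies, and a best response to the opponent's observed play is added when it falls outside the current set. A minimal sketch of that loop under these assumptions (`online_single_oracle` and its signature are illustrative names, not the paper's implementation):

```python
import numpy as np

def online_single_oracle(A, opponent_cols, mu=0.01):
    """OSO-style loop for the row player of a zero-sum matrix game with
    payoff matrix A, against an observed sequence of opponent columns."""
    strategy_set = [0]              # start from a single pure strategy
    w = np.ones(1)                  # MWU weights over strategy_set
    counts = np.zeros(A.shape[1])   # opponent's empirical play
    for col in opponent_cols:
        counts[col] += 1
        # Best response to the opponent's average play so far.
        br = int(np.argmax(A @ (counts / counts.sum())))
        if br not in strategy_set:          # expand the effective set...
            strategy_set.append(br)
            w = np.ones(len(strategy_set))  # ...and restart MWU weights
        # MWU update restricted to the current strategy set.
        w = w * np.exp(mu * A[strategy_set, col])
        w = w / w.sum()
    return strategy_set, w
```

The key point the sketch illustrates is that regret is only ever measured over the small `strategy_set`, not the full action space, which is what makes the approach tractable in large games.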
Open Source Code | Yes | All the codes for the experiments are attached as supplementary.
Open Datasets | Yes | We test our algorithm on tens of games including random matrix games, real-world matrix games (Czarnecki et al., 2020), and Kuhn and Leduc Poker.
Dataset Splits | No | The paper mentions random matrix games, real-world matrix games, and Kuhn and Leduc Poker, but does not specify any training, validation, or test splits for these games. The experiments are instead described as repeated runs of game play ('20 seeds for each setting').
Hardware Specification | No | The paper does not provide specific hardware details such as CPU models, GPU models, or memory specifications used for running the experiments. It only mentions that hyperparameters are in Appendix D, which does not contain hardware information.
Software Dependencies | No | The paper mentions tools like OpenSpiel (Lanctot et al., 2019) and MALib (Zhou et al., 2021) but does not specify their version numbers. It also refers to algorithms like MWU, CFR, and XFP without providing specific software implementation versions.
Experiment Setup | Yes | All hyperparameter settings can be found in Appendix D. Learning rate of MWU: when facing an MWU adversary, we set the adversary's learning rate to the optimal value (i.e., µt = √(8 log(A)/T)). That is, we eliminate the case where one can exploit an MWU adversary with a wrong learning rate. For the DO-MWU algorithm, we allow the algorithm to move to the next subgame once it achieves an ϵ-Nash equilibrium with ϵ = 0.01. We therefore choose the learning rate of MWU in a subgame to be µt = 0.01, so that DO-MWU achieves a 0.01-Nash equilibrium in subgames in at most O(log(k)/ϵ²) steps. For a fair comparison with DO methods, we also choose the same learning rate µt = 0.01 for our ODO in the experiments.
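The optimal learning rate and the multiplicative-weights update quoted above can be sketched as follows (a minimal sketch of the standard MWU rule of Freund & Schapire, 1999; the function names are illustrative, not from the paper's code):

```python
import numpy as np

def mwu_learning_rate(num_actions, horizon):
    # The theoretically optimal MWU step size mu_t = sqrt(8 log(A) / T),
    # as set for the adversary in the experiments.
    return np.sqrt(8 * np.log(num_actions) / horizon)

def mwu_step(weights, payoffs, mu):
    # One multiplicative-weights update: exponentially reweight each
    # action by its observed payoff, then renormalise to a distribution.
    weights = weights * np.exp(mu * payoffs)
    return weights / weights.sum()
```

Starting from the uniform distribution and repeatedly applying `mwu_step` against a fixed payoff vector concentrates the weights on the best action, which is the no-regret behaviour the baselines in the paper rely on.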