A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence

Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "In the experiments, we apply QFR in 4-Sided Liar's Dice, Leduc Poker (Southey et al., 2005), Kuhn Poker (Kuhn, 1950), and 2×2 Abrupt Dark Hex. The experimental result of Algorithm 1 is presented in Figure 1. Figure 1 shows that QFR outperforms outcome-sampling CFR, CFR+, and BOMD in all games."
Researcher Affiliation: Academia. "Mingyang Liu, Gabriele Farina & Asuman Ozdaglar, LIDS, EECS, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, EMAIL"
Pseudocode: Yes. "Algorithm 1: Q-Function based Regret minimization (QFR)"
Open Source Code: Yes. "The code of QFR and baselines for tabular games can be found in LiteEFG (Liu et al., 2024)." Footnote: https://github.com/liumy2010/LiteEFG/tree/main/LiteEFG/baselines
Open Datasets: Yes. "In the experiments, we apply QFR in 4-Sided Liar's Dice, Leduc Poker (Southey et al., 2005), Kuhn Poker (Kuhn, 1950), and 2×2 Abrupt Dark Hex. The code is based on LiteEFG (Liu et al., 2024) with game environments implemented by OpenSpiel (Lanctot et al., 2019)."
Dataset Splits: No. The paper uses extensive-form games (e.g., Leduc Poker, Kuhn Poker) as environments for experimentation. These are dynamic environments where agents interact, and performance is measured through metrics like exploitability over iterations rather than on fixed training, validation, and test splits. The paper does not describe any dataset splits in the conventional sense.
Hardware Specification: Yes. "Figure 1 and Figure 2 are conducted on 240 cores of Intel Xeon Platinum 8260 and Figure 3 is conducted on Intel(R) Xeon Gold 6248 with NVIDIA Volta V100."
Software Dependencies: No. "The code is based on LiteEFG (Liu et al., 2024) with game environments implemented by OpenSpiel (Lanctot et al., 2019). Our implementation of QFR is based on PPO (Schulman et al., 2017) in CleanRL (Huang et al., 2022)." While these are specific software components and frameworks, no version numbers are provided for reproducibility.
Experiment Setup: Yes. "In order to pick hyperparameters, we performed a grid search for QFR and MMD over the learning rate η, regularization τ, and perturbation γ, with the regularizer being either negative entropy or Euclidean distance. For Balanced OMD (BOMD) (Bai et al., 2022) and Balanced FTRL (Fiegel et al., 2023), we applied grid search to the learning rate η and fixed the exploration rate (IX parameter) to η/20 as suggested in Fiegel et al. (2023)."
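The hyperparameter selection described above can be sketched as a plain grid search. This is a minimal illustration only: the candidate grids and the `evaluate` function below are hypothetical placeholders (the report does not list the actual values searched), standing in for a run of QFR that returns a final exploitability.

```python
from itertools import product

# Hypothetical candidate grids; the actual values searched are not reported.
etas = [0.01, 0.1, 1.0]                      # learning rate η
taus = [0.001, 0.01, 0.1]                    # regularization τ
gammas = [0.0, 0.01, 0.1]                    # perturbation γ
regularizers = ["neg_entropy", "euclidean"]  # choice of regularizer

def evaluate(eta, tau, gamma, reg):
    """Placeholder for training QFR on a game (e.g., Kuhn Poker) with the
    given hyperparameters and returning the final exploitability.
    Here it is a dummy score so the sketch runs; lower is better."""
    return abs(eta - 0.1) + tau + gamma + (0.0 if reg == "neg_entropy" else 0.05)

# Exhaustively score every configuration and keep the best one.
best = min(product(etas, taus, gammas, regularizers),
           key=lambda cfg: evaluate(*cfg))
print(best)
```

With the dummy scorer above, the search selects η = 0.1, τ = 0.001, γ = 0.0 with the negative-entropy regularizer; a real sweep would rank configurations by measured exploitability instead.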