Can Reinforcement Learning Solve Asymmetric Combinatorial-Continuous Zero-Sum Games?

Authors: Yuheng Li, Panpan Wang, Haipeng Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across diverse instances of COPs demonstrate the empirical effectiveness of our algorithms. The learned policies are tested on 200 graphs, with 100 randomly selected from the 10,000 training graphs (to show the average reward) and the other 100 being unseen graphs (to test policy generalization). Exploitability is a common metric for measuring closeness to the true NE: it sums, over the players, the performance gap between each new best response and the subgame NE, i.e. $\sum_{i=1,2} U(\pi^{br}_{i,k}, \sigma_{-i,k}) - U(\sigma)$ in the general two-player game. From Figure 1, we can see that CCDO-RL converges to an approximate NE in 25 iterations or fewer (in the PG setting), reaching 0.05 in ACSP, 0.10 in ACVRP, and 0.03 in PG with 20 nodes. Similar results are observed in problems with 50 nodes (see Figure 2 in Appendix F). These results validate the effectiveness of CCDO-RL in finding the NE for various types of games. Average Reward. As illustrated in Table 1, our algorithm achieves a better average reward than baselines.
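The exploitability metric quoted above can be sketched in a few lines. This is an illustrative stand-in, not the paper's implementation: `U` is a hypothetical zero-sum payoff function (player 1's utility; player 2 receives `-U`), and the best responses and strategy profile are placeholders.

```python
def exploitability(U, br_policies, profile):
    """Sum of each player's best-response gain over the current profile.

    U(p1_policy, p2_policy) -> payoff to player 1 (zero-sum game).
    br_policies = (best response of player 1, best response of player 2).
    profile = (player 1 strategy, player 2 strategy), e.g. a subgame NE.
    """
    br1, br2 = br_policies
    p1, p2 = profile
    # Player 1's gain from deviating to its best response.
    gain1 = U(br1, p2) - U(p1, p2)
    # Player 2's gain; its utility is -U in the zero-sum game.
    gain2 = -U(p1, br2) - (-U(p1, p2))
    return gain1 + gain2
```

At an exact NE no deviation gains anything, so exploitability is 0; positive values measure how far the profile is from equilibrium.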
Researcher Affiliation | Academia | Yuheng Li, Panpan Wang & Haipeng Chen, Department of Data Science, College of William & Mary
Pseudocode | Yes | Algorithm 1: Combinatorial-Continuous Double Oracle Reinforcement Learning Algorithm; Algorithm 2: Combinatorial-Continuous Double Oracle Algorithm; Algorithm 3: Combinatorial-Continuous Double Oracle Approximate Algorithm
Open Source Code | Yes | The code of this work is available at https://github.com/wmd3i/CCDO-RL.
Open Datasets | Yes | For ACVRP and ACSP, we use the default data generated from RL4CO (Berto et al., 2023). For PG, we use the instances generated from a generator provided in the AI for TSP competition hosted at IJCAI21.
Dataset Splits | Yes | CCDO-RL is trained on a set of 10,000 graphs (with 20 or 50 nodes). The learned policies are tested on 200 graphs, with 100 randomly selected from the 10,000 training graphs (to show the average reward) and the other 100 being unseen graphs (to test policy generalization).
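The evaluation split described in that row can be sketched as follows. Graph IDs, the seed, and the "generate fresh IDs" step are illustrative stand-ins for the paper's actual graph-generation pipeline.

```python
import random

rng = random.Random(0)                       # assumed seed, for illustration
train_graphs = list(range(10_000))           # the 10,000 training graphs
seen_test = rng.sample(train_graphs, 100)    # 100 drawn from the training set
unseen_test = list(range(10_000, 10_100))    # 100 freshly generated graphs
```

The point of the two halves is that `seen_test` measures average reward on the training distribution, while `unseen_test` checks generalization to graphs never encountered during training.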
Hardware Specification | Yes | All experiments on the three COPs are implemented in Python and conducted on two machines. One has an NVIDIA GeForce RTX 4090 GPU and a 24-core Intel i9-13900K with 24 GB of memory. The other has an NVIDIA V100 GPU and a 32-core Intel Xeon E5-2683 v4 with 256 GB of memory.
Software Dependencies | No | All experiments on the three COPs are implemented in Python and conducted on two machines. Nashpy implementation (Knight and Campbell, 2018).
Experiment Setup | Yes | The hyperparameters of CCDO-RL are specified in Appendix D (Table 3).

Table 3: Parameters of CCDO-RL

Parameter                ACVRP20   ACVRP50   ACSP20    ACSP50    PG20      PG50
Iteration                26        35        24        35        13        12
Batch size (prog/adv)    512       512       512       512       1024      1024
Prog training epochs     10        25        10        20        150       150
Prog training decoder    sampling  sampling  sampling  sampling  sampling  top-p sampling
Prog eval/test decoder   greedy    greedy    greedy    greedy    greedy    beam search
Learning rate (prog)     1e-4      1e-4      1e-4      1e-4      2e-4      2.5e-4
Adv BR training epochs   5         20        20        20        20        50
Learning rate (adv)      1e-4      5e-5      5e-5      5e-5      5e-5      5e-5
Clip range (PPO, adv)    0.2       0.2       0.2       0.2       0.2       0.2
Value func λ (PPO)       0.5       0.5       0.5       0.5       0.5       0.5
Entropy λ (PPO)          0.01      0.0       0.0       0.0       0.0       0.0
Max gradient (PPO)       0.5       0.5       0.5       0.5       0.5       0.5
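For reimplementation, one column of Table 3 can be transcribed into a config structure. The values below are the paper's ACVRP20 settings; the key names and grouping into protagonist/adversary sections are our own and purely illustrative.

```python
# Hypothetical config transcription of Table 3, ACVRP20 column.
ccdo_rl_acvrp20 = {
    "iterations": 26,
    "batch_size": 512,              # protagonist and adversary
    "protagonist": {
        "training_epochs": 10,
        "train_decoder": "sampling",
        "eval_decoder": "greedy",
        "learning_rate": 1e-4,
    },
    "adversary_ppo": {
        "br_training_epochs": 5,
        "learning_rate": 1e-4,
        "clip_range": 0.2,
        "value_coef": 0.5,          # value-function λ
        "entropy_coef": 0.01,       # entropy λ
        "max_grad_norm": 0.5,
    },
}
```

Keeping all six instance columns in such dicts makes it easy to diff settings across problem sizes, e.g. the larger batch and top-p/beam-search decoders used only for PG.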