Can Reinforcement Learning Solve Asymmetric Combinatorial-Continuous Zero-Sum Games?

Authors: Yuheng Li, Panpan Wang, Haipeng Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across diverse instances of COPs demonstrate the empirical effectiveness of our algorithms. The learned policies are tested on 200 graphs, with 100 randomly selected from the 10,000 training graphs (to show the average reward) and the other 100 being unseen graphs (to test policy generalization). Exploitability is a common metric for measuring closeness to the true NE: it sums, over the players, the performance gap between each new best response and the subgame NE, i.e. $\sum_{i=1,2} U(\pi^{br}_{i,k}, \sigma_{-i,k}) - U(\sigma)$ in the general two-player game. From Figure 1, we can see that CCDO-RL converges to an approximate NE in 25 iterations or fewer (in the PG setting), reaching 0.05 in ACSP, 0.10 in ACVRP, and 0.03 in PG with 20 nodes. Similar results are observed in problems with 50 nodes (see Figure 2 in Appendix F). These results validate the effectiveness of CCDO-RL in finding the NE for various types of games. Average Reward. As illustrated in Table 1, our algorithm achieves a better average reward than baselines.
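The exploitability metric quoted above can be sketched in a few lines. This is an illustrative stand-in, not the paper's implementation: `U` is a hypothetical zero-sum payoff function (player 1's utility; player 2 receives `-U`), and the best responses and strategy profile are placeholders.

```python
def exploitability(U, br_policies, profile):
    """Sum of each player's best-response gain over the current profile.

    U(p1_policy, p2_policy) -> payoff to player 1 (zero-sum game).
    br_policies = (best response of player 1, best response of player 2).
    profile = (player 1 strategy, player 2 strategy), e.g. a subgame NE.
    """
    br1, br2 = br_policies
    p1, p2 = profile
    # Player 1's gain from deviating to its best response.
    gain1 = U(br1, p2) - U(p1, p2)
    # Player 2's gain; its utility is -U in the zero-sum game.
    gain2 = -U(p1, br2) - (-U(p1, p2))
    return gain1 + gain2
```

At an exact NE no deviation gains anything, so exploitability is 0; positive values measure how far the profile is from equilibrium.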
Researcher Affiliation | Academia | Yuheng Li, Panpan Wang & Haipeng Chen, Department of Data Science, College of William & Mary
Pseudocode | Yes | Algorithm 1: Combinatorial-Continuous Double Oracle Reinforcement Learning Algorithm; Algorithm 2: Combinatorial-Continuous Double Oracle Algorithm; Algorithm 3: Combinatorial-Continuous Double Oracle Approximate Algorithm
Open Source Code | Yes | The code of this work is available at https://github.com/wmd3i/CCDO-RL.
Open Datasets | Yes | For ACVRP and ACSP, we use the default data generated from RL4CO (Berto et al., 2023). For PG, we use the instances generated from a generator provided in the AI for TSP competition hosted at IJCAI21.
Dataset Splits | Yes | CCDO-RL is trained on a set of 10,000 graphs (with 20 or 50 nodes). The learned policies are tested on 200 graphs, with 100 randomly selected from the 10,000 training graphs (to show the average reward) and the other 100 being unseen graphs (to test policy generalization).
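The evaluation split described in that row can be sketched as follows. Graph IDs, the seed, and the "generate fresh IDs" step are illustrative stand-ins for the paper's actual graph-generation pipeline.

```python
import random

rng = random.Random(0)                       # assumed seed, for illustration
train_graphs = list(range(10_000))           # the 10,000 training graphs
seen_test = rng.sample(train_graphs, 100)    # 100 drawn from the training set
unseen_test = list(range(10_000, 10_100))    # 100 freshly generated graphs
```

The point of the two halves is that `seen_test` measures average reward on the training distribution, while `unseen_test` checks generalization to graphs never encountered during training.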
Hardware Specification | Yes | All experiments on the three COPs are implemented in Python and conducted on two machines. One has an NVIDIA GeForce RTX 4090 GPU and a 24-core Intel i9-13900K with 24 GB of memory. The other has an NVIDIA V100 GPU and a 32-core Intel Xeon E5-2683 v4 with 256 GB of memory.
Software Dependencies | No | All experiments on the three COPs are implemented in Python and conducted on two machines. Nashpy implementation (Knight and Campbell, 2018).
Experiment Setup | Yes | The hyperparameters of CCDO-RL are specified in Appendix D (Table 3).

Table 3: Parameters of CCDO-RL

Parameter                ACVRP20   ACVRP50   ACSP20    ACSP50    PG20      PG50
Iteration                26        35        24        35        13        12
Batch size (prog/adv)    512       512       512       512       1024      1024
Prog training epochs     10        25        10        20        150       150
Prog training decoder    sampling  sampling  sampling  sampling  sampling  top-p sampling
Prog eval/test decoder   greedy    greedy    greedy    greedy    greedy    beam search
Learning rate (prog)     1e-4      1e-4      1e-4      1e-4      2e-4      2.5e-4
Adv BR training epochs   5         20        20        20        20        50
Learning rate (adv)      1e-4      5e-5      5e-5      5e-5      5e-5      5e-5
Clip range (PPO, adv)    0.2       0.2       0.2       0.2       0.2       0.2
Value func λ (PPO)       0.5       0.5       0.5       0.5       0.5       0.5
Entropy λ (PPO)          0.01      0.0       0.0       0.0       0.0       0.0
Max gradient (PPO)       0.5       0.5       0.5       0.5       0.5       0.5
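For reimplementation, one column of Table 3 can be transcribed into a config structure. The values below are the paper's ACVRP20 settings; the key names and grouping into protagonist/adversary sections are our own and purely illustrative.

```python
# Hypothetical config transcription of Table 3, ACVRP20 column.
ccdo_rl_acvrp20 = {
    "iterations": 26,
    "batch_size": 512,              # protagonist and adversary
    "protagonist": {
        "training_epochs": 10,
        "train_decoder": "sampling",
        "eval_decoder": "greedy",
        "learning_rate": 1e-4,
    },
    "adversary_ppo": {
        "br_training_epochs": 5,
        "learning_rate": 1e-4,
        "clip_range": 0.2,
        "value_coef": 0.5,          # value-function λ
        "entropy_coef": 0.01,       # entropy λ
        "max_grad_norm": 0.5,
    },
}
```

Keeping all six instance columns in such dicts makes it easy to diff settings across problem sizes, e.g. the larger batch and top-p/beam-search decoders used only for PG.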