Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization

Authors: Runlong Zhou, Zelin He, Yuandong Tian, Yi Wu, Simon Shaolei Du

TMLR 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings." |
| Researcher Affiliation | Collaboration | Runlong Zhou (University of Washington); Zelin He (University of Washington); Yuandong Tian (Facebook AI Research); Yi Wu (Tsinghua University); Simon S. Du (University of Washington) |
| Pseudocode | Yes | Algorithm 1: NPG (full version: Algorithm 3). Algorithm 2: Curriculum learning framework. Algorithm 3: Sample-based NPG (full version). Algorithm 4: Sample, a sampler for s ~ d^samp_{m,h}, where m ~ Multinomial(w_1, ..., w_M) and a ~ Unif(A) if unif. |
| Open Source Code | Yes | "Due to page limit, more formulation details and results are presented in Appendix C, and code can be found at https://github.com/zhourunlong/RL-for-Combinatorial-Optimization." |
| Open Datasets | No | "For BCP, each instance is a permutation of length n, and in each round an instance is drawn from an unknown distribution over all permutations. In Online Knapsack, the decision-maker observes n (which is known) items arriving sequentially, each with value v_i and size s_i revealed upon arrival. In ADW, there are n advertisers, each with budget 1, and m ad slots. Each ad slot j arrives sequentially along with a vector (v_{1,j}, v_{2,j}, ..., v_{n,j}), where v_{i,j} is the value that advertiser i is willing to pay for ad slot j." |
| Dataset Splits | Yes | "For the warm-up phases we set n = 10 and for the final phases n = 100." |
| Hardware Specification | Yes | "All the experiments were run on a server with an AMD Ryzen 9 3950X CPU, an NVIDIA GeForce 2080 Super GPU, and 128 GB of memory." |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies with version numbers. |
| Experiment Setup | Yes | "They shared a learning rate of 0.2, a batch size of 100 per step in the horizon, final n = 100, and warm-up n = 10 (if curriculum learning was applied)." |
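For concreteness, the Online Knapsack setting quoted under "Open Datasets" can be rendered as a minimal sequential-decision environment: n items arrive one at a time, each revealing its value v_i and size s_i on arrival, and the decision-maker irrevocably accepts or rejects subject to a capacity budget. This is a sketch under stated assumptions, not the authors' code; the class name `OnlineKnapsackEnv`, the unit capacity, and the uniform instance distribution are all hypothetical choices for illustration.

```python
import random

class OnlineKnapsackEnv:
    """Illustrative sketch of the Online Knapsack setting: n items arrive
    sequentially, each revealing (v_i, s_i) on arrival, and the agent
    accepts or rejects irrevocably subject to a capacity budget.
    All names and mechanics here are hypothetical, not the paper's code."""

    def __init__(self, n, capacity=1.0, seed=None):
        self.n = n
        self.capacity = capacity
        self.rng = random.Random(seed)

    def reset(self):
        # Draw a fresh instance: (value, size) pairs, revealed one per round.
        self.items = [(self.rng.random(), self.rng.random()) for _ in range(self.n)]
        self.t = 0
        self.used = 0.0
        self.total_value = 0.0
        return self.items[0]  # first observation: (v_1, s_1)

    def step(self, accept):
        v, s = self.items[self.t]
        if accept and self.used + s <= self.capacity:
            # Irrevocable acceptance: consume capacity, collect value.
            self.used += s
            self.total_value += v
        self.t += 1
        done = self.t == self.n
        obs = None if done else self.items[self.t]
        return obs, self.total_value, done
```

A trivial policy that accepts whenever the item fits can be rolled out by looping `env.step(True)` until `done`; a learned policy would instead map the revealed (v_i, s_i) and remaining capacity to an accept/reject probability.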
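The curriculum schedule implied by the setup rows (warm up on n = 10, then a final phase on n = 100, sharing a learning rate of 0.2 and a batch size of 100) can be sketched as a two-phase training loop. `train_npg` below is a hypothetical stand-in for the paper's sample-based NPG (Algorithm 3): it only records the arguments a real phase would be invoked with, so the sketch shows the schedule, not the optimizer.

```python
def train_npg(policy, n, lr=0.2, batch_size=100):
    """Hypothetical stand-in for one NPG training phase on horizon n.
    A real implementation would run sample-based NPG updates; this one
    just logs the phase configuration on the policy object."""
    policy.setdefault("history", []).append(
        {"n": n, "lr": lr, "batch_size": batch_size}
    )
    return policy

def curriculum(policy, warmup_n=10, final_n=100, lr=0.2, batch_size=100):
    # Phase 1: warm up the policy on small instances (n = 10 in the paper).
    policy = train_npg(policy, n=warmup_n, lr=lr, batch_size=batch_size)
    # Phase 2: continue training the SAME policy on full-size
    # instances (n = 100), reusing the warm-start parameters.
    policy = train_npg(policy, n=final_n, lr=lr, batch_size=batch_size)
    return policy
```

The key design point the sketch preserves is that the warm-up phase and the final phase share one policy object: curriculum learning here means warm-starting the large-horizon problem from parameters trained on the small one, not training two separate models.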