Reinforcement learning with combinatorial actions for coupled restless bandits
Authors: Lily Xu, Bryan Wilder, Elias Khalil, Milind Tambe
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods, which cannot address sequential planning and combinatorial selection simultaneously, by an average of 24.8% on these difficult instances. |
| Researcher Affiliation | Academia | Lily Xu (1,2), Bryan Wilder (3), Elias B. Khalil (4), Milind Tambe (1); 1: Harvard University, 2: University of Oxford, 3: Carnegie Mellon University, 4: University of Toronto |
| Pseudocode | Yes | Algorithm 1 SEQUOIA for RL with combinatorial actions Input: MDP instance (S, A, P, R) and a set of constraints C A Parameter: Epsilon-greedy parameter ϵ, target update frequency F |
| Open Source Code | Yes | Contact: EMAIL. Code available at: https://github.com/lily-x/combinatorial-rmab |
| Open Datasets | No | For each problem instance, we randomly generate transition probabilities, constraints, and initial states. We ensure consistency by enforcing that transition dynamics and initial states for every problem instance (and every episode) are identical across all methods. We evaluate the expected reward of each algorithm across 50 episodes of length H = 20 and average results over 30 random seeds. The paper describes a simulation setup where data is randomly generated for each problem instance rather than using a fixed publicly available dataset with specific access information. While map data for one problem is mentioned as available in a GitHub repo, it is not the primary experimental dataset. |
| Dataset Splits | No | For each problem instance, we randomly generate transition probabilities, constraints, and initial states. We ensure consistency by enforcing that transition dynamics and initial states for every problem instance (and every episode) are identical across all methods. We evaluate the expected reward of each algorithm across 50 episodes of length H = 20 and average results over 30 random seeds. The paper describes a simulation-based experimental setup where data is randomly generated per instance and evaluated across episodes, but it does not specify fixed training, validation, or test splits for a static dataset. |
| Hardware Specification | Yes | Experiments are run on a cluster running CentOS with Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.1 GHz with 8GB of RAM using Python 3.9.12. |
| Software Dependencies | Yes | Our implementation uses PyTorch and TorchRL (Bou et al., 2024) libraries. We build off code from Dumouchelle et al. (2022) to embed a neural network with single-layer ReLU activations as a MILP (Fischetti & Jo, 2018). ... The MIP was solved using Gurobi optimizer 10.0.2. ... Python 3.9.12. |
| Experiment Setup | Yes | To train the Q-network, we use the hyperparameters specified in Table 2. Table 2 (hyperparameters used to train SEQUOIA): discount factor γ = 0.99; ε-greedy schedule: 0.9 starting, 0.05 final, exponential decay rate 1000; learning rate 2.5×10⁻⁴; Adam ε = 1.5×10⁻⁴; target update rate τ = 1×10⁻⁶; minibatch size 32; replay memory size 10,000; number of training episodes 100; network architecture: two hidden layers of size 32 each. |
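The hyperparameter row describes a fairly standard DQN-style setup (discount factor, ε-greedy exploration with exponential decay, target-network updates). As a minimal sketch of how those quoted values fit together, assuming the usual exponential-decay formula for the exploration rate (the paper's exact schedule and training loop live in the linked repository, not here):

```python
import math

# Hyperparameter values quoted from Table 2 of the paper.
GAMMA = 0.99          # discount factor
EPS_START = 0.9       # starting epsilon-greedy value
EPS_END = 0.05        # final epsilon-greedy value
EPS_DECAY = 1000      # rate of exponential decay

def epsilon(step: int) -> float:
    """Exploration rate after `step` training steps.

    Uses the common DQN exponential-decay schedule; that this is the
    paper's exact formula is an assumption based on the listed values.
    """
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)
```

Under this schedule, exploration starts at 0.9 and decays smoothly toward 0.05, reaching roughly the midpoint after about `EPS_DECAY` steps; the γ = 0.99 discount would then weight future rewards in the usual Q-learning target.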