Multi-Armed Bandits with Interference: Bridging Causal Inference and Adversarial Bandits

Authors: Su Jia, Peter I. Frazier, Nathan Kallus

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We visualize the results in Figures 4 and 5. Consistent with the theoretical analysis, CR outperforms SB in terms of 95% percentile regret, without sacrificing mean regret. Moreover, we observe that when N = T 3, the regret exhibits a smaller deviation. Finally, we observe that when N = T 3, the 95 regret percentile of CR is lower compared to the N = T 2 case. This is reasonable since a larger N helps reduce the variance in the reward estimation.
Researcher Affiliation Academia 1Center for Data Science for Enterprise and Society, Cornell University 2School of Operations Research and Information Engineering, Cornell University.
Pseudocode Yes Algorithm 1 EXP3-HT-IX Policy 1: Input: η (0, 1): learning rate, β [0, 1/2): IX parameter for the HT-IX estimator, (ℓ, r): parameters for the RRP.
Open Source Code No The paper does not contain any explicit statements about releasing source code or provide links to a code repository.
Open Datasets No We consider a 2-armed setting with N units lying on a N lattice. We generate unit-level interference as follows. Each unit u [N] is assigned a random reward Rut.
Dataset Splits No The paper describes generating synthetic data and running experiments on 'instances' but does not provide specific training/test/validation dataset splits for a pre-existing dataset.
Hardware Specification No The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies No The paper mentions comparing 'EXP3-IX' and 'EXP3-IX-HT' policies, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, specific libraries).
Experiment Setup Yes We consider a 2-armed setting with N units lying on a N lattice. We generate unit-level interference as follows. Each unit u [N] is assigned a random reward Rut. Let ρut be the proportion of the five immediate neighbors (counting u itself) assigned arm 1 at time t. Then, the reward at u is (2ρut 1)cut. In two sets of experiments, we assume that N = T 2 and N = T 3 respectively, and let T range from 10, 20, . . . , 50. For each fixed N, T, we randomly generate 100 instances. To necessitate exploration, we add large-scale non-stationarity by randomly generating drifts. Each drift is an 8-piecewise constant function, where the value on each piece is independently drawn from U(0, 1). To align with the theoretical analysis, we partition the lattice into square-shaped clusters of side length N 1/4. For simplicity, we perform a simplified version of the clustering without randomly assigning the boundary units to nearby clusters. We compare the switchback version of EXP3-IX (denoted SB) and our clustered randomization-based policy, dubbed EXP3-IX-HT, and denoted CR in the figures. We evaluated the performance of the two policies by running each of them 200 times for each instance.