Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Distributional Off-Policy Evaluation for Slate Recommendations

Authors: Shreyas Chaudhari, David Arbour, Georgios Theocharous, Nikos Vlassis

AAAI 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.
Researcher Affiliation | Collaboration | Shreyas Chaudhari¹, David Arbour², Georgios Theocharous², Nikos Vlassis²; ¹University of Massachusetts Amherst, ²Adobe Research
Pseudocode | Yes | Algorithm 1: SUnO( )
Open Source Code | Yes | The code is available at: https://github.com/shreyasc-13/suno.
Open Datasets | Yes | We test our estimator on a publicly available dataset MovieLens-20M (Harper and Konstan 2015) and on a semi-synthetic slate simulator Open Bandit Pipeline (Saito et al. 2020).
Dataset Splits | No | The paper uses an "offline dataset" for evaluation and discusses "different logged data sizes" and averaging over trials, but it does not specify explicit train/validation/test splits with percentages or sample counts for data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | For these experiments, we set the number of slots K = 3 and the number of actions in each slot to N = 3. ... Here N = 20, K = 5, = 0.01 and results are averaged over 50 trials. ... We set K = 3, N = 10, and the results are averaged over 10 trials.
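To give a sense of scale for the K and N values quoted under Experiment Setup, the sketch below counts the slates each configuration implies. It assumes a slate is a K-tuple with one of N actions chosen per slot, so the slate space has N^K elements; the labels `setup_a`, `setup_b`, `setup_c` and the helper names are hypothetical and do not come from the paper or its repository.

```python
from itertools import product

# (label, K slots, N actions per slot, trials) as quoted in the summary.
# Labels are hypothetical; the summary does not name the experiments.
# trials=None means the quoted text does not state a trial count.
SETUPS = [
    ("setup_a", 3, 3, None),
    ("setup_b", 5, 20, 50),
    ("setup_c", 3, 10, 10),
]


def num_slates(k: int, n: int) -> int:
    """Size of the slate space, assuming one of n actions per slot."""
    return n ** k


def enumerate_slates(k: int, n: int) -> list:
    """Explicitly list every slate; only practical for small k and n."""
    return list(product(range(n), repeat=k))


SLATE_COUNTS = {label: num_slates(k, n) for label, k, n, _ in SETUPS}

if __name__ == "__main__":
    for label, k, n, trials in SETUPS:
        print(label, "K =", k, "N =", n, "slates =", SLATE_COUNTS[label], "trials =", trials)
```

Under this assumption, the smallest configuration can be enumerated exhaustively, while the N = 20, K = 5 configuration has millions of slates, which is the regime where sample-efficient off-policy estimation matters most.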