Cross-Validated Off-Policy Evaluation

Authors: Matej Cief, Branislav Kveton, Michal Kompan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method empirically and show that it addresses a variety of use cases. ... We empirically evaluate the procedure on estimator selection and hyper-parameter tuning problems using nine real-world datasets.
Researcher Affiliation | Collaboration | Matej Cief (1,2), Branislav Kveton (3), Michal Kompan (2); (1) Brno University of Technology, (2) Kempelen Institute of Intelligent Technologies, (3) Adobe Research
Pseudocode | Yes | Algorithm 1: Off-policy evaluation with cross-validated estimator selection.
Open Source Code | Yes | https://github.com/navarog/cross-validated-ope
Open Datasets | Yes | Datasets. We take nine UCI datasets (Markelle, Longjohn, and Nottingham 2023) and convert them into contextual bandit problems.
Dataset Splits | Yes | In K-fold CV, the dataset is split into K folds. We denote the validation data in the k-th fold by Dk and all other training data by D̂k. ... We split each H into two halves, the bandit feedback dataset Hb and policy learning dataset Hπ. ... OCV is implemented as described in Algorithm 1 with K = 10.
Hardware Specification | No | The paper does not report the hardware used for its experiments (e.g., GPU/CPU models or memory amounts). It only mentions "The work was done at AWS AI Labs.", which is too general.
Software Dependencies | No | The paper mentions 'ridge regression' and 'softmax function' as techniques, but does not specify any software libraries or frameworks with version numbers (e.g., Python 3.x, PyTorch 1.x, scikit-learn x.x).
Experiment Setup | Yes | The reward model f̂ in all relevant estimators is learned using ridge regression with a regularization coefficient 0.001. ... We use β0 = 1 for the logging policy and β1 = 10 for the target policy. ... OCV is implemented as described in Algorithm 1 with K = 10. ... All methods are evaluated in 90 different conditions: 9 UCI ML Repository datasets (Markelle, Longjohn, and Nottingham 2023), two target policies β1 ∈ {−10, 10}, and five logging policies β0 ∈ {−3, −1, 0, 1, 3}.
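The evidence above describes converting UCI classification datasets into contextual bandit problems, with softmax logging and target policies parameterized by inverse temperatures β0 and β1. A minimal Python sketch of that construction follows; the one-hot per-action scores, the function names, and the sampling details are assumptions for illustration, not the paper's exact recipe. Only the reward rule (1 if the sampled action matches the true label) and the β0/β1 parameterization are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def softmax_policy(scores, beta):
    """Softmax policy over per-action scores with inverse temperature beta.
    beta = 0 gives a uniform policy; large |beta| approaches an (anti-)greedy one."""
    z = beta * scores
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def to_bandit_feedback(y, n_classes, beta0):
    """Convert classification labels into logged bandit feedback: actions are
    class labels, reward is 1 iff the sampled action equals the true label.
    One-hot scores on the true label are an assumed (common) scoring choice."""
    scores = np.eye(n_classes)[y]            # shape (n, n_classes)
    pi0 = softmax_policy(scores, beta0)      # logging policy action probabilities
    actions = np.array([rng.choice(n_classes, p=p) for p in pi0])
    rewards = (actions == y).astype(float)
    propensities = pi0[np.arange(len(y)), actions]
    return actions, rewards, propensities, pi0
```

A target policy is built the same way with β1 in place of β0: per the paper's grid, β1 = 10 concentrates probability on the true label while β1 = −10 pushes it away, which is what makes the evaluation off-policy.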
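Algorithm 1 (off-policy evaluation with cross-validated estimator selection, run with K = 10) might be sketched as below. The selection criterion used here, scoring each candidate against an unbiased IPS estimate on the held-out fold, is an illustrative assumption and not necessarily the rule the paper's Algorithm 1 defines; `cross_validated_selection` and the candidate-estimator interface are hypothetical names.

```python
import numpy as np

def ips(r, p_target, p_log):
    """Inverse propensity scoring: unbiased estimate of the target policy's
    value from logged rewards r and logging/target action propensities."""
    return float(np.mean(r * p_target / p_log))

def cross_validated_selection(data, estimators, K=10, seed=0):
    """K-fold CV over logged bandit data (K = 10 in the paper). Each candidate
    estimator is computed on the K-1 training folds and scored against an IPS
    estimate on the validation fold; the candidate with the lowest mean squared
    validation error is selected. NOTE: held-out IPS as the validation target
    is an assumed criterion for this sketch."""
    n = len(data["r"])
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errors = {name: [] for name in estimators}
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        subset = lambda idx: {key: v[idx] for key, v in data.items()}
        target = ips(data["r"][val], data["p_target"][val], data["p_log"][val])
        for name, estimate in estimators.items():
            errors[name].append((estimate(subset(train)) - target) ** 2)
    return min(errors, key=lambda name: float(np.mean(errors[name])))
```

In the paper's setting the candidates would be OPE estimators such as IPS, the direct method with a ridge reward model (regularization 0.001), and doubly robust; in this sketch any callable mapping a logged-data dict to a scalar value estimate can compete.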