Cross-Validated Off-Policy Evaluation
Authors: Matej Cief, Branislav Kveton, Michal Kompan
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method empirically and show that it addresses a variety of use cases. We empirically evaluate the procedure on estimator selection and hyper-parameter tuning problems using nine real-world datasets. |
| Researcher Affiliation | Collaboration | Matej Cief (1,2), Branislav Kveton (3), Michal Kompan (2); 1 Brno University of Technology, 2 Kempelen Institute of Intelligent Technologies, 3 Adobe Research |
| Pseudocode | Yes | Algorithm 1: Off-policy evaluation with cross-validated estimator selection. |
| Open Source Code | Yes | Code https://github.com/navarog/cross-validated-ope |
| Open Datasets | Yes | Datasets. We take nine UCI datasets (Markelle, Longjohn, and Nottingham 2023) and convert them into contextual bandit problems |
| Dataset Splits | Yes | In K-fold CV, the dataset is split into K folds. We denote the validation data in the k-th fold by D_k and all other training data by D̂_k. ... We split each H into two halves, the bandit feedback dataset H_b and policy learning dataset H_π. ... OCV is implemented as described in Algorithm 1 with K = 10. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It only mentions 'The work was done at AWS AI Labs.', which is too general. |
| Software Dependencies | No | The paper mentions 'ridge regression' and 'softmax function' as techniques, but does not specify any software libraries or frameworks with version numbers (e.g., Python 3.x, PyTorch 1.x, scikit-learn x.x). |
| Experiment Setup | Yes | The reward model f̂ in all relevant estimators is learned using ridge regression with a regularization coefficient 0.001. ... We use β0 = 1 for the logging policy and β1 = 10 for the target policy. ... OCV is implemented as described in Algorithm 1 with K = 10. ... All methods are evaluated in 90 different conditions: 9 UCI ML Repository datasets (Markelle, Longjohn, and Nottingham 2023), two target policies β1 ∈ {−10, 10}, and five logging policies β0 ∈ {−3, −1, 0, 1, 3}. |
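The K-fold cross-validation procedure quoted in the table (Algorithm 1 with K = 10) can be sketched as follows. This is a minimal illustration, not the authors' exact Algorithm 1: the selection rule used here (squared gap between a candidate's estimate on the K−1 training folds and an unbiased IPS estimate on the held-out fold) is an assumed stand-in for the paper's criterion, and the `logs` dictionary layout is hypothetical.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Inverse propensity scoring: unbiased value estimate of the target policy."""
    return np.mean(rewards * target_probs / logging_probs)

def cross_validated_selection(logs, estimators, K=10, seed=0):
    """K-fold estimator selection sketch (assumed criterion, not the paper's
    exact rule): score each candidate by the squared gap between its estimate
    on the training folds and an unbiased IPS estimate on the held-out fold."""
    n = len(logs["r"])
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % K  # balanced fold assignment
    errors = {name: 0.0 for name in estimators}
    for k in range(K):
        train, val = folds != k, folds == k
        # unbiased proxy for the true policy value on the held-out fold
        target = ips_estimate(logs["r"][val], logs["p1"][val], logs["p0"][val])
        for name, est in estimators.items():
            pred = est({key: v[train] for key, v in logs.items()})
            errors[name] += (pred - target) ** 2 / K
    return min(errors, key=errors.get)
```

With a well-behaved logged dataset, an unbiased estimator such as IPS should beat a degenerate candidate (e.g. one that always predicts zero) under this criterion.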
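The setup converts UCI classification datasets into contextual bandit problems with softmax policies parameterized by an inverse temperature β (β0 = 1 logging, β1 = 10 target). A common recipe for this conversion, which we assume matches the paper's, is sketched below; the function names and the use of per-class scores are illustrative, not taken from the paper.

```python
import numpy as np

def softmax_policy(scores, beta):
    """Action distribution pi_beta(a|x) proportional to exp(beta * score).
    Larger |beta| makes the policy more deterministic; beta < 0 prefers
    low-scoring actions (as in the paper's negative logging temperatures)."""
    z = beta * scores
    z -= z.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def to_bandit_feedback(scores, labels, beta0=1.0, seed=0):
    """Supervised-to-bandit conversion (standard recipe, assumed here):
    sample actions from the softmax logging policy and observe reward 1
    iff the sampled action equals the true class label."""
    rng = np.random.default_rng(seed)
    p0 = softmax_policy(scores, beta0)
    n, m = p0.shape
    actions = np.array([rng.choice(m, p=p0[i]) for i in range(n)])
    rewards = (actions == labels).astype(float)
    return actions, rewards, p0[np.arange(n), actions]
```

A near-greedy policy (large positive β) mostly picks the true label and so observes mostly reward 1, while β near 0 approaches uniform logging.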
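The table also notes that the reward model f̂ is fit by ridge regression with regularization coefficient 0.001. A direct-method estimate built on such a model could look like the sketch below; the one-hot action featurization is an assumption, not something the paper specifies.

```python
import numpy as np

def direct_method_estimate(X, actions, rewards, target_probs, alpha=1e-3):
    """Direct-method sketch: fit a ridge reward model f(x, a) on logged data
    (regularization 0.001 as quoted from the paper) and average its
    predictions under the target policy. One-hot action features are an
    illustrative assumption."""
    n = len(actions)
    m = target_probs.shape[1]
    # features: context concatenated with a one-hot encoding of the action
    Phi = np.hstack([X, np.eye(m)[actions]])
    w = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(Phi.shape[1]),
                        Phi.T @ rewards)
    # evaluate f(x, a) for every action and weight by the target policy
    value = 0.0
    for a in range(m):
        Phi_a = np.hstack([X, np.tile(np.eye(m)[a], (n, 1))])
        value += np.mean(target_probs[:, a] * (Phi_a @ w))
    return value
```

The small ridge coefficient keeps the normal equations well conditioned while barely shrinking the fit, consistent with the quoted value of 0.001.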