Cross-Validated Off-Policy Evaluation

Authors: Matej Cief, Branislav Kveton, Michal Kompan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method empirically and show that it addresses a variety of use cases. ... We empirically evaluate the procedure on estimator selection and hyper-parameter tuning problems using nine real-world datasets.
Researcher Affiliation | Collaboration | Matej Cief (1,2), Branislav Kveton (3), Michal Kompan (2); (1) Brno University of Technology, (2) Kempelen Institute of Intelligent Technologies, (3) Adobe Research
Pseudocode | Yes | Algorithm 1: Off-policy evaluation with cross-validated estimator selection.
Open Source Code | Yes | https://github.com/navarog/cross-validated-ope
Open Datasets | Yes | Datasets. We take nine UCI datasets (Markelle, Longjohn, and Nottingham 2023) and convert them into contextual bandit problems.
Dataset Splits | Yes | In K-fold CV, the dataset is split into K folds. We denote the validation data in the k-th fold by Dk and all other training data by D̂k. ... We split each H into two halves, the bandit feedback dataset Hb and policy learning dataset Hπ. ... OCV is implemented as described in Algorithm 1 with K = 10.
Hardware Specification | No | The paper does not report the hardware used for its experiments (e.g., GPU/CPU models or memory amounts). It only mentions "The work was done at AWS AI Labs.", which is too general.
Software Dependencies | No | The paper mentions 'ridge regression' and 'softmax function' as techniques, but does not specify any software libraries or frameworks with version numbers (e.g., Python 3.x, PyTorch 1.x, scikit-learn x.x).
Experiment Setup | Yes | The reward model f̂ in all relevant estimators is learned using ridge regression with a regularization coefficient 0.001. ... We use β0 = 1 for the logging policy and β1 = 10 for the target policy. ... OCV is implemented as described in Algorithm 1 with K = 10. ... All methods are evaluated in 90 different conditions: 9 UCI ML Repository datasets (Markelle, Longjohn, and Nottingham 2023), two target policies β1 ∈ {−10, 10}, and five logging policies β0 ∈ {−3, −1, 0, 1, 3}.
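The evidence above describes converting UCI classification datasets into contextual bandit problems, with softmax logging and target policies parameterized by inverse temperatures β0 and β1. A minimal Python sketch of that construction follows; the one-hot per-action scores, the function names, and the sampling details are assumptions for illustration, not the paper's exact recipe. Only the reward rule (1 if the sampled action matches the true label) and the β0/β1 parameterization are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def softmax_policy(scores, beta):
    """Softmax policy over per-action scores with inverse temperature beta.
    beta = 0 gives a uniform policy; large |beta| approaches an (anti-)greedy one."""
    z = beta * scores
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def to_bandit_feedback(y, n_classes, beta0):
    """Convert classification labels into logged bandit feedback: actions are
    class labels, reward is 1 iff the sampled action equals the true label.
    One-hot scores on the true label are an assumed (common) scoring choice."""
    scores = np.eye(n_classes)[y]            # shape (n, n_classes)
    pi0 = softmax_policy(scores, beta0)      # logging policy action probabilities
    actions = np.array([rng.choice(n_classes, p=p) for p in pi0])
    rewards = (actions == y).astype(float)
    propensities = pi0[np.arange(len(y)), actions]
    return actions, rewards, propensities, pi0
```

A target policy is built the same way with β1 in place of β0: per the paper's grid, β1 = 10 concentrates probability on the true label while β1 = −10 pushes it away, which is what makes the evaluation off-policy.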
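Algorithm 1 (off-policy evaluation with cross-validated estimator selection, run with K = 10) might be sketched as below. The selection criterion used here, scoring each candidate against an unbiased IPS estimate on the held-out fold, is an illustrative assumption and not necessarily the rule the paper's Algorithm 1 defines; `cross_validated_selection` and the candidate-estimator interface are hypothetical names.

```python
import numpy as np

def ips(r, p_target, p_log):
    """Inverse propensity scoring: unbiased estimate of the target policy's
    value from logged rewards r and logging/target action propensities."""
    return float(np.mean(r * p_target / p_log))

def cross_validated_selection(data, estimators, K=10, seed=0):
    """K-fold CV over logged bandit data (K = 10 in the paper). Each candidate
    estimator is computed on the K-1 training folds and scored against an IPS
    estimate on the validation fold; the candidate with the lowest mean squared
    validation error is selected. NOTE: held-out IPS as the validation target
    is an assumed criterion for this sketch."""
    n = len(data["r"])
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errors = {name: [] for name in estimators}
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        subset = lambda idx: {key: v[idx] for key, v in data.items()}
        target = ips(data["r"][val], data["p_target"][val], data["p_log"][val])
        for name, estimate in estimators.items():
            errors[name].append((estimate(subset(train)) - target) ** 2)
    return min(errors, key=lambda name: float(np.mean(errors[name])))
```

In the paper's setting the candidates would be OPE estimators such as IPS, the direct method with a ridge reward model (regularization 0.001), and doubly robust; in this sketch any callable mapping a logged-data dict to a scalar value estimate can compete.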