Counterfactual Learning of Stochastic Policies with Continuous Actions

Authors: Houssam Zenati, Alberto Bietti, Matthieu Martin, Eustache Diemert, Pierre Gaillard, Julien Mairal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings, outlined below, are based on experimental results validated across diverse tasks, demonstrating the effectiveness of counterfactual learning for stochastic policies with continuous actions. We provide an empirical evaluation of the aspects of CRM addressed in this paper: policy class modelling (CLP), estimation with soft-clipping, optimization with PPA, and offline model selection and evaluation. We conduct this study on synthetic and semi-synthetic datasets and on the real-world CoCoA dataset.
Researcher Affiliation | Collaboration | Houssam Zenati (EMAIL), Gatsby Computational Neuroscience Unit, University College London; Alberto Bietti (EMAIL), Center for Computational Mathematics, Flatiron Institute, New York, NY, USA; Matthieu Martin (EMAIL) and Eustache Diemert (EMAIL), Criteo AI Lab; Pierre Gaillard (EMAIL) and Julien Mairal (EMAIL), Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Pseudocode | Yes | Algorithm 1: Construction of the CLP Policy Set; Algorithm 2: Evaluation Protocol
Open Source Code | Yes | The code to reproduce our experiments can be found at https://github.com/criteo-research/optimization-continuous-action-crm.
Open Datasets | Yes | Our last contribution is a small step towards solving this challenge: a new offline evaluation benchmark along with a new large-scale dataset, which we call CoCoA, obtained from a real-world system. The key idea is to introduce importance sampling diagnostics (Owen, 2013) to discard unreliable solutions, along with significance tests to assess improvements over a reference policy. The CoCoA dataset is available at https://drive.google.com/open?id=1GWcYFYNqx-TSvx1bbcbuunOdMrLe2733.
Dataset Splits | Yes | For synthetic datasets, we generate training, validation, and test sets of 10,000 samples each. For the CoCoA dataset, we use a 50%-25%-25% training-validation-test split.
Hardware Specification | Yes | We provide code for reproducibility, and all experiments were run on a CPU cluster, each node consisting of 24 CPU cores (2 x Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz) with 500GB of RAM.
Software Dependencies | No | In our experiments, we chose L-BFGS because it performed best among the solvers we tried (nonlinear conjugate gradient (CG) and Newton), and we used 10 PPA iterations. No specific software version numbers are provided.
Experiment Setup | Yes | Table 5 lists the hyperparameters used to reproduce all results; the hyperparameter grid is larger for synthetic data. For experiments involving anchor points, we validated the number of anchor points and kernel bandwidths in the same way as the other hyperparameters.