Counterfactual Learning of Stochastic Policies with Continuous Actions

Authors: Houssam Zenati, Alberto Bietti, Matthieu Martin, Eustache Diemert, Pierre Gaillard, Julien Mairal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings, outlined below, are based on experimental results validated across diverse tasks, demonstrating the effectiveness of counterfactual learning for stochastic policies with continuous actions. We provide an empirical evaluation of the aspects of CRM addressed in this paper: policy class modelling (CLP), estimation with soft-clipping, optimization with PPA, and offline model selection and evaluation. We conduct this study on synthetic and semi-synthetic datasets and on the real-world CoCoA dataset.
Researcher Affiliation | Collaboration | Houssam Zenati (EMAIL), Gatsby Computational Neuroscience Unit, University College London; Alberto Bietti (EMAIL), Center for Computational Mathematics, Flatiron Institute, New York, NY, USA; Matthieu Martin (EMAIL) and Eustache Diemert (EMAIL), Criteo AI Lab; Pierre Gaillard (EMAIL) and Julien Mairal (EMAIL), Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Pseudocode | Yes | Algorithm 1: Construction of the CLP Policy Set; Algorithm 2: Evaluation Protocol
Open Source Code | Yes | The code to reproduce our experiments can be found at https://github.com/criteo-research/optimization-continuous-action-crm.
Open Datasets | Yes | Our last contribution is a small step towards solving this challenge: a new offline evaluation benchmark along with a new large-scale dataset, which we call CoCoA, obtained from a real-world system. The key idea is to introduce importance sampling diagnostics (Owen, 2013) to discard unreliable solutions, along with significance tests to assess improvements over a reference policy. The CoCoA dataset is available at https://drive.google.com/open?id=1GWcYFYNqx-TSvx1bbcbuunOdMrLe2733.
Dataset Splits | Yes | For synthetic datasets, we generate training, validation, and test sets of 10,000 samples each. For the CoCoA dataset, we use a 50%-25%-25% training-validation-test split.
Hardware Specification | Yes | We provide code for reproducibility, and all experiments were run on a CPU cluster, each node consisting of 24 CPU cores (2 x Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz) with 500GB of RAM.
Software Dependencies | No | In our experiments, we chose L-BFGS because it performed best among the solvers we tried (nonlinear conjugate gradient (CG) and Newton), and we used 10 PPA iterations. No specific software version numbers are provided.
Experiment Setup | Yes | Table 5 lists the hyperparameters used to reproduce all results; the hyperparameter grid is larger for synthetic data. For experiments involving anchor points, we validated the number of anchor points and kernel bandwidths in the same way as the other hyperparameters.