Counterfactual Learning of Stochastic Policies with Continuous Actions
Authors: Houssam Zenati, Alberto Bietti, Matthieu Martin, Eustache Diemert, Pierre Gaillard, Julien Mairal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings, outlined below, are based on experimental results and validated through diverse tasks, demonstrating the effectiveness of counterfactual learning for stochastic policies with continuous actions. We now provide an empirical evaluation of the various aspects of CRM addressed in this paper, such as policy class modelling (CLP), estimation with soft-clipping, optimization with PPA, and offline model selection and evaluation. We conduct such a study on synthetic and semi-synthetic datasets and on the real-world CoCoA dataset. |
| Researcher Affiliation | Collaboration | Houssam Zenati EMAIL Gatsby Computational Neuroscience Unit, University College London; Alberto Bietti EMAIL Center for Computational Mathematics, Flatiron Institute, New York, NY, USA; Matthieu Martin EMAIL Eustache Diemert EMAIL Criteo AI Lab; Pierre Gaillard EMAIL Julien Mairal EMAIL Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France |
| Pseudocode | Yes | Algorithm 1: Construction of the CLP Policy Set; Algorithm 2: Evaluation Protocol |
| Open Source Code | Yes | The code to reproduce our experiments can be found at https://github.com/criteo-research/optimization-continuous-action-crm. |
| Open Datasets | Yes | Our last contribution is a small step towards solving this challenge, and consists of a new offline evaluation benchmark along with a new large-scale dataset, which we call CoCoA, obtained from a real-world system. The key idea is to introduce importance sampling diagnostics (Owen, 2013) to discard unreliable solutions along with significance tests to assess improvements to a reference policy. The CoCoA dataset is available at the following link: https://drive.google.com/open?id=1GWcYFYNqx-TSvx1bbcbuunOdMrLe2733. |
| Dataset Splits | Yes | For synthetic datasets, we generate training, validation, and test sets of size 10 000 each. For the CoCoA dataset, we use a 50%-25%-25% training-validation-test split. |
| Hardware Specification | Yes | We provide code for reproducibility, and all experiments were run on a CPU cluster, each node consisting of 24 CPU cores (2 x Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz) with 500GB of RAM. |
| Software Dependencies | No | In our experiments, we chose L-BFGS because it performed best among the solvers we tried (nonlinear conjugate gradient (CG) and Newton), and we used 10 PPA iterations. No specific software version numbers are provided. |
| Experiment Setup | Yes | In Table 5 we show the hyperparameters considered to run the experiments to reproduce all the results. Note that the grid of hyperparameters is larger for synthetic data. For our experiments involving anchor points, we validated the number of anchor points and kernel bandwidths similarly to other hyperparameters. |
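The Open Datasets row quotes the paper's use of importance sampling diagnostics (Owen, 2013) to discard unreliable solutions. A standard diagnostic of this kind is the effective sample size (ESS) of the importance weights; the sketch below shows that diagnostic only, not the paper's exact rule, and the 1% threshold is an illustrative assumption.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS of importance weights: (sum w)^2 / sum w^2 (Owen, 2013).

    Equals n for uniform weights and approaches 1 when one weight
    dominates, signalling an unreliable importance sampling estimate.
    """
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def is_reliable(weights, min_ess_fraction=0.01):
    # Flag a candidate policy as unreliable when the ESS of its
    # importance weights falls below a fraction of the sample size.
    # The threshold value here is illustrative, not the paper's.
    return effective_sample_size(weights) >= min_ess_fraction * len(weights)
```

With uniform weights the ESS equals the sample size, so the policy passes the check; a single dominating weight drives the ESS toward 1 and fails it.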
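The Dataset Splits row reports a 50%-25%-25% training-validation-test split for the CoCoA dataset. A minimal index-based sketch of such a split is below; the shuffling, seed, and function name are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def train_val_test_split(n, seed=0):
    """Return index arrays for a 50%-25%-25% split of n examples.

    Shuffling with a fixed seed is an assumption made for
    reproducibility of this sketch.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_val = n // 2, n // 4
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

For n = 100 this yields disjoint index sets of sizes 50, 25, and 25 that together cover all examples.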
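The Software Dependencies row notes that L-BFGS performed best among the solvers tried. A minimal sketch of calling an L-BFGS solver through SciPy is shown below; the quadratic objective is a toy stand-in, not the paper's CRM objective, and the use of SciPy itself is an assumption since the report lists no dependencies.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(theta):
    # Toy smooth objective standing in for the CRM risk estimate.
    # Returns (value, gradient) so the solver can use exact gradients.
    return np.sum((theta - 1.0) ** 2), 2.0 * (theta - 1.0)

# L-BFGS-B is SciPy's limited-memory BFGS implementation.
result = minimize(toy_objective, x0=np.zeros(3), jac=True, method="L-BFGS-B")
```

The solver converges to the minimizer theta = (1, 1, 1) of the toy objective; in the paper's setting the same call pattern would wrap the (soft-clipped) counterfactual risk estimate instead.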