Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning

Authors: Armin Behnamnia, Gholamali Aminian, Alireza Aghaei, Chengchun Shi, Vincent Tan, Hamid R. Rabiee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach."
Researcher Affiliation | Academia | "1 Department of Computer Engineering, Sharif University of Technology; 2 The Alan Turing Institute, London, UK; 3 Department of Computer Science, Stony Brook University; 4 Department of Statistics, London School of Economics; 5 Department of Electrical and Computer Engineering, National University of Singapore."
Pseudocode | No | The paper describes the Log-Sum-Exponential estimator, its theoretical foundations, and its experiments, but does not present any formal pseudocode or algorithm blocks.
Open Source Code | Yes | "The code for our estimator is available at the following link: https://github.com/armin-behnamnia/lse-offpolicy-learning."
Open Datasets | Yes | "Datasets: In off-policy learning scenario, we apply the standard supervised-to-bandit transformation (Beygelzimer & Langford, 2009) on a classification dataset: Extended MNIST (EMNIST) (Xiao et al., 2017) to generate the LBF dataset. We also run on FMNIST in App. G.2." "We applied our method to KuaiRec, a public real-world recommendation-system dataset (Gao et al., 2022)." "We evaluate our method's performance in OPE by conducting experiments on 5 UCI classification datasets, as explained in Table 32."
Dataset Splits | Yes | "For LSE and ES, we use 0.2 of the dataset as a validation set to find the hyperparameter with the lowest MSE by grid search and evaluate the method on the remaining 0.8 of the dataset."
Hardware Specification | Yes | "Computational resources: We ran all our experiments on 3 servers: one with an NVIDIA GTX 1080 Ti, one with two NVIDIA RTX 4090s, and one with three NVIDIA RTX 2070 Super GPUs."
Software Dependencies | No | "The code for this study is written in Python. We use PyTorch for the training of our model. The supplementary material includes a zip file named rl_without_reward.zip with the following files: ... requirements.txt contains the Python libraries required to reproduce our results."
Experiment Setup | Yes | "We use mini-batch SGD as an optimizer for all experiments. The learning rate used for EMNIST and FMNIST datasets is 0.001. Furthermore, we use early stopping in our training phase and the maximum number of epochs is 300." and "Table 8: Hyperparameter of Different Estimators"
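Since the paper itself gives no pseudocode, here is a minimal sketch of what a log-sum-exponential off-policy value estimate looks like, assuming the standard form V_lambda = (1/lambda) log((1/n) sum_i exp(lambda w_i r_i)) with importance weights w_i = pi(a_i|x_i) / pi_0(a_i|x_i); the paper's exact parameterization may differ, so treat the function name and signature as illustrative:

```python
import math

def lse_estimator(rewards, target_probs, behavior_probs, lam=-1.0):
    """Sketch of a log-sum-exp off-policy value estimate (illustrative).

    Computes (1/lam) * log(mean_i exp(lam * w_i * r_i)), where
    w_i = target_probs[i] / behavior_probs[i] is the importance weight.
    As lam -> 0 this recovers the plain inverse-propensity-scoring mean;
    negative lam damps the influence of heavy-tailed weighted rewards.
    """
    n = len(rewards)
    terms = [
        math.exp(lam * (p / q) * r)
        for r, p, q in zip(rewards, target_probs, behavior_probs)
    ]
    return math.log(sum(terms) / n) / lam
```

With equal target and behavior propensities the weights are 1, so a near-zero `lam` reproduces the sample-mean reward, while a strongly negative `lam` gives a more conservative (smaller) estimate.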
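The supervised-to-bandit transformation cited under Open Datasets (Beygelzimer & Langford, 2009) converts a classification dataset into logged bandit feedback. A sketch of the standard recipe follows; the epsilon-uniform logging policy here is an illustrative choice, not necessarily the one used to build the paper's LBF dataset:

```python
import random

def supervised_to_bandit(features, labels, num_classes, epsilon=0.3, seed=0):
    """Standard supervised-to-bandit conversion (illustrative logging policy).

    A logging policy that is uniform with probability epsilon and picks the
    true label otherwise chooses one action per example; only that action's
    binary reward (1 if it matches the true label) is recorded, together
    with its propensity under the logging policy.
    """
    rng = random.Random(seed)
    logged = []
    for x, y in zip(features, labels):
        if rng.random() < epsilon:
            a = rng.randrange(num_classes)  # explore: uniform action
        else:
            a = y                           # exploit: the true label
        # propensity of the chosen action under this logging policy
        prop = epsilon / num_classes + (1.0 - epsilon) * (a == y)
        r = 1.0 if a == y else 0.0          # bandit feedback only
        logged.append((x, a, r, prop))
    return logged
```

The resulting tuples (context, action, reward, propensity) are exactly what importance-weighted estimators such as LSE consume.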
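The training setup under Experiment Setup (early stopping with a hard cap of 300 epochs) can be captured by a small helper; the patience value below is an assumption, since the excerpt does not report one:

```python
class EarlyStopping:
    """Early stopping as described in the report: halt when validation loss
    stops improving for `patience` epochs, or at `max_epochs` at the latest.
    The patience value is an illustrative assumption."""

    def __init__(self, patience=10, max_epochs=300):
        self.patience = patience
        self.max_epochs = max_epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best:
            self.best = val_loss       # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1       # no improvement this epoch
        return self.bad_epochs >= self.patience or epoch + 1 >= self.max_epochs
```

A training loop would call `should_stop(epoch, val_loss)` once per epoch after the validation pass and break when it returns True.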