Learning Utilities from Demonstrations in Markov Decision Processes

Authors: Filippo Lazzati, Alberto Maria Metelli

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "6. Numerical Simulations: In this section, we present proof-of-concept experiments using data collected from lab members to provide empirical evidence to support both our model and algorithms."
Researcher Affiliation | Academia | "Politecnico di Milano, Milan, Italy. Correspondence to: Filippo Lazzati <EMAIL>."
Pseudocode | Yes | "Algorithm 1 CATY-UL. Input: data {D^E_i}_i, threshold, utility U, discretization ϵ0, dynamics {p_i}_i"
Open Source Code | No | The paper does not contain any explicit statement about providing open-source code or a link to a code repository.
Open Datasets | No | "We asked 15 participants to describe the actions they would play in an MDP with horizon H = 5 (see Appendix F), varying the state, the stage, and the cumulative reward collected. The reward has a monetary interpretation. To answer the questions, the participants were provided with complete information about the MDP. The data collected is not personal."
Dataset Splits | No | "We asked 15 participants to describe the actions they would play in an MDP... We consider the policy of the 10th participant (chosen arbitrarily) in the survey, and we execute TRACTOR-UL multiple times with varying values of the input parameters..."
Hardware Specification | Yes | "The experiment was conducted in a few hours on a personal computer with an AMD Ryzen 5 5500U processor with Radeon Graphics (2.10 GHz) and 8.00 GB of RAM."
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for its methodology.
Experiment Setup | Yes | "We always use K = 10000 trajectories for estimating the return distribution of the 10th participant's policy, and the return distributions of the optimal policies computed along the way; we make 5 runs with each combination of parameters with different seeds. We execute for T = 70 iterations using Lipschitz constant L = 10... As initial utility U_0, we try Usqrt, Usquare, and Ulinear (see Appendix F.3), and as learning rates we try 0.01, 0.5, 5, 100, 1000, 10000."
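The experiment setup above describes a grid of 3 initial utilities × 6 learning rates × 5 seeds, i.e. 90 runs of T = 70 iterations each. A minimal sketch of that sweep, where `run_tractor_ul` is a hypothetical placeholder and not the authors' implementation:

```python
from itertools import product

# Sweep described in the Experiment Setup row (values taken from the quote).
INITIAL_UTILITIES = ["Usqrt", "Usquare", "Ulinear"]
LEARNING_RATES = [0.01, 0.5, 5, 100, 1000, 10000]
SEEDS = range(5)       # 5 runs per parameter combination
T = 70                 # iterations per run
L = 10                 # Lipschitz constant
K = 10_000             # trajectories per return-distribution estimate

def run_tractor_ul(u0, lr, seed, n_iters=T):
    """Placeholder for a single TRACTOR-UL run; returns its configuration."""
    return {"u0": u0, "lr": lr, "seed": seed, "iters": n_iters}

runs = [run_tractor_ul(u0, lr, s)
        for u0, lr, s in product(INITIAL_UTILITIES, LEARNING_RATES, SEEDS)]
print(len(runs))  # 3 * 6 * 5 = 90 configurations
```

This only enumerates configurations; the actual per-run computation (estimating return distributions from K trajectories and updating the utility) is in the paper's Section 6 and Appendix F.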