Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback
Authors: Simone Drago, Marco Mussi, Alberto Maria Metelli
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we address a more realistic scenario: sequential decision making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking preferences, (non-Markovian) utilities, and (Markovian) rewards, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories... Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is NP-hard. Third, we propose a novel concept of preference-based policy dominance... Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision making from preference feedback... |
| Researcher Affiliation | Academia | ¹Politecnico di Milano, Milan, Italy. Correspondence to: Simone Drago <EMAIL>. |
| Pseudocode | No | The paper describes its methods in prose, proposing a heuristic (e.g., "We propose a method to construct a multi-dimensional utility function u that is compatible with $\preceq_T$ based on dividing the problem into three phases...") and citing an existing algorithm (e.g., "Caceres et al. (2022) proposes an algorithm that runs in $O(w^2 \lvert T \rvert + \lvert E \rvert)$."), but it does not include any pseudocode or algorithm blocks within the main text. |
| Open Source Code | No | The paper does not contain any explicit statements about the release of source code or links to code repositories. |
| Open Datasets | No | The paper is theoretical and does not describe experiments using specific datasets. It references "data collected by eliciting pairwise human preferences" in the context of large language models but does not use such data for its own contributions. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation on datasets, thus no dataset split information is provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any experiments that would require hardware specifications. |
| Software Dependencies | No | The paper is theoretical and does not describe any experimental setup that would require specific software dependencies and versions. It mentions "convex optimization tools (Boyd and Vandenberghe, 2004)" but this refers to a textbook, not a specific software package with a version. |
| Experiment Setup | No | The paper is theoretical and does not present an experimental setup with specific hyperparameters or training configurations. |