Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback

Authors: Simone Drago, Marco Mussi, Alberto Maria Metelli

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work, we address a more realistic scenario: sequential decision making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking preferences, (non-Markovian) utilities, and (Markovian) rewards, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories... Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is NP-hard. Third, we propose a novel concept of preference-based policy dominance... Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision making from preference feedback...
Researcher Affiliation | Academia | Politecnico di Milano, Milan, Italy. Correspondence to: Simone Drago <EMAIL>.
Pseudocode | No | The paper describes methods and proposes a heuristic (e.g., "We propose a method to construct a multi-dimensional utility function u that is compatible with $\preceq_T$ based on dividing the problem into three phases...") and an algorithm (e.g., "Caceres et al. (2022) proposes an algorithm that runs in $O(w^2 |T| + |E|)$.") but does not include any pseudocode or algorithm blocks within the main text.
Open Source Code | No | The paper does not contain any explicit statements about the release of source code or links to code repositories.
Open Datasets | No | The paper is theoretical and does not describe experiments using specific datasets. It references "data collected by eliciting pairwise human preferences" in the context of large language models but does not use such data for its own contributions.
Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation on datasets, thus no dataset split information is provided.
Hardware Specification | No | The paper is theoretical and does not describe any experiments that would require hardware specifications.
Software Dependencies | No | The paper is theoretical and does not describe any experimental setup that would require specific software dependencies and versions. It mentions "convex optimization tools (Boyd and Vandenberghe, 2004)" but this refers to a textbook, not a specific software package with a version.
Experiment Setup | No | The paper is theoretical and does not present an experimental setup with specific hyperparameters or training configurations.
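
The Research Type entry mentions a notion of preference-utility compatibility without defining it. As a rough illustration only, the sketch below assumes that a multi-dimensional utility is compatible with a preorder when one trajectory is weakly preferred to another exactly when its utility vector is at least as large in every component. That definition, the function names (`dominates`, `is_compatible`), and the toy trajectories are assumptions made for this example and are not taken from the paper.

```python
from itertools import product


def dominates(u_a, u_b):
    """Return True if u_b is at least u_a in every component."""
    return all(x <= y for x, y in zip(u_a, u_b))


def is_compatible(trajectories, weakly_preferred, utility):
    """Check the ASSUMED compatibility condition on every ordered pair.

    A pair (a, b) in `weakly_preferred` means b is weakly preferred to a.
    Reflexive pairs are treated as always being in the relation.
    """
    for a, b in product(trajectories, repeat=2):
        in_relation = a == b or (a, b) in weakly_preferred
        if in_relation != dominates(utility[a], utility[b]):
            return False
    return True


# Hypothetical toy instance: t2 and t3 are both preferred to t1,
# but t2 and t3 are incomparable with each other.
trajectories = ["t1", "t2", "t3"]
weakly_preferred = {("t1", "t2"), ("t1", "t3")}

utility_2d = {"t1": (0, 0), "t2": (1, 0), "t3": (0, 1)}
utility_1d = {"t1": (0,), "t2": (1,), "t3": (2,)}

compatible_2d = is_compatible(trajectories, weakly_preferred, utility_2d)
compatible_1d = is_compatible(trajectories, weakly_preferred, utility_1d)
```

Under this assumed definition, the two-dimensional utility represents the incomparable pair faithfully, while the scalar utility forces an order between t2 and t3 and so fails the check. This shows why more than one utility dimension can be needed for a partial order.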