reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback

Authors: Simone Drago, Marco Mussi, Alberto Maria Metelli

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this work, we address a more realistic scenario: sequential decision making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking preferences, (non-Markovian) utilities, and (Markovian) rewards, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories... Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is NP-hard. Third, we propose a novel concept of preference-based policy dominance... Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision making from preference feedback...
Researcher Affiliation	Academia	Politecnico di Milano, Milan, Italy. Correspondence to: Simone Drago <EMAIL>.
Pseudocode	No	The paper describes methods and proposes a heuristic (e.g., "We propose a method to construct a multi-dimensional utility function u that is compatible with ≼_T based on dividing the problem into three phases...") and an algorithm (e.g., "Caceres et al. (2022) proposes an algorithm that runs in O(w^2 |T| |E|).") but does not include any pseudocode or algorithm blocks within the main text.
Open Source Code	No	The paper does not contain any explicit statements about the release of source code or links to code repositories.
Open Datasets	No	The paper is theoretical and does not describe experiments using specific datasets. It references "data collected by eliciting pairwise human preferences" in the context of large language models but does not use such data for its own contributions.
Dataset Splits	No	The paper is theoretical and does not involve empirical evaluation on datasets, thus no dataset split information is provided.
Hardware Specification	No	The paper is theoretical and does not describe any experiments that would require hardware specifications.
Software Dependencies	No	The paper is theoretical and does not describe any experimental setup that would require specific software dependencies and versions. It mentions "convex optimization tools (Boyd and Vandenberghe, 2004)" but this refers to a textbook, not a specific software package with a version.
Experiment Setup	No	The paper is theoretical and does not present an experimental setup with specific hyperparameters or training configurations.