How to Find the Exact Pareto Front for Multi-Objective MDPs?

Authors: Yining Li, Peizhong Ju, Ness Shroff

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical studies demonstrate the effectiveness of our theoretical strategy in discovering the Pareto front efficiently. In this section, we empirically evaluate the performance of our algorithm in discovering the full Pareto front, including the vertices that correspond to deterministic Pareto-optimal policies in MO-MDPs. We compare the running time of our algorithm against OLS.
Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, The Ohio State University; Department of Computer Science, University of Kentucky; Department of Computer Science and Engineering, The Ohio State University
Pseudocode | Yes | The pseudocode for the Pareto front searching algorithm is presented in Algorithm 1. While the primary objective of Algorithm 1 is to identify the entire Pareto front, a byproduct of the algorithm is the identification of all vertices on the Pareto front that correspond to all deterministic Pareto-optimal policies. Algorithm 2: Pareto Optimal Face Selection. Algorithm 3: PPrune(V) (Roijers, 2016). Algorithm 4: Benchmark Algorithm.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. The acknowledgments mention grant support but no code.
Open Datasets | No | We consider a basic MDP setting with only 4 states and 3 actions, and a reward dimension of 3. The transition kernel is drawn uniformly between 0 and 1 and normalized to sum to 1, and the reward functions are drawn uniformly between 0 and 1. The discount factor is 0.9. We adopt the deep-sea-treasure setting from Yang et al. (2019a) and construct a scenario where the agent navigates a grid world, making trade-offs between different objectives.
Dataset Splits | No | The paper describes synthetic data generation and environment construction for the experiments, such as a "basic MDP setting" and a "deep-sea-treasure setting". It does not involve a pre-existing dataset that would require explicit training/validation/test splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not name any specific software libraries or version numbers used for the implementation or experiments.
Experiment Setup | Yes | We consider a basic MDP setting with only 4 states and 3 actions, and a reward dimension of 3. The transition kernel is drawn uniformly between 0 and 1 and normalized to sum to 1, and the reward functions are drawn uniformly between 0 and 1. The discount factor is 0.9. Specifically, we consider state spaces of size 5 and 8, action spaces ranging from 5 to 7, and a reward dimension of 3. In each iteration, OLS solves a single-objective MDP for a given preference vector and computes new preference vectors by constructing the convex hull of all candidate preference vectors. In contrast, our algorithm requires only one single-objective planning step, during the initialization phase, to obtain the initial deterministic Pareto-optimal policy. We also consider scenarios where the state space size is 5 and 8 and the reward dimension is 4.
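The random MO-MDP described in the experiment setup (uniform transition kernel normalized per state-action pair, uniform vector rewards, discount factor 0.9), and the single-objective planning subroutine that OLS invokes per preference vector, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the value-iteration solver, and the default sizes are assumptions based on the quoted setup.

```python
import numpy as np

def random_mo_mdp(n_states=4, n_actions=3, reward_dim=3, seed=0):
    """Sample the random MO-MDP from the quoted setup: the transition
    kernel is uniform on [0, 1], normalized to sum to 1 over next
    states, and the rewards are uniform on [0, 1] per objective."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(size=(n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)          # valid transition kernel
    R = rng.uniform(size=(n_states, n_actions, reward_dim))
    return P, R

def scalarized_value_iteration(P, R, w, gamma=0.9, tol=1e-8):
    """Solve the single-objective MDP obtained by scalarizing the
    vector reward with preference vector w (the planning step OLS
    repeats per preference, and the proposed algorithm runs once)."""
    r = R @ w                                    # shape (S, A)
    V = np.zeros(P.shape[0])
    while True:
        Q = r + gamma * P @ V                    # Bellman backup, (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # values, greedy policy
        V = V_new
```

For example, `scalarized_value_iteration(*random_mo_mdp(), w=np.ones(3) / 3)` returns the optimal values and a deterministic greedy policy for the equal-weight scalarization.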
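Since Algorithm 3 is the PPrune subroutine of Roijers (2016), a minimal sketch of Pareto pruning may help readers unfamiliar with it. This is a naive quadratic-time illustration of removing Pareto-dominated value vectors, assumed here for exposition; it is not the paper's code.

```python
import numpy as np

def pprune(vectors):
    """Keep only the Pareto-nondominated value vectors: u is dominated
    if some other vector v satisfies v >= u componentwise with at
    least one strict inequality."""
    vectors = np.asarray(vectors, dtype=float)
    keep = []
    for i, u in enumerate(vectors):
        dominated = any(
            np.all(v >= u) and np.any(v > u)
            for j, v in enumerate(vectors) if j != i
        )
        if not dominated:
            keep.append(u)
    return np.array(keep)
```

For instance, `pprune([[1, 0], [0, 1], [0.5, 0.5], [0.4, 0.4]])` drops only `[0.4, 0.4]`, which is dominated by `[0.5, 0.5]`, while the three mutually incomparable vectors survive.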