How to Find the Exact Pareto Front for Multi-Objective MDPs?

Authors: Yining Li, Peizhong Ju, Ness Shroff

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical studies demonstrate the effectiveness of our theoretical strategy in discovering the Pareto front efficiently. In this section, we empirically evaluate the performance of our algorithm in discovering the full Pareto front, including the vertices that correspond to deterministic Pareto-optimal policies in MO-MDPs. We compare the running time of our algorithm against OLS.
Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, The Ohio State University; Department of Computer Science, University of Kentucky; Department of Computer Science and Engineering, The Ohio State University
Pseudocode | Yes | The pseudocode for the Pareto front searching algorithm is presented in Algorithm 1. While the primary objective of Algorithm 1 is to identify the entire Pareto front, a byproduct of the algorithm is the identification of all vertices on the Pareto front that correspond to all deterministic Pareto-optimal policies. Algorithm 2: Pareto Optimal Face Selection. Algorithm 3: PPrune(V) (Roijers, 2016). Algorithm 4: Benchmark Algorithm.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. The acknowledgments mention grant support but no code.
Open Datasets | No | We consider a basic MDP setting with only 4 states and 3 actions, and a reward dimension of 3. The transition kernel is drawn uniformly between 0 and 1 and normalized to sum to 1, and the reward functions are drawn uniformly between 0 and 1. The discount factor is 0.9. We adopt the deep-sea-treasure setting from Yang et al. (2019a) and construct a scenario where the agent navigates a grid world, making trade-offs between different objectives.
Dataset Splits | No | The paper describes synthetic data generation and environment construction for the experiments, such as a "basic MDP setting" and a "deep-sea-treasure setting". It does not involve a pre-existing dataset that would require explicit training/validation/test splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not name any specific software libraries or version numbers used for the implementation or experiments.
Experiment Setup | Yes | We consider a basic MDP setting with only 4 states and 3 actions, and a reward dimension of 3. The transition kernel is drawn uniformly between 0 and 1 and normalized to sum to 1, and the reward functions are drawn uniformly between 0 and 1. The discount factor is 0.9. Specifically, we consider state spaces of size 5 and 8, action spaces ranging from 5 to 7, and a reward dimension of 3. In each iteration, OLS solves a single-objective MDP for a given preference vector and computes new preference vectors by constructing the convex hull of all candidate preference vectors. In contrast, our algorithm requires only one single-objective planning step, during the initialization phase, to obtain the initial deterministic Pareto-optimal policy. We also consider scenarios where the state space size is 5 and 8 and the reward dimension is 4.
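The random MO-MDP described in the experiment setup (uniform transition kernel normalized per state-action pair, uniform vector rewards, discount factor 0.9), and the single-objective planning subroutine that OLS invokes per preference vector, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the value-iteration solver, and the default sizes are assumptions based on the quoted setup.

```python
import numpy as np

def random_mo_mdp(n_states=4, n_actions=3, reward_dim=3, seed=0):
    """Sample the random MO-MDP from the quoted setup: the transition
    kernel is uniform on [0, 1], normalized to sum to 1 over next
    states, and the rewards are uniform on [0, 1] per objective."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(size=(n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)          # valid transition kernel
    R = rng.uniform(size=(n_states, n_actions, reward_dim))
    return P, R

def scalarized_value_iteration(P, R, w, gamma=0.9, tol=1e-8):
    """Solve the single-objective MDP obtained by scalarizing the
    vector reward with preference vector w (the planning step OLS
    repeats per preference, and the proposed algorithm runs once)."""
    r = R @ w                                    # shape (S, A)
    V = np.zeros(P.shape[0])
    while True:
        Q = r + gamma * P @ V                    # Bellman backup, (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # values, greedy policy
        V = V_new
```

For example, `scalarized_value_iteration(*random_mo_mdp(), w=np.ones(3) / 3)` returns the optimal values and a deterministic greedy policy for the equal-weight scalarization.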
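Since Algorithm 3 is the PPrune subroutine of Roijers (2016), a minimal sketch of Pareto pruning may help readers unfamiliar with it. This is a naive quadratic-time illustration of removing Pareto-dominated value vectors, assumed here for exposition; it is not the paper's code.

```python
import numpy as np

def pprune(vectors):
    """Keep only the Pareto-nondominated value vectors: u is dominated
    if some other vector v satisfies v >= u componentwise with at
    least one strict inequality."""
    vectors = np.asarray(vectors, dtype=float)
    keep = []
    for i, u in enumerate(vectors):
        dominated = any(
            np.all(v >= u) and np.any(v > u)
            for j, v in enumerate(vectors) if j != i
        )
        if not dominated:
            keep.append(u)
    return np.array(keep)
```

For instance, `pprune([[1, 0], [0, 1], [0.5, 0.5], [0.4, 0.4]])` drops only `[0.4, 0.4]`, which is dominated by `[0.5, 0.5]`, while the three mutually incomparable vectors survive.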