The Number of Trials Matters in Infinite-Horizon General-Utility Markov Decision Processes

Authors: Pedro Pinto Santos, Alberto Sardinha, Francisco S. Melo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Third, we provide a set of empirical results to support our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation." (from Abstract) and "4.3. Empirical results. We now empirically assess the impact of different parameters in the mismatch between f_{K,H}(π) and f(π) for arbitrary fixed π. ... In Fig. 2, a set of plots displays the average finite trials objective function, f(d̂_{T_{K,H}}), in comparison to the infinite trials objective f(d_π), under GUMDP M_{f,1}."
Researcher Affiliation | Academia | (1) INESC-ID, Lisbon, Portugal; (2) Instituto Superior Técnico, Lisbon, Portugal; (3) PUC-Rio, Rio de Janeiro, Brazil. All of these are academic or research institutions.
Pseudocode | Yes | "Algorithm 1: Estimating f_{K,H}(π) via samples. 1: Inputs: N ∈ ℕ (num. of iterations), K ∈ ℕ (num. of trajectories), H ∈ ℕ (trajectories horizon), and γ (discount factor). 2: f̂_0 = 0. 3: for n in {1, ..., N} do..."
Open Source Code | Yes | "The code used can be found in the following repository."
Open Datasets | No | "Throughout this paper, we make use of the GUMDPs depicted in Fig. 1, which are representative of three common tasks in the convex RL literature." The paper describes the structure of these GUMDPs (M_{f,1}, M_{f,2}, M_{f,3}) but does not provide access information (link, citation, repository) for these or any other external datasets.
Dataset Splits | No | The paper generates 'sampled trajectories' and varies the 'number of trials K' and 'trajectories length H' in simulation, rather than using static datasets with explicit training/validation/test splits. It therefore provides no dataset split information in the conventional sense.
Hardware Specification | No | The paper provides no specific hardware details (such as GPU/CPU models, memory, or cloud instances) for the experiments.
Software Dependencies | No | The paper provides no software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) needed to replicate the experiments.
Experiment Setup | Yes | "Under M_{f,1}, π(left|s_0) = π(right|s_0) = 0.5, and π(right|s_1) = π(left|s_2) = 1; for M_{f,2} and M_{f,3}, π is uniformly random. We consider 100 random seeds and report 95% bootstrapped confidence intervals (shaded areas in plots)." (from Section 4.3) and "Algorithm 1: Estimating f_{K,H}(π) via samples. 1: Inputs: N ∈ ℕ (num. of iterations), K ∈ ℕ (num. of trajectories), H ∈ ℕ (trajectories horizon), and γ (discount factor)."
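The excerpt of Algorithm 1 quoted above stops after its third line, so the following Python sketch is only one plausible reading of a sample-based estimator with the same inputs (N, K, H, γ), not the authors' implementation. Several pieces are assumptions introduced for illustration: the three-state transition tensor P (the structure of M_{f,1} is not given on this page), the objective neg_state_entropy, the normalisation of the truncated discounted occupancy so that it sums to one, and the percentile bootstrap in bootstrap_ci. Only the policy values for M_{f,1} are taken from the quoted experiment setup.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (action 0 = left, 1 = right).
# These transitions are an assumption; they are NOT the paper's M_{f,1}.
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 1, 2] = 1.0   # s0: left -> s1, right -> s2
P[1, 0, 1] = P[1, 1, 0] = 1.0   # s1: left stays, right -> s0
P[2, 0, 0] = P[2, 1, 2] = 1.0   # s2: left -> s0, right stays

# Policy quoted for M_{f,1}: uniform in s0, right in s1, left in s2.
pi = np.array([[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]])


def neg_state_entropy(d):
    """Example convex objective (assumption): negative entropy of the state marginal."""
    p = d.sum(axis=1)
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))


def empirical_occupancy(P, pi, K, H, gamma, rng, s0=0):
    """Discounted state-action occupancy averaged over K trajectories of length H."""
    S, A = pi.shape
    weights = gamma ** np.arange(H)
    weights = weights / weights.sum()   # assumed normalisation: occupancy sums to 1
    d_hat = np.zeros((S, A))
    for _ in range(K):
        s = s0
        for t in range(H):
            a = rng.choice(A, p=pi[s])
            d_hat[s, a] += weights[t] / K
            s = rng.choice(S, p=P[s, a])
    return d_hat


def estimate_finite_trials_objective(f, P, pi, N, K, H, gamma, seed=0):
    """Average f(d_hat) over N independent batches of K trajectories."""
    rng = np.random.default_rng(seed)
    f_hat = 0.0                          # matches the quoted step "f_hat_0 = 0"
    for n in range(1, N + 1):
        d_hat = empirical_occupancy(P, pi, K, H, gamma, rng)
        f_hat += (f(d_hat) - f_hat) / n  # incremental running mean
    return f_hat


def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-seed values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)
```

A run in the spirit of the quoted setup would call estimate_finite_trials_objective once per seed (the paper reports 100 seeds) and pass the resulting list to bootstrap_ci to obtain a 95% interval. Because the objective is applied to the empirical occupancy before averaging, the estimate for small K can differ from f evaluated at the expected occupancy, which is the mismatch the paper studies.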