Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Beyond Average Return in Markov Decision Processes
Authors: Alexandre Marthe, Aurélien Garivier, Claire Vernade
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment: empirical validation of the bounds on a simple MDP. We consider a simple Chain MDP environment of length H = 70, equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution at every step. |
| Researcher Affiliation | Academia | Alexandre Marthe, UMPA, ENS de Lyon, Lyon, France, EMAIL; Aurélien Garivier, UMPA UMR 5669 and LIP UMR 5668, Univ. Lyon, ENS de Lyon, 46 allée d'Italie, F-69364 Lyon cedex 07, France, EMAIL; Claire Vernade, University of Tuebingen, Tuebingen, Germany, EMAIL |
| Pseudocode | Yes | Algorithm 1 Policy Evaluation (Dynamic Programming) for Distributional RL; Algorithm 2 Pseudo-Algorithm: Exact Planning with Distributional RL; Algorithm 3 Q-Learning for Linear and Exponential Utilities |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | No | We consider a simple Chain MDP environment of length H = 70, equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution at every step. We consider a Bernoulli reward distribution B(0.5) for each state, so that the number of atoms for the return only grows linearly with the number of steps, which allows the exact distribution to be computed easily. |
| Dataset Splits | No | The paper describes a simple synthetic MDP environment but does not specify any train/validation/test dataset splits or cross-validation setup for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., library names, framework versions, or solver versions) needed to replicate the experiment. |
| Experiment Setup | Yes | We consider a simple Chain MDP environment of length H = 70, equal to the horizon (see Figure 1 (right)) [Rowland et al., 2019], with a single action leading to the same discrete reward distribution at every step. We consider a Bernoulli reward distribution B(0.5) for each state, so that the number of atoms for the return only grows linearly with the number of steps, which allows the exact distribution to be computed easily... with a quantile projection with resolution N = 1000... We also empirically validate Theorem 1 by computing the CVaR(α) for α ∈ {0.1, 0.25}, corresponding respectively to distorted means with Lipschitz constants L = {10, 4}. |
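The experiment setup in the table is simple enough to sketch end to end. The following is a minimal sketch, not the authors' code: it uses H = 70, B(0.5) rewards, and quantile resolution N = 1000 from the table, while the midpoint-quantile convention and computing CVaR(α) as the mean of the lowest α-fraction of quantiles are assumptions on our part.

```python
import numpy as np

H = 70   # horizon / chain length, as stated in the table
p = 0.5  # Bernoulli reward parameter B(0.5)
N = 1000  # quantile projection resolution, as stated in the table

# Exact return distribution: the return is a sum of H i.i.d. B(0.5)
# rewards, i.e. Binomial(H, p). Build its pmf by repeated convolution;
# the support {0, ..., H} grows only linearly with H, as the table notes.
pmf = np.array([1.0])
step = np.array([1 - p, p])
for _ in range(H):
    pmf = np.convolve(pmf, step)
atoms = np.arange(H + 1)

# Quantile projection with resolution N (assumed midpoint convention):
# represent the distribution by its quantiles at tau_i = (2i - 1) / (2N).
cdf = np.cumsum(pmf)
taus = (2 * np.arange(1, N + 1) - 1) / (2 * N)
quantiles = atoms[np.searchsorted(cdf, taus)]

# CVaR(alpha), computed here as the mean of the lowest alpha-fraction
# of the projected quantiles (an assumption, not the paper's exact code).
for alpha in (0.1, 0.25):
    k = int(alpha * N)
    print(f"CVaR({alpha}) ~= {quantiles[:k].mean():.3f} "
          f"(mean return = {H * p})")
```

Since CVaR averages only the worst outcomes, both values print well below the mean return of 35, and CVaR(0.1) is lower than CVaR(0.25).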