An Analysis of Quantile Temporal-Difference Learning

Authors: Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | The core result of this paper is a proof that QTD converges with probability 1 to the fixed points of a related family of dynamic programming procedures, putting QTD on firm theoretical footing. The proof connects QTD to non-linear differential inclusions via stochastic approximation theory and non-smooth analysis.
Researcher Affiliation | Collaboration | Mark Rowland (EMAIL), Google DeepMind, London, UK; Rémi Munos (EMAIL), Google DeepMind, Paris, France; Mohammad Gheshlaghi Azar (EMAIL), Google DeepMind, Seattle, USA; Yunhao Tang (EMAIL), Google DeepMind, London, UK; Georg Ostrovski (EMAIL), Google DeepMind, London, UK; Anna Harutyunyan (EMAIL), Google DeepMind, London, UK; Karl Tuyls (EMAIL), Google DeepMind, Paris, France; Marc G. Bellemare (EMAIL), Reliant AI & McGill University, Montréal, Canada; Will Dabney (EMAIL), Google DeepMind, Seattle, USA
Pseudocode | Yes | Algorithm 1: QTD update; Algorithm 2: Quantile dynamic programming; Algorithm 3: Quantile dynamic programming (finitely-supported rewards); Algorithm 4: Quantile dynamic programming (reward CDFs)
Open Source Code | No | The paper does not provide access to source code for the methodology described. It mentions that the simulations were generated using Python 3 and several libraries, but does not state that the authors' QTD implementation is openly available, nor does it provide a link.
Open Datasets | No | The paper describes numerical examples on custom-defined small MDPs (e.g., a chain MDP and a two-state MDP with Gaussian or Dirac delta rewards) to illustrate theoretical concepts. It does not provide access information for any publicly available or open datasets used in its own analysis. References to benchmark domains such as the Arcade Learning Environment in the introduction concern past applications of QTD, not this paper's experimental validation.
Dataset Splits | No | The paper focuses on theoretical analysis and uses small, custom-defined Markov decision processes (MDPs) for numerical examples and illustrations. These examples do not involve datasets with explicit training/validation/test splits.
Hardware Specification | No | The paper states: "The simulations in this paper were generated using the Python 3 language..." but provides no details about the hardware (CPU or GPU models, memory, etc.) used for these simulations or any other experiments.
Software Dependencies | No | The paper states: "The simulations in this paper were generated using the Python 3 language, and made use of the NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and Matplotlib (Hunter, 2007) libraries." The software is named, but no version numbers are given for these libraries, and Python is identified only at the major-version level.
Experiment Setup | Yes | Example 2, discussing a chain MDP, mentions "using a constant learning rate of 0.01". Example 3, discussing a two-state MDP, specifies a discount factor γ = 0.5 for the environment.
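To make the table's "Pseudocode" and "Experiment Setup" rows concrete, here is a minimal sketch of a tabular QTD update in the spirit of Algorithm 1. This is an illustrative reconstruction, not the authors' code: it assumes the standard quantile midpoint levels τ_i = (2i − 1)/(2m) and reuses the constant learning rate (0.01) and discount factor (γ = 0.5) quoted from the paper's examples as defaults.

```python
import numpy as np

def qtd_update(theta, x, r, x_next, alpha=0.01, gamma=0.5):
    """One tabular QTD update at state x, given a sampled transition (x, r, x_next).

    theta : array of shape (num_states, m) holding m quantile estimates per state.
    Quantile levels are the midpoints tau_i = (2i - 1) / (2m), i = 1, ..., m.
    """
    m = theta.shape[1]
    tau = (2 * np.arange(m) + 1) / (2 * m)
    # Bootstrapped sample targets: reward plus discounted next-state quantiles.
    targets = r + gamma * theta[x_next]
    for i in range(m):
        # Averaged subgradient of the quantile (pinball) loss at level tau[i]:
        # move up with weight tau[i], down with weight 1 - tau[i].
        grad = np.mean(tau[i] - (targets < theta[x, i]))
        theta[x, i] += alpha * grad
    return theta

# Example: a single absorbing state with reward 1 and gamma = 0 drives every
# quantile estimate toward 1 (the return distribution is a Dirac delta at 1).
theta = np.zeros((1, 3))
for _ in range(2000):
    qtd_update(theta, 0, 1.0, 0, alpha=0.01, gamma=0.0)
```

The update moves each quantile estimate up or down by a bounded step depending on the fraction of sample targets below it, which is the non-smooth dynamics the paper analyses via differential inclusions.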