An Analysis of Quantile Temporal-Difference Learning

Authors: Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | The core result of this paper is a proof that QTD converges with probability 1 to the fixed points of a related family of dynamic programming procedures, putting QTD on firm theoretical footing. The proof connects QTD to non-linear differential inclusions via stochastic approximation theory and non-smooth analysis.
Researcher Affiliation | Collaboration | Mark Rowland (EMAIL), Google DeepMind, London, UK; Rémi Munos (EMAIL), Google DeepMind, Paris, France; Mohammad Gheshlaghi Azar (EMAIL), Google DeepMind, Seattle, USA; Yunhao Tang (EMAIL), Google DeepMind, London, UK; Georg Ostrovski (EMAIL), Google DeepMind, London, UK; Anna Harutyunyan (EMAIL), Google DeepMind, London, UK; Karl Tuyls (EMAIL), Google DeepMind, Paris, France; Marc G. Bellemare (EMAIL), Reliant AI & McGill University, Montréal, Canada; Will Dabney (EMAIL), Google DeepMind, Seattle, USA
Pseudocode | Yes | Algorithm 1: QTD update; Algorithm 2: Quantile dynamic programming; Algorithm 3: Quantile dynamic programming (finitely-supported rewards); Algorithm 4: Quantile dynamic programming (reward CDFs)
Open Source Code | No | The paper does not provide access to source code for the methodology described. It mentions that the simulations were generated using Python 3 and several libraries, but does not state that the authors' QTD implementation is openly available, nor does it provide a link.
Open Datasets | No | The paper describes numerical examples on custom-defined small MDPs (e.g., a chain MDP and a two-state MDP with Gaussian or Dirac delta rewards) to illustrate theoretical concepts. It does not provide access information for any publicly available or open datasets used in its own analysis. References to benchmark domains such as the Arcade Learning Environment in the introduction concern past applications of QTD, not this paper's experimental validation.
Dataset Splits | No | The paper focuses on theoretical analysis and uses small, custom-defined Markov decision processes (MDPs) for numerical examples and illustrations. These examples do not involve datasets with explicit training/validation/test splits.
Hardware Specification | No | The paper states: "The simulations in this paper were generated using the Python 3 language..." but provides no details about the hardware (CPU or GPU models, memory, etc.) used for these simulations or any other experiments.
Software Dependencies | No | The paper states: "The simulations in this paper were generated using the Python 3 language, and made use of the NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and Matplotlib (Hunter, 2007) libraries." The software is named, but no version numbers are given for these libraries, and Python is identified only at the major-version level.
Experiment Setup | Yes | Example 2, discussing a chain MDP, mentions "using a constant learning rate of 0.01". Example 3, discussing a two-state MDP, specifies a discount factor γ = 0.5 for the environment.
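To make the table's "Pseudocode" and "Experiment Setup" rows concrete, here is a minimal sketch of a tabular QTD update in the spirit of Algorithm 1. This is an illustrative reconstruction, not the authors' code: it assumes the standard quantile midpoint levels τ_i = (2i − 1)/(2m) and reuses the constant learning rate (0.01) and discount factor (γ = 0.5) quoted from the paper's examples as defaults.

```python
import numpy as np

def qtd_update(theta, x, r, x_next, alpha=0.01, gamma=0.5):
    """One tabular QTD update at state x, given a sampled transition (x, r, x_next).

    theta : array of shape (num_states, m) holding m quantile estimates per state.
    Quantile levels are the midpoints tau_i = (2i - 1) / (2m), i = 1, ..., m.
    """
    m = theta.shape[1]
    tau = (2 * np.arange(m) + 1) / (2 * m)
    # Bootstrapped sample targets: reward plus discounted next-state quantiles.
    targets = r + gamma * theta[x_next]
    for i in range(m):
        # Averaged subgradient of the quantile (pinball) loss at level tau[i]:
        # move up with weight tau[i], down with weight 1 - tau[i].
        grad = np.mean(tau[i] - (targets < theta[x, i]))
        theta[x, i] += alpha * grad
    return theta

# Example: a single absorbing state with reward 1 and gamma = 0 drives every
# quantile estimate toward 1 (the return distribution is a Dirac delta at 1).
theta = np.zeros((1, 3))
for _ in range(2000):
    qtd_update(theta, 0, 1.0, 0, alpha=0.01, gamma=0.0)
```

The update moves each quantile estimate up or down by a bounded step depending on the fraction of sample targets below it, which is the non-smooth dynamics the paper analyses via differential inclusions.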