Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Authors: Tyler Kastner, Mark Rowland, Yunhao Tang, Murat A. Erdogdu, Amir-Massoud Farahmand

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We finally empirically validate our theoretical results and perform an empirical investigation into the relative strengths of using KL losses, and derive a number of actionable insights for practitioners."
Researcher Affiliation | Collaboration | ¹University of Toronto, ²Vector Institute, ³Google DeepMind, ⁴Meta Platforms, Inc. (work done while at Google DeepMind), ⁵Polytechnique Montréal, ⁶Mila. Correspondence to: Tyler Kastner <EMAIL>, Mark Rowland <EMAIL>.
Pseudocode | No | The paper describes algorithms through mathematical equations, such as Equations (5) and (6), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials.
Open Datasets | Yes | Garnet domain: "We sample a sparse Garnet MDP transition structure (Archibald et al., 1995)."
Dataset Splits | No | The empirical evaluation section describes different MDP environments and simulation parameters (e.g., "1,000 asynchronous updates", "10,000 independent seeds") but does not discuss training/test/validation splits of a dataset.
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU/CPU models or processor types, used for running the experiments.
Software Dependencies | Yes | "We simulate the cumulative effect of expected KL-CTD updates with small learning rate by numerically solving the flow ∂_t φ_t = (T^π − I) p_{φ_t}, using the default scipy.integrate.solve_ivp method (Virtanen et al., 2020)."
Experiment Setup | Yes | Cramér-CTD and KL-CTD are both run using 40 atoms uniformly spaced on [−30, 30]; a learning rate of 4 × 10⁻³ was used for TD and Cramér-CTD, and a learning rate of 1 × 10⁻¹ was used for KL-CTD.
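The ODE-flow simulation mentioned under Software Dependencies can be sketched with `scipy.integrate.solve_ivp`. This is a minimal illustration of the API only: the linear operator `A` below is a hypothetical stand-in for the paper's (T^π − I) dynamics, not the actual categorical operator acting on the parameters φ.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical linear stand-in for the (T^pi - I)-style dynamics;
# its eigenvalues are negative, so the flow contracts toward zero.
A = np.array([[-1.0, 0.5],
              [0.5, -1.0]])

def flow(t, phi):
    # Right-hand side of the ODE d(phi)/dt = A @ phi.
    return A @ phi

# Integrate the flow from t=0 to t=10 with solve_ivp's default
# method (RK45), as the paper reports using the default solver.
sol = solve_ivp(flow, t_span=(0.0, 10.0), y0=np.array([1.0, -1.0]))
print(sol.y[:, -1])  # state of the flow at t = 10
```

In the paper's setting, `flow` would instead evaluate the expected KL-CTD update direction at the current parameters; the `solve_ivp` call pattern is the same.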
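The reported experiment setup translates to a small configuration sketch. Assumption: "40 atoms uniformly spaced on [−30, 30]" is taken to include both endpoints, as with `numpy.linspace`; the variable names below are illustrative, not from the paper.

```python
import numpy as np

# Categorical support: 40 atoms uniformly spaced on [-30, 30]
# (assumption: endpoints inclusive, np.linspace convention).
NUM_ATOMS = 40
V_MIN, V_MAX = -30.0, 30.0
atoms = np.linspace(V_MIN, V_MAX, NUM_ATOMS)
gap = atoms[1] - atoms[0]  # spacing between adjacent atoms, 60/39

# Learning rates reported in the setup.
LR_TD_CRAMER = 4e-3  # TD and Cramér-CTD
LR_KL = 1e-1         # KL-CTD
```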