Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics
Authors: Tyler Kastner, Mark Rowland, Yunhao Tang, Murat A Erdogdu, Amir-Massoud Farahmand
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We finally empirically validate our theoretical results and perform an empirical investigation into the relative strengths of using KL losses, and derive a number of actionable insights for practitioners. |
| Researcher Affiliation | Collaboration | 1University of Toronto 2Vector Institute 3Google DeepMind 4Meta Platforms, Inc. Work done while at Google DeepMind 5Polytechnique Montréal 6Mila. Correspondence to: Tyler Kastner <EMAIL>, Mark Rowland <EMAIL>. |
| Pseudocode | No | The paper describes algorithms through mathematical equations like Equation (5) and (6) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials. |
| Open Datasets | Yes | Garnet domain: We sample a sparse Garnet MDP transition structure (Archibald et al., 1995). |
| Dataset Splits | No | The empirical evaluation section describes different MDP environments and simulation parameters (e.g., '1,000 asynchronous updates', '10,000 independent seeds') but does not discuss training/test/validation splits of a dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or processor types used for running experiments. |
| Software Dependencies | Yes | We simulate the cumulative effect of expected KL-CTD updates with small learning rate by numerically solving the flow ∂_t ϕ_t = (T^π − I) p_{ϕ_t}, using the default scipy.integrate.solve_ivp method (Virtanen et al., 2020). |
| Experiment Setup | Yes | Cramér-CTD and KL-CTD are both run using 40 atoms uniformly spaced on [−30, 30]; a learning rate of 4 × 10⁻³ was used for TD and Cramér-CTD, and a learning rate of 1 × 10⁻¹ was used for KL-CTD. |
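The software-dependency row describes integrating a continuous-time flow of the form ∂_t ϕ_t = (T^π − I) p_{ϕ_t} with `scipy.integrate.solve_ivp`. A minimal sketch of that machinery is below, using a toy row-stochastic matrix `T` as a stand-in for the distributional Bellman operator (the actual operator acts on categorical distribution parameters and is not reproduced here); the matrix, state count, and time horizon are all illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy stand-in for the operator: a 4x4 row-stochastic matrix T,
# so the flow d(phi)/dt = (T - I) @ phi mimics the (T^pi - I) structure
# of the expected-update flow described in the paper.
rng = np.random.default_rng(0)
T = rng.random((4, 4))
T /= T.sum(axis=1, keepdims=True)  # normalize rows to sum to 1

def flow(t, phi):
    # Right-hand side of the ODE: (T - I) phi.
    return (T - np.eye(4)) @ phi

phi0 = rng.random(4)
# Integrate the flow over a long horizon with solve_ivp's default RK45.
sol = solve_ivp(flow, t_span=(0.0, 50.0), y0=phi0, method="RK45")
phi_inf = sol.y[:, -1]  # state of the flow at the final time

# At convergence the residual (T - I) phi should be near zero,
# i.e. phi_inf lies close to a fixed point of T.
residual = np.linalg.norm((T - np.eye(4)) @ phi_inf)
```

For a row-stochastic `T` the fixed points are constant vectors, so `phi_inf` ends up with nearly equal components; the paper's flow instead converges to the fixed point of the projected distributional Bellman operator.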