Foundations of Multivariate Distributional Reinforcement Learning
Authors: Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Mark Rowland
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice." See also Section 6.1, "Simulations: Distributional Successor Features". |
| Researcher Affiliation | Collaboration | Harley Wiltzer (Mila Québec AI Institute, McGill University, EMAIL); Jesse Farebrother (Mila Québec AI Institute, McGill University, EMAIL); Arthur Gretton (Google DeepMind; Gatsby Unit, University College London, EMAIL); Mark Rowland (Google DeepMind, EMAIL) |
| Pseudocode | Yes | Algorithm 1 Projected Categorical Dynamic Programming |
| Open Source Code | No | The NeurIPS Paper Checklist states 'Code will be provided.', which is a future promise, not a current release of the code for the work described in the paper. |
| Open Datasets | No | The paper describes using '100 random MDPs, with transitions drawn from Dirichlet priors and 2-dimensional cumulants drawn from uniform priors.' This indicates custom-generated data rather than a specific, named, publicly available dataset with a concrete access link or formal citation. |
| Dataset Splits | No | The paper does not explicitly provide details about training/test/validation dataset splits, nor does it reference predefined splits or cross-validation setups for the MDP data used in experiments. |
| Hardware Specification | Yes | TD-learning experiments were conducted on an NVIDIA A100 80GB GPU to parallelize experiments. |
| Software Dependencies | No | The paper mentions software like 'Jax [BFH+18]' and 'Jax Opt [BBC+21]' and the 'Julia programming language [BEKS17]', but it does not provide specific version numbers for these software components (e.g., 'Jax 0.x' or 'Julia 1.x'). |
| Experiment Setup | Yes | SGD was used for optimization, with an annealed learning-rate schedule (λ_k)_{k≥0} where λ_k = k^(−3/5), satisfying the conditions of Lemma 10. |
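The pseudocode row above refers to the paper's Algorithm 1 (Projected Categorical Dynamic Programming), which is not reproduced here. As a rough illustration of the core operation such algorithms rely on, the sketch below implements the standard one-dimensional categorical projection from categorical distributional RL, assuming an evenly spaced support; the function name and interface are hypothetical, not taken from the paper.

```python
import math

def categorical_projection(support, atoms, probs):
    """Project a discrete distribution (atoms, probs) onto a fixed,
    evenly spaced categorical support by splitting each atom's mass
    between its two nearest support points (linear interpolation)."""
    dz = support[1] - support[0]                  # support spacing (assumed uniform)
    out = [0.0] * len(support)
    for x, p in zip(atoms, probs):
        x = min(max(x, support[0]), support[-1])  # clamp atom into the support range
        b = (x - support[0]) / dz                 # fractional index of x on the support
        lo, hi = math.floor(b), math.ceil(b)
        if lo == hi:                              # atom lies exactly on a support point
            out[lo] += p
        else:                                     # otherwise split mass linearly
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out
```

Mass is conserved by construction: each atom's probability is either assigned whole or split between two neighbors.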
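On the experiment-setup row: a step size of the form λ_k = k^(−3/5) is a standard stochastic-approximation choice, since Σ λ_k diverges while Σ λ_k² converges (3/5 ≤ 1 < 6/5), the kind of condition convergence lemmas for SGD typically require. A minimal sketch of such a schedule, with a hypothetical function name:

```python
def lr_schedule(k):
    """Annealed step size λ_k = k^(-3/5) for step k >= 1.
    Decays slowly enough that the step sizes sum to infinity,
    yet fast enough that their squares are summable."""
    return k ** (-3 / 5)
```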