Regret Bounds for Reinforcement Learning via Markov Chain Concentration
Authors: Ronald Ortner
JAIR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We give a simple optimistic algorithm for which it is easy to derive regret bounds of O(√(t_mix · S · A · T)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter. The proof of the regret bound is much simpler than for bounds achieved before and relies on concentration results for Markov chains. |
| Researcher Affiliation | Academia | Ronald Ortner EMAIL Department Mathematik und Informationstechnologie, Montanuniversität Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria |
| Pseudocode | Yes | Algorithm 1 Optimistic Sample Path (Osp) and Algorithm 2 Path Construction |
| Open Source Code | No | However, at the moment this is not more than an idea for future research and the details of such an algorithm are yet to be developed. |
| Open Datasets | No | The paper is theoretical and focuses on deriving regret bounds for reinforcement learning in Markov decision processes. It does not describe any experiments that would use a specific dataset. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical evaluation on datasets, thus no dataset splits are discussed. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental setup or empirical evaluation, thus no hardware specifications are mentioned. |
| Software Dependencies | No | The paper describes a theoretical algorithm and its regret analysis, but does not provide details on specific software implementations or their version numbers. |
| Experiment Setup | No | The paper focuses on theoretical regret bounds and algorithm design for reinforcement learning, without detailing any experimental setup or hyperparameters. |
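The paper is classified above as purely theoretical, so no reference implementation exists. As background for the t_mix parameter appearing in the reported regret bound, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of computing the mixing time of a toy two-state Markov chain, defined as the smallest t at which every row of P^t is within total-variation distance 1/4 of the stationary distribution:

```python
import numpy as np

# Hypothetical toy chain (not from the paper): a two-state transition matrix.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution pi: left eigenvector of P for eigenvalue 1,
# normalized to sum to one.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def max_tv_distance(P, pi, t):
    """Worst-case total-variation distance of the rows of P^t from pi."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[s] - pi).sum() for s in range(len(pi)))

# t_mix: smallest t with max_s ||P^t(s, .) - pi||_TV <= 1/4.
t_mix = next(t for t in range(1, 1000) if max_tv_distance(P, pi, t) <= 0.25)
print(t_mix)
```

For a uniformly ergodic chain this search always terminates; the bound in the paper scales with this quantity alongside S, A, and T.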