Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Authors: Ronald Ortner

JAIR 2020

Reproducibility Variable Result LLM Response
Research Type Theoretical We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix³ SAT)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter. The proof of the regret bound is much simpler than for bounds achieved before and relies on concentration results for Markov chains.
Researcher Affiliation Academia Ronald Ortner EMAIL Department Mathematik und Informationstechnologie, Montanuniversität Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria
Pseudocode Yes Algorithm 1 Optimistic Sample Path (Osp) and Algorithm 2 Path Construction
Open Source Code No However, at the moment this is not more than an idea for future research and the details of such an algorithm are yet to be developed.
Open Datasets No The paper is theoretical and focuses on deriving regret bounds for reinforcement learning in Markov decision processes. It does not describe any experiments that would use a specific dataset.
Dataset Splits No The paper is theoretical and does not involve empirical evaluation on datasets, thus no dataset splits are discussed.
Hardware Specification No The paper is theoretical and does not describe any experimental setup or empirical evaluation, thus no hardware specifications are mentioned.
Software Dependencies No The paper describes a theoretical algorithm and its regret analysis, but does not provide details on specific software implementations or their version numbers.
Experiment Setup No The paper focuses on theoretical regret bounds and algorithm design for reinforcement learning, without detailing any experimental setup or hyperparameters.
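The Õ(√(t_mix³ SAT)) regret bound quoted in the abstract above can be sanity-checked numerically. Below is a minimal sketch (the paper itself is purely theoretical and reports no experiments); the parameter values are hypothetical, chosen only to illustrate how the bound scales, and constants and logarithmic factors are omitted:

```python
import math

def regret_bound(t_mix: float, S: int, A: int, T: int) -> float:
    """Order-of-magnitude regret bound O(sqrt(t_mix^3 * S * A * T)),
    ignoring constants and log factors hidden in the tilde-O notation."""
    return math.sqrt(t_mix**3 * S * A * T)

# Because regret grows only as sqrt(T), per-step regret vanishes as T grows.
# Illustrative (hypothetical) MDP sizes: t_mix = 10, S = 20, A = 5.
for T in (10**4, 10**6, 10**8):
    print(f"T = {T:>9}: per-step regret ~ {regret_bound(10, 20, 5, T) / T:.4f}")
```

The loop makes the "no-regret" property concrete: average per-step regret shrinks by a factor of 10 each time T grows by a factor of 100, as expected from the √T scaling.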