Practical Kernel-Based Reinforcement Learning

Authors: André M.S. Barreto, Doina Precup, Joelle Pineau

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The potential of our algorithm is demonstrated in an extensive empirical study in which KBSF is applied to difficult tasks based on real-world data. Not only does KBSF solve problems that had never been solved before, but it also significantly outperforms other state-of-the-art reinforcement learning algorithms on the tasks studied.
Researcher Affiliation | Academia | André M. S. Barreto (EMAIL), Laboratório Nacional de Computação Científica, Petrópolis, Brazil; Doina Precup (EMAIL) and Joelle Pineau (EMAIL), School of Computer Science, McGill University, Montreal, Canada
Pseudocode | Yes | Algorithm 1: Batch KBSF; Algorithm 2: Update KBSF's MDP; Algorithm 3: Incremental KBSF (iKBSF)
Open Source Code | No | The paper contains no explicit statement by the authors about making their own KBSF implementation available. It refers to third-party simulators and environments used for experiments, but not to the authors' own source code.
Open Datasets | No | The paper mentions using established models, simulators, or problems (e.g., the "puddle world task (Sutton, 1996)", the "simulator described by Gomez (2003)", a "model that describes the interaction of the immune system with HIV...developed by Adams et al. (2004)", the "generative model developed by Bush et al. (2009)", and the "simulator developed by Abbeel et al. (2005)"). However, the authors do not provide direct access information (links, DOIs, repositories) for the data used, nor do they state that the underlying raw data is publicly available.
Dataset Splits | No | The paper describes how training data (sample transitions) were collected through interaction with simulators or generative models (e.g., "collected a set of n sample transitions (s_k^a, r_k^a, ŝ_k^a) using a random exploration policy", "used a 0.15-greedy policy to collect a second batch of 6000 transitions"). It also describes how policies were evaluated, often on sets of initial states (e.g., "The algorithms were evaluated on two sets of states distributed over disjoint regions of the state space surrounding the puddles", "The test set was comprised of 81 states equally spaced"). However, it does not describe pre-defined train/test/validation splits of static datasets with explicit percentages, counts, or references to standard split files.
Hardware Specification | No | The paper implies hardware was used for experiments by stating, "A conservative estimate reveals that, were KBRL(10^6) run on the same computer used for these experiments, we would have to wait for more than 6 months to see the results." However, no specific details about the CPU, GPU, memory, or other hardware components of this computer are provided.
Software Dependencies | No | The paper mentions using "the RL-Glue package (Tanner and White, 2009)" and various algorithms such as LSPI, fitted Q-iteration, extra trees, and SARSA. However, it does not provide specific version numbers for any of these software components, which would be necessary for reproducible software dependencies.
Experiment Setup | Yes | Puddle world: "discount factor of γ = 0.99", "varied both τ and τ̄ in the set {0.01, 0.1, 1}". Pole balancing: "discounted task with γ = 0.99", "policy iteration was used to find a decision policy for the MDPs constructed by KBSF, and this algorithm was run for a maximum of 30 iterations", "fixed the width of KBSF's kernel κ̄ at τ̄ = 1 and varied τ in {0.01, 0.1, 1} for both algorithms". HIV drug schedule: "discounted task with γ = 0.98", "ensemble of 30 trees", "varied FQIT's parameter η_min in the set {50, 100, 200}", "fixed τ = τ̄ = 1 and varied m in {2000, 4000, ..., 10000}", "only computed the µ = 2 largest values of kτ(s̄_i, ·) and the µ̄ = 3 largest values of k̄τ̄(ŝ_i^a, ·)". Epilepsy suppression: "discounted task with γ = 0.99", "fixed the latter at 1 and varied the former with values in {10, 20, 40}", "µ = µ̄ = 6", "fixed τ̄ = 1 and varied τ in {0.01, 0.1, 1}", "parameter η_min varying in {20, 30, ..., 200}". Helicopter hovering: "discretized the set A using 4 values per dimension, resulting in 256 possible actions", "discounted task with γ = 0.99", "SARSA with λ = 0.05, a learning rate of 0.001, and 24 tilings containing 412 tiles each", "ϵ = 1, and at every 50000 transitions the value of ϵ was decreased by 30%", "m = 500 representative states", "t_v = t_m = 50000 transitions", "fixed τ = τ̄ = 1 and µ = µ̄ = 4".
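For context on the pseudocode row above: the core of Batch KBSF is a stochastic factorization that compresses n sample transitions into an m-state MDP over representative states. The sketch below is a minimal single-action illustration under assumed choices (a Gaussian kernel, dense matrices, and the variable names `kbsf_reduced_mdp`, `rep_states`); it is not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(x, y, tau):
    """Unnormalized Gaussian kernel between two state vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * tau ** 2))

def kbsf_reduced_mdp(transitions, rep_states, tau=1.0, tau_bar=1.0):
    """Build a reduced m-state model for one action from n sample
    transitions (s_k, r_k, s_hat_k) and m representative states s_bar_i.
    Hypothetical sketch of KBSF's factorization, not the paper's code."""
    s, r, s_hat = transitions              # shapes: (n, d), (n,), (n, d)
    n, m = s.shape[0], rep_states.shape[0]

    # K[i, k]: weight of sampled start state s_k under representative
    # state s_bar_i (width tau); each row normalized to sum to 1.
    K = np.array([[gaussian_kernel(rep_states[i], s[k], tau)
                   for k in range(n)] for i in range(m)])
    K /= K.sum(axis=1, keepdims=True)

    # D[k, i]: weight of successor state s_hat_k under representative
    # state s_bar_i (width tau_bar); each row normalized.
    D = np.array([[gaussian_kernel(s_hat[k], rep_states[i], tau_bar)
                   for i in range(m)] for k in range(n)])
    D /= D.sum(axis=1, keepdims=True)

    # Stochastic factorization: reduced transitions and rewards.
    P_bar = K @ D                          # (m, m), each row sums to 1
    r_bar = K @ r                          # (m,)
    return P_bar, r_bar
```

Because both factors are row-stochastic, P_bar is itself a valid transition matrix, so the reduced MDP can be solved with standard dynamic programming (the paper uses, e.g., policy iteration).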
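The sparse-kernel setting quoted in the experiment-setup row (computing only the µ largest values of each kernel) amounts to a truncate-and-renormalize step on each row of kernel weights. A hedged sketch, with the helper name `truncate_kernel_weights` my own:

```python
import numpy as np

def truncate_kernel_weights(weights, mu):
    """Zero all but the mu largest kernel values, then renormalize so
    the surviving weights again form a distribution. Illustrative
    sketch of the sparse-kernel idea, not the authors' code."""
    w = np.asarray(weights, dtype=float).copy()
    if mu < w.size:
        smallest = np.argsort(w)[:-mu]     # indices of the (n - mu) smallest
        w[smallest] = 0.0
    return w / w.sum()
```

Truncating this way keeps the matrices K and D sparse, which is what makes settings like µ = 2 and µ̄ = 3 tractable for the larger tasks.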