Regularized Policy Iteration with Nonparametric Function Spaces

Authors: Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, Shie Mannor

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal. This is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm."
Researcher Affiliation | Collaboration | Amir-massoud Farahmand (EMAIL), Mitsubishi Electric Research Laboratories (MERL), 201 Broadway, 8th Floor, Cambridge, MA 02139, USA; Mohammad Ghavamzadeh (EMAIL), Adobe Research, 321 Park Avenue, San Jose, CA 95110, USA; Csaba Szepesvári (EMAIL), Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada; Shie Mannor (EMAIL), Department of Electrical Engineering, The Technion, Haifa 32000, Israel
Pseudocode | Yes |
Algorithm 1: Regularized Policy Iteration(K, Q̂^(−1), F^|A|, J, {(λ^(k)_Q,n, λ^(k)_h,n)}_{k=0}^{K−1})
    // K: number of iterations
    // Q̂^(−1): initial action-value function
    // F^|A|: the action-value function space
    // J: the regularizer
    // {(λ^(k)_Q,n, λ^(k)_h,n)}_{k=0}^{K−1}: the regularization coefficients
    for k = 0 to K−1 do
        π_k(·) ← π̂(·; Q̂^(k−1))
        Generate training samples D^(k)_n
        Q̂^(k) ← REG-LSTD/BRM(π_k, D^(k)_n; F^|A|, J, λ^(k)_Q,n, λ^(k)_h,n)
    end for
    return Q̂^(K−1) and π_K(·) = π̂(·; Q̂^(K−1))
Open Source Code | No | The paper does not provide concrete access to source code. It defers efficient implementations to future work, stating: "Designing scalable optimization algorithms for REG-LSPI/BRM is a topic for future work."
Open Datasets | No | The paper is theoretical and focuses on algorithm design and statistical properties. It introduces concepts like "a batch of data Dn" for theoretical analysis of an "offline learning scenario" (Section 2.3), but does not refer to any specific, publicly available dataset used for experiments or provide any links or citations for data access.
Dataset Splits | No | The paper is theoretical and does not conduct empirical experiments with specific datasets, so it provides no training/validation/test splits. It discusses theoretical sampling assumptions, e.g., that "samples Xi and Xi+1 may be sampled independently (we call this the Planning scenario)", but this does not concern practical data splitting for reproduction.
Hardware Specification | No | The paper is theoretical and does not describe any experimental hardware used for running simulations or computations. There are no mentions of specific GPU/CPU models, processors, or computing environments.
Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies or version numbers needed to replicate potential experimental results.
Experiment Setup | No | The paper is theoretical and describes algorithms and their statistical properties. It does not provide specific details about experimental setup, hyperparameters, optimizer settings, or training configurations for empirical evaluation.
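Since the paper itself provides no code, the structure of Algorithm 1 can be illustrated with a minimal, hypothetical Python sketch. This is not the authors' implementation: it substitutes simple ridge (Tikhonov) regularization on a tabular feature vector for the paper's RKHS regularizer J, uses a small random finite MDP in place of a general state space, and names (`reg_lstd`, `feat`) are invented for illustration. It shows the loop of Algorithm 1 — regularized LSTD policy evaluation followed by greedy improvement — under those simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # random transition kernel P(s'|s,a)
R = rng.random((nS, nA))                       # random reward function r(s,a)

def feat(s, a):
    """One-hot feature vector over state-action pairs (tabular case)."""
    v = np.zeros(nS * nA)
    v[s * nA + a] = 1.0
    return v

def reg_lstd(D, pi, lam):
    """Ridge-regularized LSTD policy evaluation (stand-in for REG-LSTD).

    Solves (Phi^T (Phi - gamma * Phi') / n + lam * I) w = Phi^T r / n,
    where Phi' uses the next state and the evaluated policy's action.
    """
    Phi = np.array([feat(s, a) for s, a, _, _ in D])
    Phi2 = np.array([feat(s2, pi[s2]) for _, _, _, s2 in D])
    r = np.array([rew for _, _, rew, _ in D])
    n = len(D)
    A = Phi.T @ (Phi - gamma * Phi2) / n + lam * np.eye(nS * nA)
    b = Phi.T @ r / n
    return np.linalg.solve(A, b)

# Collect a batch of transitions D_n (offline/batch setting)
D = []
for _ in range(2000):
    s, a = int(rng.integers(nS)), int(rng.integers(nA))
    s2 = int(rng.choice(nS, p=P[s, a]))
    D.append((s, a, R[s, a], s2))

pi = np.zeros(nS, dtype=int)          # arbitrary initial policy
for k in range(10):                    # K policy-iteration steps
    w = reg_lstd(D, pi, lam=1e-3)      # regularized policy evaluation
    Q = w.reshape(nS, nA)              # tabular features => Q recovers a table
    pi = Q.argmax(axis=1)              # greedy improvement pi_{k+1}(s)

print("greedy policy:", pi)
```

With tabular one-hot features the solve reduces to (slightly regularized) exact policy evaluation, so the loop behaves like standard policy iteration; the interesting regime in the paper is the nonparametric one, where F^|A| is an RKHS and the regularizer controls its effective capacity.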