Analysis of Classification-based Policy Iteration Algorithms

Authors: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos

JMLR 2016 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We introduce a variant of the classification-based approach to policy iteration which uses a cost-sensitive loss function weighting each classification mistake by its actual regret, that is, the difference between the action-value of the greedy action and of the action chosen by the classifier. For this algorithm, we provide a full finite-sample analysis. In this paper, we presented a variant of the classification-based approach to approximate policy iteration (API) called direct policy iteration (DPI) and provided its finite-sample performance bounds. To the best of our knowledge, this is the first complete finite-sample analysis for this class of API algorithms.
Researcher Affiliation Collaboration Alessandro Lazaric1 EMAIL Mohammad Ghavamzadeh2,1 EMAIL R emi Munos3,1 EMAIL 1 INRIA Lille Team Seque L, France 2 Adobe Research, USA 3 Google Deep Mind, UK
Pseudocode Yes Figure 1: The Direct Policy Iteration (DPI) algorithm. Input: policy space Π Bπ(X), state distribution ρ, number of rollout states N, number of rollouts per state-action pair M, rollout horizon H Initialize: Let π0 Π be an arbitrary policy for k = 0, 1, 2, . . . do Construct the rollout set Dk = {xi}N i=1, xi iid ρ for all states xi Dk and actions a A do for j = 1 to M do Perform a rollout according to policy πk and return Rπk j (xi, a) = r(xi, a) + t=1 γtr xt, πk(xt) , with xt p |xt 1, πk(xt 1) and x1 p( |xi, a) end for b Qπk(xi, a) = 1 M PM j=1 Rπk j (xi, a) end for πk+1 = arg minπ Π b Lπk(bρ; π) (classifier) end for
Open Source Code No No explicit statement or link regarding open-source code for the methodology described in this paper was found.
Open Datasets No The paper does not mention using any specific datasets for experiments or provide access information for any open datasets. The paper is purely theoretical, focusing on analysis and bounds of an algorithm. For example, Section 5.1 presents a 'Counterexample' MDP, which is a theoretical construct rather than an actual dataset.
Dataset Splits No The paper is theoretical and does not conduct experiments on datasets, therefore no dataset splits are provided.
Hardware Specification No The paper is theoretical and does not report experimental results, thus no hardware specifications are mentioned.
Software Dependencies No The paper is theoretical and does not provide an implementation or experimental results that would necessitate a list of software dependencies with specific version numbers.
Experiment Setup No The paper is theoretical and focuses on algorithm analysis and bounds, without describing any experimental setup, hyperparameters, or training configurations.