reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Analysis of Classification-based Policy Iteration Algorithms

Authors: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos

JMLR 2016 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	We introduce a variant of the classiﬁcation-based approach to policy iteration which uses a cost-sensitive loss function weighting each classiﬁcation mistake by its actual regret, that is, the diﬀerence between the action-value of the greedy action and of the action chosen by the classiﬁer. For this algorithm, we provide a full ﬁnite-sample analysis. In this paper, we presented a variant of the classiﬁcation-based approach to approximate policy iteration (API) called direct policy iteration (DPI) and provided its ﬁnite-sample performance bounds. To the best of our knowledge, this is the ﬁrst complete ﬁnite-sample analysis for this class of API algorithms.
Researcher Affiliation	Collaboration	Alessandro Lazaric1 EMAIL Mohammad Ghavamzadeh2,1 EMAIL R emi Munos3,1 EMAIL 1 INRIA Lille Team Seque L, France 2 Adobe Research, USA 3 Google Deep Mind, UK
Pseudocode	Yes	Figure 1: The Direct Policy Iteration (DPI) algorithm. Input: policy space Π Bπ(X), state distribution ρ, number of rollout states N, number of rollouts per state-action pair M, rollout horizon H Initialize: Let π0 Π be an arbitrary policy for k = 0, 1, 2, . . . do Construct the rollout set Dk = {xi}N i=1, xi iid ρ for all states xi Dk and actions a A do for j = 1 to M do Perform a rollout according to policy πk and return Rπk j (xi, a) = r(xi, a) + t=1 γtr xt, πk(xt) , with xt p \|xt 1, πk(xt 1) and x1 p( \|xi, a) end for b Qπk(xi, a) = 1 M PM j=1 Rπk j (xi, a) end for πk+1 = arg minπ Π b Lπk(bρ; π) (classiﬁer) end for
Open Source Code	No	No explicit statement or link regarding open-source code for the methodology described in this paper was found.
Open Datasets	No	The paper does not mention using any specific datasets for experiments or provide access information for any open datasets. The paper is purely theoretical, focusing on analysis and bounds of an algorithm. For example, Section 5.1 presents a 'Counterexample' MDP, which is a theoretical construct rather than an actual dataset.
Dataset Splits	No	The paper is theoretical and does not conduct experiments on datasets, therefore no dataset splits are provided.
Hardware Specification	No	The paper is theoretical and does not report experimental results, thus no hardware specifications are mentioned.
Software Dependencies	No	The paper is theoretical and does not provide an implementation or experimental results that would necessitate a list of software dependencies with specific version numbers.
Experiment Setup	No	The paper is theoretical and focuses on algorithm analysis and bounds, without describing any experimental setup, hyperparameters, or training configurations.