Off-policy Learning With Eligibility Traces: A Survey

Authors: Matthieu Geist, Bruno Scherrer

JMLR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments suggest that the most standard algorithms perform the best: on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) when the feature-space dimension is too large for a least-squares approach. ... This section aims at empirically comparing the surveyed algorithms.
Researcher Affiliation | Academia | Matthieu Geist, EMAIL, IMS-MaLIS Research Group & UMI 2958 (GeorgiaTech-CNRS), Supélec, 2 rue Edouard Belin, 57070 Metz, France. Bruno Scherrer, EMAIL, MAIA project-team, INRIA Lorraine, 615 rue du Jardin Botanique, 54600 Villers-lès-Nancy, France.
Pseudocode | Yes | Algorithm 1: Off-policy LSTD(λ)... Algorithm 2: Off-policy LSPE(λ)... Algorithm 3: Off-policy FPKF(λ)... Algorithm 4: Off-policy BRM(λ)... Algorithm 5: Off-policy TD(λ)... Algorithm 6: Off-policy TDC(λ), also known as GQ(λ)... Algorithm 7: Off-policy GTD2(λ)... Algorithm 8: Off-policy gBRM(λ)
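To give the flavor of these stochastic-gradient-style algorithms, here is a minimal sketch of one common variant of off-policy TD(λ) with importance-sampling-corrected eligibility traces. The function name, argument layout, and the exact placement of the correction ratio ρ are assumptions; the paper's Algorithm 5 may differ in detail.

```python
import numpy as np

def off_policy_td_lambda(features, rewards, rhos, alpha=0.05, gamma=0.95, lam=0.7):
    """One common form of off-policy TD(lambda) with importance-weighted traces.

    features: array (T+1, d) of feature vectors phi(s_0), ..., phi(s_T)
    rewards:  array (T,) of observed rewards
    rhos:     array (T,) of importance ratios pi(a_t|s_t) / pi_b(a_t|s_t)
    """
    d = features.shape[1]
    theta = np.zeros(d)      # theta_0 = 0, matching the paper's initialization
    z = np.zeros(d)          # eligibility trace
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        # importance-weighted trace update (one of several conventions)
        z = rhos[t] * (gamma * lam * z + phi)
        # standard TD error under the current parameter vector
        delta = rewards[t] + gamma * theta @ phi_next - theta @ phi
        theta = theta + alpha * delta * z
    return theta
```

With ρ ≡ 1 this reduces to ordinary on-policy TD(λ) with linear function approximation.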
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | Yes | More precisely, we consider Garnet problems (Archibald et al., 1995), which are a class of randomly constructed finite MDPs.
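Garnet problems are typically parameterized by a number of states, a number of actions, and a branching factor. A minimal sketch of one common construction recipe (the exact recipe used in the paper, e.g. the reward distribution, is an assumption here):

```python
import numpy as np

def make_garnet(n_states=30, n_actions=4, branching=2, seed=0):
    """Randomly construct a Garnet-style finite MDP.

    Assumed recipe: each (state, action) pair transitions to `branching`
    distinct successor states with probabilities drawn by randomly
    partitioning [0, 1]; rewards are i.i.d. Gaussian per state.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            # random partition of [0, 1] into `branching` probabilities
            cuts = np.sort(rng.uniform(size=branching - 1))
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.normal(size=n_states)  # state-dependent rewards (assumption)
    return P, R
```

Each row of the transition tensor is a valid probability distribution by construction.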
Dataset Splits | Yes | For each problem, we generate one trajectory of length 10^4 using the behavioral policy... Finally, for each case, for all problems and each algorithm, we choose the combination of meta-parameters that minimizes the average error on the last one-tenth of the averaged (over all problems) learning curves (we do this to reduce sensitivity to the initialization and the transient behavior).
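The selection rule described above (average the learning curves over problems, then score each meta-parameter combination by its mean error over the last tenth of the steps) can be sketched as follows; the function and array names are hypothetical.

```python
import numpy as np

def select_meta_parameters(errors):
    """Pick the meta-parameter combination per the paper's rule.

    errors: array (n_params, n_problems, n_steps) of per-step errors
            for each meta-parameter combination on each problem.
    Returns the index of the best combination and all scores.
    """
    # average the learning curves over all problems
    curves = errors.mean(axis=1)               # (n_params, n_steps)
    # score by the mean error over the last one-tenth of the steps,
    # reducing sensitivity to initialization and transients
    tail = max(1, curves.shape[1] // 10)
    scores = curves[:, -tail:].mean(axis=1)    # (n_params,)
    return int(np.argmin(scores)), scores
```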
Hardware Specification | No | The paper does not provide any specific hardware details for running its experiments.
Software Dependencies | No | The paper does not provide any specific ancillary software details with version numbers.
Experiment Setup | Yes | For all algorithms, we choose θ_0 = 0. For least-squares algorithms (LSTD, LSPE, FPKF and BRM), we set the initial matrices (M_0, N_0, C_0) to 10^3 I... We use the following schedules for the learning rates: α_i = α_0 α_c / (α_c + i) and β_i = β_0 (β_c / (β_c + i))^(2/3). ... For each meta-parameter, we consider the following ranges of values: λ ∈ {0, 0.4, 0.7, 0.9, 1}, α_0 ∈ {10^-2, 10^-1, 10^0}, α_c ∈ {10^1, 10^2, 10^3}, β_0 ∈ {10^-2, 10^-1, 10^0} and β_c ∈ {10^1, 10^2, 10^3}.
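The two step-size schedules, α_i = α_0 α_c / (α_c + i) and β_i = β_0 (β_c / (β_c + i))^(2/3), and the meta-parameter grid can be written down directly; the function names and the dictionary layout are illustrative choices, not from the paper.

```python
import numpy as np

def alpha_schedule(i, alpha0, alpha_c):
    """Primary step size: alpha_i = alpha_0 * alpha_c / (alpha_c + i)."""
    return alpha0 * alpha_c / (alpha_c + i)

def beta_schedule(i, beta0, beta_c):
    """Secondary step size: beta_i = beta_0 * (beta_c / (beta_c + i))^(2/3)."""
    return beta0 * (beta_c / (beta_c + i)) ** (2.0 / 3.0)

# meta-parameter grid searched in the paper's experiments
grid = {
    "lambda":  [0.0, 0.4, 0.7, 0.9, 1.0],
    "alpha_0": [1e-2, 1e-1, 1e0],
    "alpha_c": [1e1, 1e2, 1e3],
    "beta_0":  [1e-2, 1e-1, 1e0],
    "beta_c":  [1e1, 1e2, 1e3],
}
```

Both schedules start at their base value (α_0, β_0) at i = 0 and decay toward zero, with β_i decaying more slowly (exponent 2/3), as is usual for the second timescale of two-timescale gradient-TD methods.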