Off-policy Learning With Eligibility Traces: A Survey

Authors: Matthieu Geist, Bruno Scherrer

JMLR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments suggest that the most standard algorithms perform the best: on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) when the feature-space dimension is too large for a least-squares approach. ... This section aims at empirically comparing the surveyed algorithms.
Researcher Affiliation | Academia | Matthieu Geist, EMAIL, IMS-MaLIS Research Group & UMI 2958 (GeorgiaTech-CNRS), Supélec, 2 rue Edouard Belin, 57070 Metz, France. Bruno Scherrer, EMAIL, MAIA project-team, INRIA Lorraine, 615 rue du Jardin Botanique, 54600 Villers-lès-Nancy, France.
Pseudocode | Yes | Algorithm 1: Off-policy LSTD(λ)... Algorithm 2: Off-policy LSPE(λ)... Algorithm 3: Off-policy FPKF(λ)... Algorithm 4: Off-policy BRM(λ)... Algorithm 5: Off-policy TD(λ)... Algorithm 6: Off-policy TDC(λ), also known as GQ(λ)... Algorithm 7: Off-policy GTD2(λ)... Algorithm 8: Off-policy gBRM(λ)
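To give the flavor of these stochastic-gradient-style algorithms, here is a minimal sketch of one common variant of off-policy TD(λ) with importance-sampling-corrected eligibility traces. The function name, argument layout, and the exact placement of the correction ratio ρ are assumptions; the paper's Algorithm 5 may differ in detail.

```python
import numpy as np

def off_policy_td_lambda(features, rewards, rhos, alpha=0.05, gamma=0.95, lam=0.7):
    """One common form of off-policy TD(lambda) with importance-weighted traces.

    features: array (T+1, d) of feature vectors phi(s_0), ..., phi(s_T)
    rewards:  array (T,) of observed rewards
    rhos:     array (T,) of importance ratios pi(a_t|s_t) / pi_b(a_t|s_t)
    """
    d = features.shape[1]
    theta = np.zeros(d)      # theta_0 = 0, matching the paper's initialization
    z = np.zeros(d)          # eligibility trace
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        # importance-weighted trace update (one of several conventions)
        z = rhos[t] * (gamma * lam * z + phi)
        # standard TD error under the current parameter vector
        delta = rewards[t] + gamma * theta @ phi_next - theta @ phi
        theta = theta + alpha * delta * z
    return theta
```

With ρ ≡ 1 this reduces to ordinary on-policy TD(λ) with linear function approximation.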
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | Yes | More precisely, we consider Garnet problems (Archibald et al., 1995), which are a class of randomly constructed finite MDPs.
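Garnet problems are typically parameterized by a number of states, a number of actions, and a branching factor. A minimal sketch of one common construction recipe (the exact recipe used in the paper, e.g. the reward distribution, is an assumption here):

```python
import numpy as np

def make_garnet(n_states=30, n_actions=4, branching=2, seed=0):
    """Randomly construct a Garnet-style finite MDP.

    Assumed recipe: each (state, action) pair transitions to `branching`
    distinct successor states with probabilities drawn by randomly
    partitioning [0, 1]; rewards are i.i.d. Gaussian per state.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            # random partition of [0, 1] into `branching` probabilities
            cuts = np.sort(rng.uniform(size=branching - 1))
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.normal(size=n_states)  # state-dependent rewards (assumption)
    return P, R
```

Each row of the transition tensor is a valid probability distribution by construction.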
Dataset Splits | Yes | For each problem, we generate one trajectory of length 10^4 using the behavioral policy... Finally, for each case, for all problems and each algorithm, we choose the combination of meta-parameters that minimizes the average error on the last one-tenth of the averaged (over all problems) learning curves (we do this to reduce sensitivity to the initialization and the transient behavior).
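The selection rule described above (average the learning curves over problems, then score each meta-parameter combination by its mean error over the last tenth of the steps) can be sketched as follows; the function and array names are hypothetical.

```python
import numpy as np

def select_meta_parameters(errors):
    """Pick the meta-parameter combination per the paper's rule.

    errors: array (n_params, n_problems, n_steps) of per-step errors
            for each meta-parameter combination on each problem.
    Returns the index of the best combination and all scores.
    """
    # average the learning curves over all problems
    curves = errors.mean(axis=1)               # (n_params, n_steps)
    # score by the mean error over the last one-tenth of the steps,
    # reducing sensitivity to initialization and transients
    tail = max(1, curves.shape[1] // 10)
    scores = curves[:, -tail:].mean(axis=1)    # (n_params,)
    return int(np.argmin(scores)), scores
```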
Hardware Specification | No | The paper does not provide any specific hardware details for running its experiments.
Software Dependencies | No | The paper does not provide any specific ancillary software details with version numbers.
Experiment Setup | Yes | For all algorithms, we choose θ_0 = 0. For least-squares algorithms (LSTD, LSPE, FPKF and BRM), we set the initial matrices (M_0, N_0, C_0) to 10^3 I... We use the following schedules for the learning rates: α_i = α_0 α_c / (α_c + i) and β_i = β_0 (β_c / (β_c + i))^(2/3). ... For each meta-parameter, we consider the following ranges of values: λ ∈ {0, 0.4, 0.7, 0.9, 1}, α_0 ∈ {10^-2, 10^-1, 10^0}, α_c ∈ {10^1, 10^2, 10^3}, β_0 ∈ {10^-2, 10^-1, 10^0} and β_c ∈ {10^1, 10^2, 10^3}.
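The two step-size schedules, α_i = α_0 α_c / (α_c + i) and β_i = β_0 (β_c / (β_c + i))^(2/3), and the meta-parameter grid can be written down directly; the function names and the dictionary layout are illustrative choices, not from the paper.

```python
import numpy as np

def alpha_schedule(i, alpha0, alpha_c):
    """Primary step size: alpha_i = alpha_0 * alpha_c / (alpha_c + i)."""
    return alpha0 * alpha_c / (alpha_c + i)

def beta_schedule(i, beta0, beta_c):
    """Secondary step size: beta_i = beta_0 * (beta_c / (beta_c + i))^(2/3)."""
    return beta0 * (beta_c / (beta_c + i)) ** (2.0 / 3.0)

# meta-parameter grid searched in the paper's experiments
grid = {
    "lambda":  [0.0, 0.4, 0.7, 0.9, 1.0],
    "alpha_0": [1e-2, 1e-1, 1e0],
    "alpha_c": [1e1, 1e2, 1e3],
    "beta_0":  [1e-2, 1e-1, 1e0],
    "beta_c":  [1e1, 1e2, 1e3],
}
```

Both schedules start at their base value (α_0, β_0) at i = 0 and decay toward zero, with β_i decaying more slowly (exponent 2/3), as is usual for the second timescale of two-timescale gradient-TD methods.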