Off-policy Learning With Eligibility Traces: A Survey
Authors: Matthieu Geist, Bruno Scherrer
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments suggest that the most standard algorithms (on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) if the feature space dimension is too large for a least-squares approach) perform the best. ... This section aims at empirically comparing the surveyed algorithms. |
| Researcher Affiliation | Academia | Matthieu Geist, IMS-MaLIS Research Group & UMI 2958 (GeorgiaTech-CNRS), Supélec, 2 rue Edouard Belin, 57070 Metz, France. Bruno Scherrer, MAIA project-team, INRIA Lorraine, 615 rue du Jardin Botanique, 54600 Villers-lès-Nancy, France |
| Pseudocode | Yes | Algorithm 1: Off-policy LSTD(λ)... Algorithm 2: Off-policy LSPE(λ)... Algorithm 3: Off-policy FPKF(λ)... Algorithm 4: Off-policy BRM(λ)... Algorithm 5: Off-policy TD(λ)... Algorithm 6: Off-policy TDC(λ), also known as GQ(λ)... Algorithm 7: Off-policy GTD2(λ)... Algorithm 8: Off-policy gBRM(λ) |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | Yes | More precisely, we consider Garnet problems (Archibald et al., 1995), which are a class of randomly constructed finite MDPs. |
| Dataset Splits | Yes | For each problem, we generate one trajectory of length 10^4 using the behavioral policy... Finally, for each case, for all problems and each algorithm, we choose the combination of meta-parameters which minimizes the average error on the last one-tenth of the averaged (over all problems) learning curves (we do this to reduce the sensitivity to the initialization and the transient behavior). |
| Hardware Specification | No | The paper does not provide any specific hardware details for running its experiments. |
| Software Dependencies | No | The paper does not provide any specific ancillary software details with version numbers. |
| Experiment Setup | Yes | For all algorithms, we choose θ0 = 0. For least-squares algorithms (LSTD, LSPE, FPKF and BRM), we set the initial matrices (M0, N0, C0) to 10^3 · I... We use the following schedule for the learning rates: αi = α0 · αc/(αc + i) and βi = β0 · (βc/(βc + i))^(2/3). ... For each meta-parameter, we consider the following ranges of values: λ ∈ {0, 0.4, 0.7, 0.9, 1}, α0 ∈ {10^-2, 10^-1, 10^0}, αc ∈ {10^1, 10^2, 10^3}, β0 ∈ {10^-2, 10^-1, 10^0} and βc ∈ {10^1, 10^2, 10^3}. |
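The Garnet construction cited in the Open Datasets row (Archibald et al., 1995) randomly generates finite MDPs. A minimal sketch of such a generator is below; the function name, parameter defaults, branching factor, and state-based reward convention are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def make_garnet(n_states=30, n_actions=4, branching=2, seed=0):
    """Randomly construct a finite MDP in the Garnet style:
    each (state, action) pair transitions to `branching` distinct
    successor states with random probabilities, and rewards are
    drawn at random. All defaults here are assumptions."""
    rng = np.random.default_rng(seed)
    # Transition tensor P[s, a, s'] = Pr(s' | s, a)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # Pick `branching` distinct successors for (s, a)
            succ = rng.choice(n_states, size=branching, replace=False)
            # Random probabilities over those successors (sum to 1)
            P[s, a, succ] = rng.dirichlet(np.ones(branching))
    # One random reward per state (a common Garnet convention)
    R = rng.standard_normal(n_states)
    return P, R
```

Each row of the transition tensor is a valid distribution, which can be checked by summing over the successor axis.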
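The learning-rate schedules and meta-parameter grid quoted in the Experiment Setup row can be transcribed directly; the helper names below are my own, but the formulas and grid values are taken from the quoted text:

```python
from itertools import product

def alpha(i, alpha0, alpha_c):
    # Schedule from the paper: α_i = α0 · αc / (αc + i)
    return alpha0 * alpha_c / (alpha_c + i)

def beta(i, beta0, beta_c):
    # Schedule from the paper: β_i = β0 · (βc / (βc + i))^(2/3)
    return beta0 * (beta_c / (beta_c + i)) ** (2.0 / 3.0)

# Meta-parameter ranges as stated in the experiment setup
lambdas  = [0, 0.4, 0.7, 0.9, 1]
alpha0s  = [1e-2, 1e-1, 1e0]
alpha_cs = [1e1, 1e2, 1e3]
beta0s   = [1e-2, 1e-1, 1e0]
beta_cs  = [1e1, 1e2, 1e3]

# Full grid over which each algorithm's best combination is selected
grid = list(product(lambdas, alpha0s, alpha_cs, beta0s, beta_cs))
```

Both schedules start at their base rate (α_0 at step 0) and decay as the step index grows, with αc and βc controlling how long the rate stays near its initial value.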