An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
Authors: Richard S. Sutton, A. Rupam Mahmood, Martha White
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical examples elucidating the main theoretical results are presented in the last section prior to the conclusion. The thin blue lines in Figure 3 (left) show the trajectories of the single parameter θ over time in 50 runs on this problem with λ=0 and α=0.001, starting at θ=1.0. Finally, Figure 4 shows trajectories for the 5-state example shown earlier (and again in the upper part of the figure). |
| Researcher Affiliation | Academia | Richard S. Sutton, A. Rupam Mahmood, Martha White; Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8 |
| Pseudocode | No | The paper describes the Emphatic TD(λ) algorithm using mathematical equations (17-20) but does not present it in a clearly labeled or formatted pseudocode block. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to a code repository. |
| Open Datasets | No | The paper uses illustrative synthetic examples, such as the 'θ→2θ problem' and a '5-state chain MDP', which are fully described within the text. It does not refer to any established public datasets or provide access information for any external data sources. |
| Dataset Splits | No | The empirical examples describe repeated simulation runs on synthetic problems (e.g., '50 runs' or 'Twenty learning curves'). This does not involve traditional training/validation/test dataset splits, and no such split information is provided. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not list any specific software dependencies or their version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | The thin blue lines in Figure 3 (left) show the trajectories of the single parameter θ over time in 50 runs on this problem with λ=0 and α=0.001, starting at θ=1.0. Off-policy TD(0), on the other hand, diverged to infinity in all individual runs. For comparison, Figure 3 (right) shows trajectories for a θ→2θ problem in which Ft and all the other variables and their variances are bounded. In this problem... we used a smaller step size, α = 0.0001; other settings were unchanged. For the 5-state example: λ = 0, θ₀ = 0, α = 0.001, and i(s) = 1 for all s. |
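Although the paper gives Emphatic TD(λ) only as equations (17-20) rather than pseudocode, the update is straightforward to transcribe. The sketch below is an illustrative reconstruction from those equations, not code from the authors; the function name and argument layout are our own, and `rho_prev` denotes the previous step's importance-sampling ratio used by the follow-on trace.

```python
import numpy as np

def emphatic_td_lambda_step(theta, e, F, rho_prev, phi, phi_next,
                            reward, rho, gamma, lam, alpha, i_t):
    """One Emphatic TD(lambda) update (transcribed from eqs. 17-20).

    theta    : weight vector
    e        : eligibility trace vector
    F        : scalar follow-on trace
    rho_prev : importance-sampling ratio from the previous step
    i_t      : interest assigned to the current state
    """
    F = gamma * rho_prev * F + i_t                  # follow-on trace
    M = lam * i_t + (1.0 - lam) * F                 # emphasis
    e = rho * (gamma * lam * e + M * phi)           # emphatic eligibility trace
    delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
    theta = theta + alpha * delta * e               # weight update
    return theta, e, F

# Example step with hypothetical values (gamma=0.9, lam=0, alpha=0.1):
theta, e, F = emphatic_td_lambda_step(
    theta=np.array([1.0]), e=np.zeros(1), F=0.0, rho_prev=1.0,
    phi=np.array([1.0]), phi_next=np.array([2.0]),
    reward=0.0, rho=1.0, gamma=0.9, lam=0.0, alpha=0.1, i_t=1.0)
```

With λ=0 the emphasis reduces to the follow-on trace F, which is the mechanism the paper's Figure 3 experiments probe: the variance of F is unbounded in the left example and bounded in the θ→2θ example on the right.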