Truncated Emphatic Temporal Difference Methods for Prediction and Control
Authors: Shangtong Zhang, Shimon Whiteson
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically investigate the proposed truncated emphatic TD methods, focusing on the effect of n. The implementation is made publicly available to facilitate future research. We first use Baird's counterexample as the benchmark, which is illustrated in Figure 1. We consider three different settings: prediction, control with a fixed behavior policy, and control with a changing behavior policy. For the prediction setting, we consider a behavior policy µ(solid\|s) = 1/7 and µ(dashed\|s) = 6/7, which is the same as the behavior policy used in Sutton and Barto (2018). We consider different target policies from π(dashed\|s) = 0 to π(dashed\|s) = 0.1. We consider linear function approximation, where the features and the initialization of the weight vector are the same as Section D.2 of Zhang et al. (2021b). We benchmark Algorithm 1 with different selections of n. When n = ∞, Algorithm 1 reduces to the original ETD(0). When n = 0, Algorithm 1 reduces to the naive off-policy TD. We use a fixed learning rate α, which is tuned from Λα ≜ {0.1 × 2⁰, 0.1 × 2⁻¹, …, 0.1 × 2⁻¹⁹} for each n, with 30 independent runs. We report learning curves with the learning rate minimizing the value prediction error at the end of training. |
| Researcher Affiliation | Academia | Shangtong Zhang EMAIL Department of Computer Science University of Oxford Wolfson Building, Parks Rd, Oxford, OX1 3QD, UK Shimon Whiteson EMAIL Department of Computer Science University of Oxford Wolfson Building, Parks Rd, Oxford, OX1 3QD, UK |
| Pseudocode | Yes | Algorithm 1: Truncated Emphatic TD Algorithm 2: Projected Truncated Emphatic TD Algorithm 3: Truncated Emphatic Expected SARSA Algorithm 4: Projected Truncated Emphatic Expected SARSA |
| Open Source Code | Yes | The implementation is made publicly available to facilitate future research. (Footnote 3: https://github.com/ShangtongZhang/DeepRL) |
| Open Datasets | Yes | We first use Baird's counterexample as the benchmark, which is illustrated in Figure 1. We further evaluate Truncated Emphatic TD methods in the Cart Pole domain (Figure 5), which is a classical non-synthetic control problem. |
| Dataset Splits | No | The paper uses benchmark environments like "Baird's counterexample" and "Cart Pole domain," which are typically simulated, rather than pre-collected datasets that require explicit train/test/validation splits. The experimental setup describes how policies are evaluated over steps and episodes, but not how a static dataset is partitioned. |
| Hardware Specification | No | The experiments were made possible by a generous equipment grant from NVIDIA. |
| Software Dependencies | No | We use tile coding (Sutton, 1995) to map the four-dimensional observation (velocity, acceleration, angular velocity, angular acceleration) to a binary vector in R^1024 and then apply linear function approximation. In particular, we use the tile coding software recommended in Chapter 10.1 of Sutton and Barto (2018). |
| Experiment Setup | Yes | We use a fixed learning rate α, which is tuned from Λα ≜ {0.1 × 2⁰, 0.1 × 2⁻¹, …, 0.1 × 2⁻¹⁹} for each n, with 30 independent runs. We tune β in {0.1, 0.2, 0.4, 0.8}. For each β, we tune the learning rate α in Λα as before. The interest is 1 for all states (i.e., i(s) ≡ 1, ∀s). We use γ = 0.99 and i(s) = 1. The target policy is a softmax policy with temperature τ = 0.01. The behavior policy is an ϵ-softmax policy with ϵ = 0.95 and τ = 1. In other words, at each time step, with probability 0.95, the agent selects an action according to a uniformly random policy; with probability 0.05, the agent selects an action according to a softmax policy with temperature τ = 1. We evaluate the agent every 5 × 10³ steps during the training process for 10 episodes and report the averaged undiscounted episodic return. |
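The quoted experiment setup can be made concrete with a minimal Python sketch of the ϵ-softmax behavior policy and the learning-rate grid Λα. The function names and the NumPy-based structure are illustrative assumptions, not taken from the paper's released implementation.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """Softmax (Boltzmann) distribution over actions with temperature tau."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()  # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def epsilon_softmax_action(q_values, rng, epsilon=0.95, tau=1.0):
    """Behavior policy as described in the quote: with probability epsilon,
    act uniformly at random; otherwise sample from a softmax with
    temperature tau."""
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(rng.choice(n_actions, p=softmax_policy(q_values, tau)))

# Learning-rate grid as quoted: {0.1 * 2**0, 0.1 * 2**-1, ..., 0.1 * 2**-19}
learning_rates = [0.1 * 2.0 ** (-k) for k in range(20)]
```

In this reading of the grid, the 20 candidate learning rates decrease geometrically from 0.1, and each n (and each β in the control setting) is tuned over the same grid independently.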