Truncated Emphatic Temporal Difference Methods for Prediction and Control
Authors: Shangtong Zhang, Shimon Whiteson
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically investigate the proposed truncated emphatic TD methods, focusing on the effect of n. The implementation is made publicly available to facilitate future research. We first use Baird's counterexample as the benchmark, which is illustrated in Figure 1. We consider three different settings: prediction, control with a fixed behavior policy, and control with a changing behavior policy. For the prediction setting, we consider a behavior policy µ(solid\|s) = 1/7 and µ(dashed\|s) = 6/7, which is the same as the behavior policy used in Sutton and Barto (2018). We consider different target policies from π(dashed\|s) = 0 to π(dashed\|s) = 0.1. We consider linear function approximation, where the features and the initialization of the weight vector are the same as Section D.2 of Zhang et al. (2021b). We benchmark Algorithm 1 with different selections of n. When n = ∞, Algorithm 1 reduces to the original ETD(0). When n = 0, Algorithm 1 reduces to the naive off-policy TD. We use a fixed learning rate α, which is tuned from Λα ≜ {0.1 × 2⁰, 0.1 × 2⁻¹, …, 0.1 × 2⁻¹⁹} for each n, with 30 independent runs. We report learning curves with the learning rate minimizing the value prediction error at the end of training. |
| Researcher Affiliation | Academia | Shangtong Zhang EMAIL Department of Computer Science University of Oxford Wolfson Building, Parks Rd, Oxford, OX1 3QD, UK Shimon Whiteson EMAIL Department of Computer Science University of Oxford Wolfson Building, Parks Rd, Oxford, OX1 3QD, UK |
| Pseudocode | Yes | Algorithm 1: Truncated Emphatic TD Algorithm 2: Projected Truncated Emphatic TD Algorithm 3: Truncated Emphatic Expected SARSA Algorithm 4: Projected Truncated Emphatic Expected SARSA |
| Open Source Code | Yes | The implementation is made publicly available to facilitate future research. (Footnote 3: https://github.com/ShangtongZhang/DeepRL) |
| Open Datasets | Yes | We first use Baird's counterexample as the benchmark, which is illustrated in Figure 1. We further evaluate Truncated Emphatic TD methods in the Cart Pole domain (Figure 5), which is a classical non-synthetic control problem. |
| Dataset Splits | No | The paper uses benchmark environments like "Baird's counterexample" and "Cart Pole domain," which are typically simulated, rather than pre-collected datasets that require explicit train/test/validation splits. The experimental setup describes how policies are evaluated over steps and episodes, but not how a static dataset is partitioned. |
| Hardware Specification | No | The experiments were made possible by a generous equipment grant from NVIDIA. |
| Software Dependencies | No | We use tile coding (Sutton, 1995) to map the four-dimensional observation (velocity, acceleration, angular velocity, angular acceleration) to a binary vector in R^1024 and then apply linear function approximation. In particular, we use the tile coding software recommended in Chapter 10.1 of Sutton and Barto (2018). |
| Experiment Setup | Yes | We use a fixed learning rate α, which is tuned from Λα ≜ {0.1 × 2⁰, 0.1 × 2⁻¹, …, 0.1 × 2⁻¹⁹} for each n, with 30 independent runs. We tune β in {0.1, 0.2, 0.4, 0.8}. For each β, we tune the learning rate α in Λα as before. The interest is 1 for all states (i.e., i(s) ≡ 1, ∀s). We use γ = 0.99 and i(s) = 1. The target policy is a softmax policy with temperature τ = 0.01. The behavior policy is an ϵ-softmax policy with ϵ = 0.95 and τ = 1. In other words, at each time step, with probability 0.95, the agent selects an action according to a uniformly random policy; with probability 0.05, the agent selects an action according to a softmax policy with temperature τ = 1. We evaluate the agent every 5 × 10³ steps during the training process for 10 episodes and report the averaged undiscounted episodic return. |
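The quoted experiment setup can be made concrete with a minimal Python sketch of the ϵ-softmax behavior policy and the learning-rate grid Λα. The function names and the NumPy-based structure are illustrative assumptions, not taken from the paper's released implementation.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """Softmax (Boltzmann) distribution over actions with temperature tau."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()  # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def epsilon_softmax_action(q_values, rng, epsilon=0.95, tau=1.0):
    """Behavior policy as described in the quote: with probability epsilon,
    act uniformly at random; otherwise sample from a softmax with
    temperature tau."""
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(rng.choice(n_actions, p=softmax_policy(q_values, tau)))

# Learning-rate grid as quoted: {0.1 * 2**0, 0.1 * 2**-1, ..., 0.1 * 2**-19}
learning_rates = [0.1 * 2.0 ** (-k) for k in range(20)]
```

In this reading of the grid, the 20 candidate learning rates decrease geometrically from 0.1, and each n (and each β in the control setting) is tuned over the same grid independently.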