Off-Policy Actor-Critic with Emphatic Weightings

Authors: Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested."
Researcher Affiliation | Academia | Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White; Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
Pseudocode | Yes | Algorithm 1: Online Actor-Critic with Emphatic Weightings (ACE)
Open Source Code | Yes | "Code is available at: https://github.com/gravesec/actor-critic-with-emphatic-weightings."
Open Datasets | Yes | "Next, we revisit the simple counterexample from Section 8.1 to examine how the trade-off parameter can affect the learned policy. Next, we illustrate some issues with using the emphatic trace in the ACE algorithm by testing it on a modified version of the counterexample. We then move to two classic control environments to study the two estimators discussed in Section 5.2 and their effects on the learned policy. Finally, we test several variants of ACE on a challenging environment designed to illustrate the issues associated with both estimators." The classic control environments are Puddle World (Degris et al., 2012b) and Mountain Car (Moore, 1990). The final environment was implemented using Minigrid (Chevalier-Boisvert et al., 2018) and is depicted in Figure 9.
Dataset Splits | No | "Each combination of parameters for each algorithm was run on 5 different trajectories generated by a fixed behaviour policy interacting with the environment for 100,000 time steps. The learned policies were saved every 1,000 time steps and evaluated 50 times using both the episodic and excursions objective functions from Section 3. For the episodic objective function, the policies were evaluated by creating 50 different instances of the environment and executing the target policy from the starting state until termination or until 1,000 time steps had elapsed. The excursions objective function was evaluated similarly, but with the environment's starting state drawn from the behaviour policy's steady-state distribution, chosen by running the behaviour policy for 50,000 time steps and saving every thousandth state." This text describes the evaluation methodology and how data is generated during experiments, but it does not provide specific training/validation/test splits for a predefined dataset.
Hardware Specification | No | "This research was enabled in part by support provided by SciNet and the Digital Research Alliance of Canada." This acknowledges computational support but does not specify the particular hardware (e.g., GPU or CPU models, or memory) used for the experiments.
Software Dependencies | No | "The final environment was implemented using Minigrid (Chevalier-Boisvert et al., 2018)." While Minigrid is mentioned, no version numbers are provided for it or for any other key software libraries (e.g., Python, PyTorch/TensorFlow, NumPy).
Experiment Setup | Yes | "For the step size parameter of the actor, critic, and the direct estimator of emphatic weightings, we tested values of the form 1/(2^i) where i ranged from 0 to 15. For the trace decay rate of the critic, we tested values of the form 1 - 1/(2^j) where j ranged from 0 to 6. The discount factor was .95."
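The paper's Algorithm 1 is not reproduced in this report. As an orientation aid only, here is a minimal sketch of one online emphatic-weighting actor-critic step, assuming a linear TD(0) critic, a softmax policy over shared features, interest i(S_t) = 1, and a trade-off parameter eta between the OffPAC-style update (eta = 0) and the fully emphatic update (eta = 1). All names, defaults, and the exact update form are illustrative, not the authors' implementation.

```python
import numpy as np

def ace_update(theta, w, F, x_t, x_next, a_t, r, rho_t, rho_prev,
               gamma=0.95, eta=1.0, alpha_actor=0.01, alpha_critic=0.01):
    """One online ACE-style step (sketch, not the paper's exact algorithm)."""
    # Followon (emphatic) trace: discounted accumulation of importance ratios,
    # with interest i(S_t) = 1 for every state.
    F = rho_prev * gamma * F + 1.0
    # Emphatic weighting: eta trades off OffPAC (eta=0) vs. full weighting (eta=1).
    M = (1.0 - eta) + eta * F

    # Linear critic: one-step TD error and off-policy TD(0) update.
    delta = r + gamma * w.dot(x_next) - w.dot(x_t)
    w = w + alpha_critic * rho_t * delta * x_t

    # Softmax policy gradient: grad log pi(a_t | s_t) for a linear-in-features
    # preference model, theta with shape (n_actions, n_features).
    prefs = theta.dot(x_t)
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    grad_log = -np.outer(pi, x_t)
    grad_log[a_t] += x_t

    # Emphatically weighted actor update.
    theta = theta + alpha_actor * rho_t * M * delta * grad_log
    return theta, w, F
```

With eta = 0 the weighting M is constant at 1, recovering an OffPAC-style semi-gradient; with eta = 1 the update is scaled by the followon trace F.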
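The excursions-objective evaluation described in the Dataset Splits row (run the behaviour policy for 50,000 steps and keep every thousandth state as a candidate start state) can be sketched as follows. The `env_step`, `env_reset`, and `behaviour_policy` callables are placeholders for whatever environment interface is used; this is an assumption about the protocol's shape, not the authors' code.

```python
def sample_excursion_starts(env_step, env_reset, behaviour_policy,
                            n_steps=50_000, save_every=1_000):
    """Approximate the behaviour policy's steady-state distribution by
    running it for n_steps and saving every save_every-th state.

    env_reset() -> state; env_step(state, action) -> (next_state, done).
    Returns n_steps // save_every candidate start states (50 by default).
    """
    starts = []
    s = env_reset()
    for t in range(1, n_steps + 1):
        s, done = env_step(s, behaviour_policy(s))
        if done:
            s = env_reset()
        if t % save_every == 0:
            starts.append(s)
    return starts
```

Evaluation of the excursions objective then executes the target policy from each saved state, analogously to the episodic evaluation from the designated start state.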