Off-Policy Actor-Critic with Emphatic Weightings

Authors: Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested."
Researcher Affiliation | Academia | Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White; Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
Pseudocode | Yes | Algorithm 1: Online Actor-Critic with Emphatic Weightings (ACE)
Open Source Code | Yes | "Code is available at: https://github.com/gravesec/actor-critic-with-emphatic-weightings."
Open Datasets | Yes | "Next, we revisit the simple counterexample from Section 8.1 to examine how the trade-off parameter can affect the learned policy. Next, we illustrate some issues with using the emphatic trace in the ACE algorithm by testing it on a modified version of the counterexample. We then move to two classic control environments to study the two estimators discussed in Section 5.2 and their effects on the learned policy. Finally, we test several variants of ACE on a challenging environment designed to illustrate the issues associated with both estimators." The classic control environments are Puddle World (Degris et al., 2012b) and Mountain Car (Moore, 1990). The final environment was implemented using Minigrid (Chevalier-Boisvert et al., 2018) and is depicted in Figure 9.
Dataset Splits | No | "Each combination of parameters for each algorithm was run on 5 different trajectories generated by a fixed behaviour policy interacting with the environment for 100,000 time steps. The learned policies were saved every 1,000 time steps and evaluated 50 times using both the episodic and excursions objective functions from Section 3. For the episodic objective function, the policies were evaluated by creating 50 different instances of the environment and executing the target policy from the starting state until termination or until 1,000 time steps had elapsed. The excursions objective function was evaluated similarly, but with the environment's starting state drawn from the behaviour policy's steady-state distribution, chosen by running the behaviour policy for 50,000 time steps and saving every thousandth state." This text describes the evaluation methodology and how data is generated during experiments, but it does not provide specific training/validation/test splits for a predefined dataset.
Hardware Specification | No | "This research was enabled in part by support provided by SciNet and the Digital Research Alliance of Canada." This acknowledges computational support but does not specify the particular hardware (e.g., GPU or CPU models, or memory) used for the experiments.
Software Dependencies | No | "The final environment was implemented using Minigrid (Chevalier-Boisvert et al., 2018)." While Minigrid is mentioned, no version numbers are provided for it or for any other key software libraries (e.g., Python, PyTorch/TensorFlow, NumPy).
Experiment Setup | Yes | "For the step size parameter of the actor, critic, and the direct estimator of emphatic weightings, we tested values of the form 1/(2^i) where i ranged from 0 to 15. For the trace decay rate of the critic, we tested values of the form 1 - 1/(2^j) where j ranged from 0 to 6. The discount factor was .95."
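The paper's Algorithm 1 is not reproduced in this report. As an orientation aid only, here is a minimal sketch of one online emphatic-weighting actor-critic step, assuming a linear TD(0) critic, a softmax policy over shared features, interest i(S_t) = 1, and a trade-off parameter eta between the OffPAC-style update (eta = 0) and the fully emphatic update (eta = 1). All names, defaults, and the exact update form are illustrative, not the authors' implementation.

```python
import numpy as np

def ace_update(theta, w, F, x_t, x_next, a_t, r, rho_t, rho_prev,
               gamma=0.95, eta=1.0, alpha_actor=0.01, alpha_critic=0.01):
    """One online ACE-style step (sketch, not the paper's exact algorithm)."""
    # Followon (emphatic) trace: discounted accumulation of importance ratios,
    # with interest i(S_t) = 1 for every state.
    F = rho_prev * gamma * F + 1.0
    # Emphatic weighting: eta trades off OffPAC (eta=0) vs. full weighting (eta=1).
    M = (1.0 - eta) + eta * F

    # Linear critic: one-step TD error and off-policy TD(0) update.
    delta = r + gamma * w.dot(x_next) - w.dot(x_t)
    w = w + alpha_critic * rho_t * delta * x_t

    # Softmax policy gradient: grad log pi(a_t | s_t) for a linear-in-features
    # preference model, theta with shape (n_actions, n_features).
    prefs = theta.dot(x_t)
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    grad_log = -np.outer(pi, x_t)
    grad_log[a_t] += x_t

    # Emphatically weighted actor update.
    theta = theta + alpha_actor * rho_t * M * delta * grad_log
    return theta, w, F
```

With eta = 0 the weighting M is constant at 1, recovering an OffPAC-style semi-gradient; with eta = 1 the update is scaled by the followon trace F.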
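The excursions-objective evaluation described in the Dataset Splits row (run the behaviour policy for 50,000 steps and keep every thousandth state as a candidate start state) can be sketched as follows. The `env_step`, `env_reset`, and `behaviour_policy` callables are placeholders for whatever environment interface is used; this is an assumption about the protocol's shape, not the authors' code.

```python
def sample_excursion_starts(env_step, env_reset, behaviour_policy,
                            n_steps=50_000, save_every=1_000):
    """Approximate the behaviour policy's steady-state distribution by
    running it for n_steps and saving every save_every-th state.

    env_reset() -> state; env_step(state, action) -> (next_state, done).
    Returns n_steps // save_every candidate start states (50 by default).
    """
    starts = []
    s = env_reset()
    for t in range(1, n_steps + 1):
        s, done = env_step(s, behaviour_policy(s))
        if done:
            s = env_reset()
        if t % save_every == 0:
            starts.append(s)
    return starts
```

Evaluation of the excursions objective then executes the target policy from each saved state, analogously to the episodic evaluation from the designated start state.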