Two-Level Actor-Critic Using Multiple Teachers

Authors: Su Zhang, Srijita Das, Sriram Ganapathi Subramanian, Matthew E. Taylor

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experimental results on a discrete control task (Door Key) and two continuous control robotics tasks (Hopper and Pick & Place) to demonstrate the effectiveness of our method. The experiments are designed to investigate the following two research questions: R1: Can the two-level actor-critic method handle a mixture of sub-optimal teachers relative to existing methods? R2: Can the two-level actor-critic method incorporate multiple partial teachers with different areas of expertise?
Researcher Affiliation | Academia | Su Zhang, Washington State University, Pullman, United States; Srijita Das, University of Alberta, Edmonton, Canada; Sriram Ganapathi Subramanian, Vector Institute, Toronto, Canada; Matthew E. Taylor, University of Alberta and Alberta Machine Intelligence Institute, Edmonton, Canada
Pseudocode | Yes | Algorithm 1: Two-Level Actor-Critic using Multiple Teachers
Open Source Code | No | The A2C baseline uses the default settings of Stable Baselines3 (Raffin et al., 2019). For DQN-TLQL, since we do not have access to the original codebase, we refer to Stable Baselines3 and implement the DQN version of the two-level Q-learning algorithm as described in (Li et al., 2019b). The implementation and parameter settings of AC-Teach are from the original codebase of Kurenkov et al. (2019), available at https://github.com/StanfordVL/ac-teach.
Open Datasets | Yes | The Door & Key environment is a grid-room environment from the Minimalistic Gridworld Environment (MiniGrid) (Chevalier-Boisvert et al., 2018). Hopper is a two-dimensional one-legged robot whose task is to hop forward as far as possible; we use HopperPyBulletEnv-v0 from the PyBullet Gymperium (Ellenberger, 2018-2019) for the experiments. Pick & Place is a robotic manipulation task from the Fetch environments (Plappert et al., 2018).
Dataset Splits | No | The mean reward of this policy over 1000 testing episodes is 1928 and the standard deviation is 646. The paper evaluates performance over episodes in RL environments, but does not provide specific training/validation/test splits for a static dataset as requested.
Hardware Specification | No | Due to the limited computing resources and the observation of its relatively steady performance, the AC-Teach results are averaged over 3 seeds. The paper mentions "limited computing resources" but does not specify any particular hardware, such as GPU/CPU models or memory.
Software Dependencies | No | The A2C baseline uses the default settings of Stable Baselines3 (Raffin et al., 2019). The paper names Stable Baselines3 but does not provide a specific version number for it, nor for any other key software dependency.
Experiment Setup | Yes | Appendix B (Hyperparameter Settings) provides: Table 4: Hyperparameters of A2C in Door Key; Table 5: Hyperparameters of TL-AC in Door Key; Table 6: Hyperparameters of DQN-TLQL in Door Key; Table 7: Hyperparameters of A2C in Hopper; Table 8: Hyperparameters of TL-AC in Hopper; Table 9: Hyperparameters of A2C in Pick & Place; Table 10: Hyperparameters of TL-AC in Pick & Place