Two-Level Actor-Critic Using Multiple Teachers
Authors: Su Zhang, Srijita Das, Sriram Ganapathi Subramanian, Matthew E. Taylor
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experimental results on a discrete control task (Door Key) and two continuous control robotics tasks (Hopper and Pick & Place) to demonstrate the effectiveness of our method. The experiments are designed to investigate the following two research questions: R1: Can the two-level actor-critic method handle a mixture of sub-optimal teachers relative to existing methods? R2: Can the two-level actor-critic method incorporate multiple partial teachers with different areas of expertise? |
| Researcher Affiliation | Academia | Su Zhang (Washington State University, Pullman, United States); Srijita Das (University of Alberta, Edmonton, Canada); Sriram Ganapathi Subramanian (Vector Institute, Toronto, Canada); Matthew E. Taylor (University of Alberta and Alberta Machine Intelligence Institute, Edmonton, Canada) |
| Pseudocode | Yes | Algorithm 1: Two-Level Actor-Critic using Multiple Teachers |
| Open Source Code | No | The A2C baseline uses the default settings of Stable Baselines3 (Raffin et al., 2019). For DQN-TLQL, since we do not have access to the original codebase, we refer to Stable Baselines3 and implement the DQN version of the Two-Level Q-Learning algorithm as described in (Li et al., 2019b). The implementation and parameter settings of AC-Teach are from the original codebase of Kurenkov et al.'s work (Kurenkov et al., 2019): https://github.com/StanfordVL/ac-teach |
| Open Datasets | Yes | The Door & Key environment is a grid-room environment from the Minimalistic Gridworld Environment (MiniGrid) (Chevalier-Boisvert et al., 2018). Hopper is a two-dimensional one-legged robot, and the task is to hop forward as far as possible. We use the HopperPyBulletEnv-v0 environment from PyBullet Gymperium (Ellenberger, 2018-2019) for the experiments. Pick & Place is a robotic manipulation task from the Fetch environments (Plappert et al., 2018). |
| Dataset Splits | No | The mean reward of this policy over 1000 testing episodes is 1928 and the standard deviation is 646. The paper refers to environments and evaluates performance over episodes, but does not provide specific training/test/validation splits for a static dataset as requested. |
| Hardware Specification | No | Due to the limited computing resources and the observation of its relatively steady performance, the AC-Teach results are averaged over 3 seeds. The paper mentions "limited computing resources" but does not specify any particular hardware like GPU/CPU models or memory. |
| Software Dependencies | No | The A2C baseline uses the default settings of Stable Baselines3 (Raffin et al., 2019). The paper mentions Stable Baselines3 but does not provide a specific version number for it, nor for any other key software dependencies. |
| Experiment Setup | Yes | B Hyperparameter Settings: Table 4: Hyperparameters of A2C in Door Key, Table 5: Hyperparameters of TL-AC in Door Key, Table 6: Hyperparameters of DQN-TLQL in Door Key, Table 7: Hyperparameters of A2C in Hopper, Table 8: Hyperparameters of TL-AC in Hopper, Table 9: Hyperparameters of A2C in Pick & Place, Table 10: Hyperparameters of TL-AC in Pick & Place |
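To make the "two-level" structure referenced in Algorithm 1 concrete, below is a minimal, self-contained sketch of the general idea on a toy task. This is an illustrative assumption, not the authors' implementation: a high-level policy learns preferences over advice sources (two hypothetical teachers plus the student), reinforcing whichever source yields reward, while the low-level student policy learns from the actions it executes. The teachers, the one-dimensional "move right" task, and all hyperparameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical advice sources: one helpful teacher, one sub-optimal teacher.
def good_teacher(state):
    return +1.0  # always advises moving right (rewarded)

def bad_teacher(state):
    return -1.0  # always advises moving left (penalized)

teachers = [good_teacher, bad_teacher]

# High-level preferences over {teacher 0, teacher 1, student}.
high_prefs = np.zeros(len(teachers) + 1)
student_action = 0.0  # the student's (scalar) low-level policy
alpha = 0.1           # shared learning rate for this toy example

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(500):
    probs = softmax(high_prefs)
    choice = rng.choice(len(probs), p=probs)
    # Execute the chosen source's action; reward = progress to the right.
    if choice < len(teachers):
        action = teachers[choice](0.0)
    else:
        action = student_action
    reward = action
    # High-level update (policy-gradient style): reinforce rewarding sources.
    grad = -probs
    grad[choice] += 1.0
    high_prefs += alpha * reward * grad
    # Low-level update: nudge the student toward rewarded actions.
    if choice == len(teachers):
        student_action += alpha * reward

probs = softmax(high_prefs)
# After training, the high-level policy should strongly prefer the good teacher.
```

The key design point this sketch mirrors is that teacher selection is itself a learned policy, so a sub-optimal teacher is simply selected less often over time rather than corrupting the student directly.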