SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration
Authors: Giulia Vezzani, Dhruva Tirumala, Markus Wulfmeier, Dushyant Rao, Abbas Abdolmaleki, Ben Moran, Tuomas Haarnoja, Jan Humplik, Roland Hafner, Michael Neunert, Claudio Fantacci, Tim Hertweck, Thomas Lampe, Fereshteh Sadeghi, Nicolas Heess, Martin Riedmiller
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate our method in robotic manipulation and locomotion domains. The goal of our evaluation is to study how different skill transfer mechanisms fare in new tasks which require skill reuse. In the manipulation setting (Fig. 3 (left)), we utilise a simulated Sawyer robot arm equipped with a Robotiq gripper and a set of three objects (green, yellow and blue) (Wulfmeier et al., 2020). We consider four tasks of increasing complexity: Lift the green object, Stack green on yellow, building a Pyramid of green on yellow and blue, and Triple stack with green on yellow and yellow on blue. Every harder task leverages previous task solutions as skills: this is a natural transfer setting to test an agent's ability to build on previous knowledge and solve increasingly complex tasks. For locomotion (Fig. 3 (right)), we consider two tasks with a simulated version of the OP3 humanoid robot (Robotis OP3): Get Up And Walk and Goal Scoring. The Get Up And Walk task requires the robot to compose two skills: one to get up off the floor and one to walk. In the Goal Scoring task, the robot gets a sparse reward for scoring a goal, with a wall as an obstacle. The Goal Scoring task uses a single skill but is transferred to a setting with different environment dynamics. This allows us to extend the study beyond skill composition to skill adaptability, both of which are important requirements when operating in the real world. All the considered transfer tasks use sparse rewards, except the Get Up And Walk task, where a dense walking reward is given but only if the robot is standing. We consider an off-policy distributed learning setup with a single actor and experience replay. For each setting we plot the mean performance averaged across 5 seeds, with the shaded region representing one standard deviation. More details on the skills and tasks can be found in Appendix C. |
| Researcher Affiliation | Industry | Giulia Vezzani EMAIL Google DeepMind Dhruva Tirumala EMAIL Google DeepMind Markus Wulfmeier EMAIL Google DeepMind Dushyant Rao EMAIL Google DeepMind Abbas Abdolmaleki EMAIL Google DeepMind Ben Moran EMAIL Google DeepMind Tuomas Haarnoja EMAIL Google DeepMind Jan Humplik EMAIL Google DeepMind Roland Hafner EMAIL Google DeepMind Michael Neunert EMAIL Google DeepMind Claudio Fantacci EMAIL Google DeepMind Tim Hertweck EMAIL Google DeepMind Thomas Lampe EMAIL Google DeepMind Fereshteh Sadeghi EMAIL Google DeepMind Nicolas Heess EMAIL Google DeepMind Martin Riedmiller EMAIL Google DeepMind |
| Pseudocode | Yes | Algorithm 1 SkillS Training |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. The provided link refers to additional videos. |
| Open Datasets | No | The paper describes experiments in simulated environments (a simulated Sawyer robot arm and a simulated OP3 humanoid robot using the MuJoCo physics simulator), but it does not provide concrete access information (links, DOIs, repositories, or formal citations) for specific datasets used in the experiments. While the environments themselves (MuJoCo, Robotis OP3) are mentioned, there is no explicit information on how to access the *data* collected or used for training specific to this paper's experiments. |
| Dataset Splits | No | The paper describes training agents in simulated environments and collecting data through interaction, rather than using predefined static datasets with explicit training, validation, and test splits. It mentions collecting 'data' and using 'experience replay' and plotting 'mean performance averaged across 5 seeds', but does not specify traditional dataset splits (e.g., percentages or sample counts for train/test/validation sets) for a fixed dataset, as the data is generated dynamically during training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. It describes the simulated robot platforms and mentions a 'distributed learning setup' but lacks concrete hardware information. |
| Software Dependencies | No | The paper mentions software components like the MuJoCo physics simulator and various algorithms (MPO, RHPO, CRR), but it does not provide specific version numbers for these or any other ancillary software dependencies (e.g., Python, TensorFlow, PyTorch versions). Thus, a reproducible description of the software environment with version numbers is not present. |
| Experiment Setup | Yes | The paper provides extensive details on the experimental setup in Appendix C.3, titled 'Training and networks'. This includes specific 'Learner parameters' and 'Scheduler specific parameters' for both manipulation and locomotion tasks. For example, it lists 'Batch size: 256', 'Trajectory length: 10', 'Learning rate: 3e-4', 'Min replay size to sample: 200', 'Samples per insert: 50', 'Replay size: 1e6', 'Target Actor update period: 25', 'Target Critic update period: 100', 'E-step KL constraint ϵ: 0.1', 'M-step KL constraints: 5e-3 (mean) and 1e-5 (covariance)' for manipulation, along with network parameters and scheduler-specific parameters like 'Available skill lengths: K = {n * 10} from 1 to 10' and 'Initial skill lengths biases: 0.95 for k = 100, 0.005 for the other skill lengths'. |
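The hyperparameters quoted above from Appendix C.3 can be collected into a configuration object for clarity. The sketch below is illustrative only: the values are those reported for the manipulation tasks, but the dataclass names and field names are hypothetical and do not come from any released codebase.

```python
# Hypothetical configuration sketch for the manipulation-task hyperparameters
# listed in the paper's Appendix C.3. All class/field names are invented for
# illustration; only the numeric values are taken from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class ManipulationLearnerConfig:
    batch_size: int = 256
    trajectory_length: int = 10
    learning_rate: float = 3e-4
    min_replay_size_to_sample: int = 200
    samples_per_insert: int = 50
    replay_size: int = int(1e6)
    target_actor_update_period: int = 25
    target_critic_update_period: int = 100
    e_step_kl_epsilon: float = 0.1        # E-step KL constraint ε
    m_step_kl_mean: float = 5e-3          # M-step KL constraint (mean)
    m_step_kl_covariance: float = 1e-5    # M-step KL constraint (covariance)


@dataclass(frozen=True)
class SchedulerConfig:
    # Available skill lengths K = {n * 10} for n = 1..10, i.e. 10, 20, ..., 100.
    available_skill_lengths: tuple = tuple(10 * n for n in range(1, 11))
    # Initial skill-length biases: 0.95 on k = 100, 0.005 on each other length.
    initial_bias_k100: float = 0.95
    initial_bias_other: float = 0.005


learner_cfg = ManipulationLearnerConfig()
scheduler_cfg = SchedulerConfig()
```

Grouping the learner and scheduler parameters separately mirrors the paper's own split between "Learner parameters" and "Scheduler specific parameters".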