Language Guided Skill Discovery

Authors: Seungeun Rho, Laura Smith, Tianyu Li, Sergey Levine, Xue Bin Peng, Sehoon Ha

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we evaluate our proposed LGSD by conducting a series of experiments on continuous control environments, encompassing both locomotion and manipulation setups. We aim to answer four questions: (1) Can prompting constrain the skill space into a desired semantic subspace? (2) Can language guidance lead to obtaining more diverse skills compared to unsupervised skill discovery baselines? (3) Can we utilize learned skills for solving downstream tasks? (4) Can we employ learned skills using natural language? Experimental setup We trained our algorithm and baselines using Isaac Gym (Makoviychuk et al., 2021), a high-throughput GPU-based physics simulator. For the language model, we employed gpt-4-turbo-2024-04-09 (Achiam et al., 2023). We set the temperature parameter of the language model to 0 to get a consistent, low-variance measure of dlang. To reduce the number of unique queries, we discretized states, cached the inputs and outputs of these queries, and reused them during training. We provide the exact prompts used for each experiment in Appendix G.
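The state-discretization and query-caching scheme quoted above can be sketched as follows. This is a minimal illustration; the names (`discretize`, `CachedDescriber`), the grid resolution, and the callable interface for the LLM are assumptions for the sketch, not details from the paper.

```python
# Sketch of LLM query caching with discretized states (illustrative names).
# Nearby states map to the same cache key, so each unique discretized
# state is sent to the language model only once during training.

def discretize(state, resolution=0.5):
    """Round each state dimension onto a fixed grid (resolution assumed)."""
    return tuple(round(x / resolution) * resolution for x in state)

class CachedDescriber:
    def __init__(self, llm_fn, prompt):
        self.llm_fn = llm_fn   # e.g., a temperature-0 gpt-4-turbo chat call
        self.prompt = prompt   # the per-environment prompt l_prompt
        self.cache = {}        # discretized state -> cached description

    def describe(self, state):
        key = discretize(state)
        if key not in self.cache:
            self.cache[key] = self.llm_fn(self.prompt, key)
        return self.cache[key]
```

With temperature 0 the cached answer matches what a repeated query would return, so reuse does not change the measured dlang.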
Researcher Affiliation Academia Seungeun Rho Georgia Institute of Technology EMAIL Laura Smith University of California, Berkeley EMAIL Tianyu Li Georgia Institute of Technology EMAIL Sergey Levine University of California, Berkeley EMAIL Xue Bin Peng Simon Fraser University EMAIL Sehoon Ha Georgia Institute of Technology EMAIL
Pseudocode Yes E FULL ALGORITHM OF LGSD
Algorithm 1 Language Guided Skill Discovery
1: Initialize skill-conditioned policy π(a|s, z), representation function ϕ(s), prompt l_prompt, LLM function LLM, language embedding model f_embed, skill inference network ψ, Lagrange multiplier λ, and data buffer D
2: for i = 1 to # of epochs do
3:   for j = 1 to # of episodes per epoch do
4:     Sample skill z ∼ N(0, I)
5:     while episode not terminated do
6:       Sample action a ∼ π(a|s, z)
7:       Execute a and receive s′
8:       Query LLM(·|s, l_prompt) to produce l_desc(s) and l_desc(s′)
9:       Compute reward r = (ϕ(s′) − ϕ(s))ᵀ z
10:      Compute d_lang(s, s′) using eq. (2)
11:      Compute embedding vector e_s = f_embed(l_desc(s))
12:      Add {s, a, r, s′, d_lang(s, s′), e_s, z} to buffer D
13:    end while
14:  end for
15:  for {s, a, r, s′, d_lang(s, s′), e_s, z} in D do
16:    Update ϕ to maximize E_{(s,z,s′)∼D}[(ϕ(s′) − ϕ(s))ᵀ z + λ min(ϵ, d_lang(s, s′) − ‖ϕ(s) − ϕ(s′)‖₂²)]
17:    Update λ to minimize E_{(s,z,s′)∼D}[λ min(ϵ, d_lang(s, s′) − ‖ϕ(s) − ϕ(s′)‖₂²)]
18:    Update π using PPO with reward r
19:    Update ψ to minimize the mean squared error between ψ(e_s) and z
20:  end for
21: end for
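The two per-transition expressions in Algorithm 1 (the intrinsic reward on line 9 and the constrained representation objective on line 16) can be written out numerically. This is a sketch of those expressions only; the names `phi_s`, `phi_s_next` stand in for ϕ(s) and ϕ(s′), and the default ϵ is an assumption.

```python
import numpy as np

def intrinsic_reward(phi_s, phi_s_next, z):
    """r = (phi(s') - phi(s))^T z  (Algorithm 1, line 9)."""
    return float(np.dot(phi_s_next - phi_s, z))

def constrained_objective(phi_s, phi_s_next, z, d_lang, lam, eps=1e-3):
    """Objective maximized with respect to phi (Algorithm 1, line 16):
    (phi(s') - phi(s))^T z + lam * min(eps, d_lang - ||phi(s) - phi(s')||^2).
    The min(eps, .) clamp keeps the Lagrangian term from dominating once
    the language-distance constraint is satisfied by a margin of eps."""
    slack = d_lang - float(np.sum((phi_s - phi_s_next) ** 2))
    return intrinsic_reward(phi_s, phi_s_next, z) + lam * min(eps, slack)
```

In training these would be evaluated on minibatches from the buffer D and differentiated through ϕ; here they are plain scalar functions for clarity.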
Open Source Code No Reproducibility Statement We have made significant efforts to ensure the reproducibility of our work across various aspects. A comprehensive pseudo-code of our algorithm is available in Appendix E.
Open Datasets No The paper uses environments like "Ant" and "Franka Cube" within the Isaac Gym simulator, which are standard for reinforcement learning. However, it does not explicitly state that any *dataset generated* from these experiments is publicly available, nor does it provide specific access information (links, DOIs, etc.) for any external datasets used, beyond referencing the simulator itself.
Dataset Splits No The paper describes experiments in reinforcement learning environments (Ant, Franka Cube) where agents interact with a simulated environment. This typically does not involve predefined training/test/validation splits of a static dataset in the traditional supervised learning sense. No explicit information on such splits is provided.
Hardware Specification No Experimental setup We trained our algorithm and baselines using Isaac Gym (Makoviychuk et al., 2021), a high-throughput GPU-based physics simulator.
Software Dependencies No For the language model, we employed gpt-4-turbo-2024-04-09 (Achiam et al., 2023). We used PPO (Schulman et al., 2017) as our primary RL algorithm. To measure the difference between two language descriptions, we leverage a pre-trained natural language embedding model, Sentence-Transformer (Reimers & Gurevych, 2019). Table 3 lists optimizers and activation functions such as Adam (Kingma & Ba, 2014), ELU (Clevert et al., 2015), and ReLU. While specific models like 'gpt-4-turbo-2024-04-09' are mentioned, the paper does not list multiple key *software libraries* (e.g., Python, PyTorch) with their specific version numbers.
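The language-distance computation described here can be sketched as below. The exact form of eq. (2) is not reproduced in this report, so the squared-Euclidean distance is an assumption; `embed` stands in for the embedding model f_embed (in practice something like `SentenceTransformer.encode`).

```python
import numpy as np

# Sketch of d_lang between two state descriptions, assuming it is a
# squared Euclidean distance between their embedding vectors (the
# paper defines the actual form in its eq. (2)).

def d_lang(desc_a, desc_b, embed):
    """Distance between two natural-language state descriptions."""
    e_a, e_b = embed(desc_a), embed(desc_b)
    return float(np.sum((e_a - e_b) ** 2))
```

Any embedding function mapping strings to fixed-size vectors can be plugged in for `embed`; with a Sentence-Transformer, semantically similar descriptions yield small distances.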
Experiment Setup Yes Table 3: Hyperparameters of LGSD
Name | Value
Learning rate | 0.0001
Optimizer | Adam (Kingma & Ba, 2014)
Minibatch size | 32768 (Ant), 16384 (Franka)
Horizon length | 32
PPO clip threshold | 0.2
PPO number of epochs | 5
GAE λ (Schulman et al., 2015) | 0.95
Discount factor γ | 0.99
Entropy coefficient | 0.0001
Initial Lagrange coefficient λ | 300
Dim. of skill z | 2 (Ant), 3 (Franka)
Policy network π | MLP with [256, 256, 128]
Activation of π | ELU (Clevert et al., 2015)
Representation function ϕ | MLP with [256, 256, 128]
Activation of ϕ | ReLU
Skill inference network ψ | MLP with [256, 256, 128]
Activation of ψ | ReLU
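For reference, the Ant-column settings of Table 3 can be transcribed into a single config dict. The key names are illustrative; only the values come from the table.

```python
# Transcription of Table 3 (Ant settings); key names are illustrative.
LGSD_ANT_CONFIG = {
    "learning_rate": 1e-4,
    "optimizer": "Adam",
    "minibatch_size": 32768,           # 16384 for Franka
    "horizon_length": 32,
    "ppo_clip": 0.2,
    "ppo_epochs": 5,
    "gae_lambda": 0.95,
    "gamma": 0.99,
    "entropy_coef": 1e-4,
    "initial_lagrange_lambda": 300,
    "skill_dim": 2,                    # 3 for Franka
    "policy_hidden": [256, 256, 128],  # ELU activations
    "phi_hidden": [256, 256, 128],     # ReLU activations
    "psi_hidden": [256, 256, 128],     # ReLU activations
}
```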