STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization

Authors: Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, Liqiang Nie

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | 'Extensive experiments demonstrate the superiority of STAR on both LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.' Comprehensive experimental validation across multiple benchmarks and real-world tasks, demonstrating substantial improvements in both skill learning efficiency and task performance. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); (2) Huawei Noah's Ark Lab. Correspondence to: Rui Shao <EMAIL>, Xiang Deng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Rotation-augmented Residual Skill Quantization (RaRSQ). |
| Open Source Code | Yes | STAR.github.io |
| Open Datasets | Yes | 'We evaluate STAR on two comprehensive manipulation benchmarks: LIBERO (130 tasks across five suites) and Meta-World MT50 (50 distinct manipulation tasks), plus two real-world long-horizon tasks.' For LIBERO, the authors utilize 50 expert demonstrations per task from the author-provided dataset; for Meta-World, they collect 100 demonstrations per task using the scripted policies provided in the official Meta-World codebase. |
| Dataset Splits | No | The paper reports '50 expert demonstrations per task' for LIBERO and '100 demonstrations per task' for Meta-World MT50 for training, and evaluates performance using the Success Rate (SR) metric 'calculated over 50 episodes per task'. For the real-world tasks, 45 demonstrations were collected through human teleoperation and 10 trials were conducted per task. While these are counts of demonstrations and evaluation episodes/trials, the paper does not specify explicit dataset splits (e.g., train/validation/test percentages) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | The models are implemented in PyTorch and trained on a server with 8 Nvidia RTX L40S 48GB GPUs, with all models easily fitting on a single GPU. |
| Software Dependencies | No | The paper states that the models are implemented in PyTorch but does not specify a version number for PyTorch or any other key software library. |
| Experiment Setup | Yes | For the Rotation-augmented Residual Skill Quantization (RaRSQ) module in the STAR framework, the authors adopt a single-layer MLP as the encoder with a hidden dimension of 128. The decoder is a transformer with 4 attention heads, 4 decoder layers, and a hidden dimension of 128. They set the codebook size K = 16 and the quantization depth D = 2, with each skill abstraction spanning 8 timesteps. For the Causal Skill Transformer (CST), they use a ResNet-18 model trained from scratch as the visual encoder and a pre-trained CLIP-base model as the language encoder. The proprioception encoder is a single-layer MLP with a hidden dimension of 128. The transformer decoder consists of 6 layers, 6 attention heads, and an embedding dimension of 384. The start token dimension is 16, the beam size is 5, and the sampling temperature is 1.0. The observation window is fixed at 10 timesteps. The entire framework is trained with the AdamW optimizer under a cosine decay learning-rate schedule. The RaRSQ module uses a batch size of 1024 and a learning rate of 5.5e-5 for 100 epochs; the CST module uses a batch size of 512 and a learning rate of 8e-4 for 500 epochs. Both modules use a 10-epoch warmup and a weight decay of 1e-6. The loss weights for the first codebook prediction, second codebook prediction, and offset head prediction are 2.0, 1.0, and 20.0, respectively. |