VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation

Authors: Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience. We conduct a series of experiments to assess the effectiveness of the proposed method from multiple perspectives.
Researcher Affiliation | Academia | 1 Westlake University, 2 Zhejiang University, 3 Xi'an Jiaotong University. Corresponding author: EMAIL
Pseudocode | No | The paper describes methods and processes (e.g., architecture, training paradigm, data collection) but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | The model, data, and code will be publicly available at https://github.com/whichwhichgone/VLAS.
Open Datasets | Yes | We also present two new datasets, SQA and CSI, for further community study. The model, data, and code will be publicly available at https://github.com/whichwhichgone/VLAS.
Dataset Splits | Yes | We perform fine-tuning in Stage I on the train-clean-100 split of the LibriSpeech dataset for 5 epochs... For the CALVIN dataset, which contains 389 textual instructions... To better evaluate our model's generalization capability to novel scenes, we conducted experiments in which the model was trained on the ABC splits and tested on the D split.
Hardware Specification | Yes | All models are trained using 8 A100 GPUs, except for the fine-tuning in Stage I. We empirically found that employing a single GPU for coarse-grained speech alignment yields better performance.
Software Dependencies | No | The paper mentions optimization techniques and precision settings (Adam optimizer, Flash Attention 2, BF16, TF32) but does not provide version numbers for software dependencies such as PyTorch, TensorFlow, or Python itself.
Experiment Setup | Yes | We perform fine-tuning in Stage I on the train-clean-100 split of the LibriSpeech dataset for 5 epochs, using a learning rate of 1e-3 and a batch size of 16. Subsequently, the fine-tuning in Stage II is conducted on our SQA dataset, along with the released LLaVA 665K instruction-following dataset and the train-clean-360 split of LibriSpeech, for 1 epoch using a learning rate of 2e-5 and a batch size of 16. Finally, we fine-tune the model on the CSI robot manipulation dataset for 1 epoch, with a learning rate of 2e-5 and a batch size of 16. Specifically, we combined actions from 5 time steps into a single training label to increase the operating frequency of the robot policy model. The Adam optimizer without weight decay and a cosine learning rate schedule with a 3% warmup ratio are used throughout the experiments.
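The reported schedule (cosine decay with a 3% warmup ratio) and the 5-step action chunking can be sketched as below. This is a minimal illustration of those two settings, not the authors' code; function names and the drop-remainder chunking policy are assumptions.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    """Learning rate under linear warmup (first 3% of steps) then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def chunk_actions(actions, horizon=5):
    """Group per-timestep actions into fixed-size chunks of `horizon` steps,
    so each chunk serves as a single training label (remainder dropped)."""
    return [actions[i:i + horizon]
            for i in range(0, len(actions) - horizon + 1, horizon)]
```

With `base_lr=2e-5` (the Stage II/III rate), the schedule rises linearly for the first 3% of steps, peaks at 2e-5, and decays to zero by the final step.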