VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation

Authors: Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience. We conduct a series of experiments to assess the effectiveness of the proposed method from multiple perspectives.
Researcher Affiliation | Academia | 1 Westlake University, 2 Zhejiang University, 3 Xi'an Jiaotong University. Corresponding author: EMAIL
Pseudocode | No | The paper describes methods and processes (e.g., architecture, training paradigm, data collection) but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | The model, data, and code will be publicly available at https://github.com/whichwhichgone/VLAS.
Open Datasets | Yes | We also present two new datasets, SQA and CSI, for further community study. The model, data, and code will be publicly available at https://github.com/whichwhichgone/VLAS.
Dataset Splits | Yes | We perform fine-tuning in Stage I on the train-clean-100 split of the LibriSpeech dataset for 5 epochs... For the CALVIN dataset, which contains 389 textual instructions... To better evaluate our model's generalization capability to novel scenes, we conducted experiments in which the model was trained on the ABC splits and tested on the D split.
Hardware Specification | Yes | All models are trained using 8 A100 GPUs, except for the fine-tuning in Stage I. We empirically found that employing a single GPU for coarse-grained speech alignment yields better performance.
Software Dependencies | No | The paper mentions optimization techniques and precision settings (Adam optimizer, Flash Attention 2, BF16, TF32) but does not provide version numbers for software dependencies such as PyTorch, TensorFlow, or Python itself.
Experiment Setup | Yes | We perform fine-tuning in Stage I on the train-clean-100 split of the LibriSpeech dataset for 5 epochs, using a learning rate of 1e-3 and a batch size of 16. Subsequently, the fine-tuning in Stage II is conducted on our SQA dataset, along with the released LLaVA 665K instruction-following dataset and the train-clean-360 split of LibriSpeech, for 1 epoch using a learning rate of 2e-5 and a batch size of 16. Finally, we fine-tune the model on the CSI robot manipulation dataset for 1 epoch, with a learning rate of 2e-5 and a batch size of 16. Specifically, we combined actions from 5 time steps into a single training label to increase the operating frequency of the robot policy model. The Adam optimizer without weight decay and a cosine learning rate schedule with a 3% warmup ratio are used throughout the experiments.
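The reported schedule (cosine decay with a 3% warmup ratio) and the 5-step action chunking can be sketched as below. This is a minimal illustration of those two settings, not the authors' code; function names and the drop-remainder chunking policy are assumptions.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.03):
    """Learning rate under linear warmup (first 3% of steps) then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def chunk_actions(actions, horizon=5):
    """Group per-timestep actions into fixed-size chunks of `horizon` steps,
    so each chunk serves as a single training label (remainder dropped)."""
    return [actions[i:i + horizon]
            for i in range(0, len(actions) - horizon + 1, horizon)]
```

With `base_lr=2e-5` (the Stage II/III rate), the schedule rises linearly for the first 3% of steps, peaks at 2e-5, and decays to zero by the final step.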