DiffusionVLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression
Authors: Junjie Wen, Yichen Zhu, Minjie Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, Feifei Feng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments using multiple real robots to validate the effectiveness of DiVLA. Our tests include a challenging factory sorting task, where DiVLA successfully categorizes objects, including those not seen during training. The reasoning injection module enhances interpretability, enabling explicit failure diagnosis by visualizing the model's decision process. Additionally, we test DiVLA on a zero-shot bin-picking task, achieving 63.7% accuracy on 102 previously unseen objects. |
| Researcher Affiliation | Collaboration | 1Midea Group, Shanghai, China 2East China Normal University, Shanghai, China 3Shanghai University, Shanghai, China. Correspondence to: Yichen Zhu <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in regular paragraph text, such as in Section 3, 'Methodology', and Section 3.1, 'Architecture'. There are no figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | diffusion-vla.github.io (under the title) and "Figure 1: Our proposed DiffusionVLA model unifies autoregressive and diffusion modeling to enable self-reasoning and robot policy learning. This approach generalizes effectively to visual changes, supports zero-shot bin picking, adapts to new robot morphologies, performs visual question-answering, and generates actions with high speed. Open-Sourced Data" (Figure 1 caption). |
| Open Datasets | Yes | Pretraining Data. We consider the OXE (O'Neill et al., 2023) and Droid (Khazatsky et al., 2024) datasets for pretraining. We use Droid data to pre-train DiVLA-2B and DiVLA-7B. Because larger models typically need more data for training, we use OXE and Droid together for pre-training DiVLA-72B. |
| Dataset Splits | No | Data for finetuning. We explore four experimental settings: factory sorting, bin picking, multi-task learning, and table bussing. The first three settings are conducted with the Franka robot, while the table bussing task utilizes the bimanual Agile X robot. Our dataset includes 500 trajectories for the factory sorting task and 580 trajectories for multi-task learning. The bin picking task is designed as a zero-shot task, so no training data was collected for it. For the table bussing task, we gathered 400 trajectories, where objects are randomly placed on the table, often overlapping with each other. We evaluate each method with a total of 77 trials for multi-task learning and 45 trials for visual generalization. However, it doesn't provide specific train/validation/test splits, only total data used for finetuning and evaluation trials. |
| Hardware Specification | Yes | DiVLA is data-efficient and fast at inference; our smallest DiVLA-2B runs at 82 Hz on a single A6000 GPU. Similarly, DiVLA-7B achieves a control frequency of 42 Hz, which is 8 times faster than OpenVLA at the same model size. |
| Software Dependencies | No | The paper mentions several software components and frameworks, including SigLIP, Qwen2-VL, Diffusion Policy, LoRA, DistilBERT, and the vLLM framework, but it does not specify explicit version numbers for these software dependencies, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | Implementation details and pretrained data. The model is pre-trained on the Droid (Khazatsky et al., 2024) dataset. We then finetune our model on evaluation tasks, similar to the setting of π0 (Black et al., 2024). We use LoRA (Hu et al., 2021) to fine-tune the VLM models. We use 2e-5 as a fixed learning rate to train the model, similar to OpenVLA. The visual encoder and VLM are frozen, and LoRA is applied to the VLM for fine-tuning. Training is conducted over 20 epochs, as we find that OpenVLA typically requires longer training times for convergence. |
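The Experiment Setup row notes that the visual encoder and VLM are kept frozen while LoRA (Hu et al., 2021) adapters are trained. As a minimal sketch of that idea, the snippet below implements the generic LoRA update rule W_eff = W + (α/r)·B·A with NumPy; all shapes, the rank, and the function name are hypothetical illustrations, not DiVLA's actual implementation.

```python
import numpy as np

# Generic LoRA sketch (Hu et al., 2021): the frozen pretrained weight W
# is never updated; only the low-rank factors A and B are trainable.
# Dimensions and rank here are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 4, 8

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; gradients would flow
    # only into A and B during fine-tuning, leaving W untouched.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.standard_normal((2, d_in))
# With B zero-initialized, the adapter branch is a no-op at the start of
# fine-tuning, so the adapted model reproduces the frozen model exactly:
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero initialization of B is the standard LoRA choice: it guarantees the fine-tuned model starts from the pretrained behavior, which matters when, as here, the backbone VLM stays frozen throughout training.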