DiffusionVLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression
Authors: Junjie Wen, Yichen Zhu, Minjie Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, Feifei Feng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments using multiple real robots to validate the effectiveness of DiVLA. Our tests include a challenging factory sorting task, where DiVLA successfully categorizes objects, including those not seen during training. The reasoning injection module enhances interpretability, enabling explicit failure diagnosis by visualizing the model's decision process. Additionally, we test DiVLA on a zero-shot bin-picking task, achieving 63.7% accuracy on 102 previously unseen objects. |
| Researcher Affiliation | Collaboration | 1Midea Group, Shanghai, China 2East China Normal University, Shanghai, China 3Shanghai University, Shanghai, China. Correspondence to: Yichen Zhu <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in regular paragraph text, such as in Section 3, 'Methodology', and Section 3.1, 'Architecture'. There are no figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | diffusion-vla.github.io (under the title) and "Figure 1: Our proposed DiffusionVLA model unifies autoregressive and diffusion modeling to enable self-reasoning and robot policy learning. This approach generalizes effectively to visual changes, supports zero-shot bin picking, adapts to new robot morphologies, performs visual question-answering, and generates actions with high speed. Open-Sourced Data" (Figure 1 caption). |
| Open Datasets | Yes | Pretraining Data. We consider the OXE (O'Neill et al., 2023) and Droid (Khazatsky et al., 2024) datasets for pretraining. We use Droid data to pre-train DiVLA-2B and DiVLA-7B. Because larger models typically need more data for training, we use OXE and Droid together for pre-training DiVLA-72B. |
| Dataset Splits | No | Data for finetuning. We explore four experimental settings: factory sorting, bin picking, multi-task learning, and table bussing. The first three settings are conducted with the Franka robot, while the table bussing task utilizes the bimanual Agile X robot. Our dataset includes 500 trajectories for the factory sorting task and 580 trajectories for multi-task learning. The bin picking task is designed as a zero-shot task, so no training data was collected for it. For the table bussing task, we gathered 400 trajectories, where objects are randomly placed on the table, often overlapping with each other. We evaluate each method with a total of 77 trials for multi-task learning and 45 trials for visual generalization. However, it doesn't provide specific train/validation/test splits, only total data used for finetuning and evaluation trials. |
| Hardware Specification | Yes | DiVLA is data-efficient and fast at inference; our smallest DiVLA-2B runs at 82 Hz on a single A6000 GPU. Similarly, DiVLA-7B achieves a control frequency of 42 Hz, which is 8 times faster than OpenVLA at the same model size. |
| Software Dependencies | No | The paper mentions several software components and frameworks, including SigLIP, Qwen2-VL, Diffusion Policy, LoRA, DistilBERT, and the vLLM framework, but it does not specify explicit version numbers for these software dependencies, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | Implementation details and pretrained data. The model is pre-trained on the Droid (Khazatsky et al., 2024) dataset. We then finetune our model on evaluation tasks, similar to the setting of π0 (Black et al., 2024). We use LoRA (Hu et al., 2021) to fine-tune the VLM models. We use 2e-5 as a fixed learning rate to train the model, similar to OpenVLA. The visual encoder and VLM are frozen, and LoRA is applied to the VLM for fine-tuning. Training is conducted over 20 epochs, as we find that OpenVLA typically requires longer training times for convergence. |
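The Experiment Setup row notes that the visual encoder and VLM are kept frozen while LoRA (Hu et al., 2021) adapters are trained. As a minimal sketch of that idea, the snippet below implements the generic LoRA update rule W_eff = W + (α/r)·B·A with NumPy; all shapes, the rank, and the function name are hypothetical illustrations, not DiVLA's actual implementation.

```python
import numpy as np

# Generic LoRA sketch (Hu et al., 2021): the frozen pretrained weight W
# is never updated; only the low-rank factors A and B are trainable.
# Dimensions and rank here are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 4, 8

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; gradients would flow
    # only into A and B during fine-tuning, leaving W untouched.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.standard_normal((2, d_in))
# With B zero-initialized, the adapter branch is a no-op at the start of
# fine-tuning, so the adapted model reproduces the frozen model exactly:
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero initialization of B is the standard LoRA choice: it guarantees the fine-tuned model starts from the pretrained behavior, which matters when, as here, the backbone VLM stays frozen throughout training.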