Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Authors: Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1–5 demonstrations, and effectively handles complex, dexterous tasks.
Researcher Affiliation | Academia | Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University
Pseudocode | No | The paper describes the model architecture and diffusion formulation using mathematical equations and block diagrams, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We refer to the project page for the code and videos. We have fully open-sourced all our code, model weights, and fine-tuning datasets. We refer to the project page for more information.
Open Datasets | Yes | Specifically, our collection of pre-training datasets includes 46 datasets of various robots, with a total size of 1M+ trajectories and 21TB. More details and preprocessing are deferred to App. D. Our fine-tuning dataset is created using the Mobile ALOHA robot (Fu et al., 2024), including 300+ tasks, 6K+ trajectories, and 3M+ frames. It is also one of the largest open-source multi-task bimanual robot datasets to date.
Dataset Splits | Yes | Wash Cup: 133 demos for seen cups combined and 0 demos for unseen cups; Pour Water: 350 demos for seen rooms combined and 0 demos for unseen rooms; Pour Water-L-1/3 & Pour Water-R-2/3: 18 demos for the water level of little, 19 demos for half, and 19 demos for full; Handover: 5 demos; Fold Shorts: 1 demo; Robot Dog: 68 demos. We trained ACT with 90% of the 6K fine-tuning episodes for 8000 epochs (about 8 days in total), while the remaining 10% is treated as the validation set.
Hardware Specification | Yes | The model is pre-trained on 48 H100 80GB GPUs for a month, giving a total of 1M training iteration steps. It can reduce the diffusion steps required to sample an action chunk from 100 steps to 5 steps, achieving an action chunk inference frequency of 6 Hz (action chunks per second) and an average action inference frequency of 381 Hz (actions per second) on the target robot's onboard RTX 4090 24GB GPU. We provide a detailed overview of the hardware configuration of our target dual-arm robot. Our model is deployed and evaluated on the Cobot Mobile ALOHA, a robot using the Mobile ALOHA system design (Fu et al., 2024) and manufactured by agilex.ai. The key features of the robot are illustrated in Fig. 7. It is equipped with two wrist cameras, a front camera, a laptop, and an onboard battery. The robot's technical specifications are listed in Table 6.
Software Dependencies | No | The paper mentions using "PyTorch (Paszke et al., 2019)", "DeepSpeed (Rasley et al., 2020)", and "TensorFlow Dataset (TFD)", but it does not specify the version numbers for these software components. It also mentions using "DPM-Solver++ (Lu et al., 2022)" for inference and refers to official implementations for baselines without specifying versions of the underlying software dependencies.
Experiment Setup | Yes | We scale the size of RDT up to 1.2B parameters, establishing it as the currently largest diffusion-based robotic foundation model. The model is pre-trained on 48 H100 80GB GPUs for a month, giving a total of 1M training iteration steps. It takes three days to fine-tune this model using the same GPUs for 130K steps. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a constant learning rate scheduler and hyper-parameters in Table 10 in the pre-training and fine-tuning stages. During the training stage, we use the DDPM scheduler with a glide cosine schedule (i.e., squaredcos_cap_v2) and a step number of 1000. During the sampling stage, we utilize DPM-Solver++ (Lu et al., 2022) with a glide cosine schedule and a sampling step number of 5. Table 10: Hyper-parameters for both pre-training and fine-tuning RDT (Batch Size 32×48, Learning Rate 1e-4, Mixed Precision bf16, Warm-Up Steps 500, β1 0.9, β2 0.999, Weight Decay 1e-2).
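For readers unfamiliar with the "glide cosine schedule (squaredcos_cap_v2)" named in the Experiment Setup row: this is the squared-cosine beta schedule from Nichol & Dhariwal (2021), as named in the HuggingFace diffusers library. Below is a minimal, dependency-free sketch of how the 1000 training-step betas are derived; the function name is illustrative, not from the paper.

```python
import math

def squaredcos_cap_v2_betas(num_train_timesteps=1000, max_beta=0.999):
    """Squared-cosine ("glide") noise schedule, per Nichol & Dhariwal (2021).

    alpha_bar(t) = cos((t + 0.008) / 1.008 * pi / 2) ** 2 gives the cumulative
    signal fraction at normalized time t in [0, 1]; each beta is the per-step
    noise increment implied by consecutive alpha_bar values, capped at max_beta.
    """
    def alpha_bar(t):
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_train_timesteps):
        t1 = i / num_train_timesteps
        t2 = (i + 1) / num_train_timesteps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

betas = squaredcos_cap_v2_betas()  # 1000 steps, as in the paper's DDPM setup
```

At sampling time the paper swaps this DDPM process for DPM-Solver++, which integrates the same schedule in only 5 solver steps instead of 1000, yielding the reported 6 Hz action-chunk inference rate.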