ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning
Authors: Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, Donglin Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks. |
| Researcher Affiliation | Academia | 1Zhejiang University, Hangzhou, China 2Westlake University, Hangzhou, China. Correspondence to: Donglin Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 ReinboT: Test-time Execution |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We first construct a mixed-quality dataset based on CALVIN (Mees et al., 2022)... We initialize its weights with the pre-trained model weights, which are derived from the generated video pre-training on the Ego4d (Grauman et al., 2022) dataset consistent with GR-1. |
| Dataset Splits | Yes | This dataset contains a small amount of data with language instructions in CALVIN ABC (about 50 trajectories per task) and a large amount of autonomous data without language instructions. In addition to the original data collected by human teleoperation without language instructions in CALVIN (more than 20,000 trajectories), the autonomous data also contains failure data generated by the interaction between the trained VLA behavioral policy RoboFlamingo (Li et al., b) and the environment CALVIN D (more than 10,000 trajectories). We study training on this mixed-quality data, then fine-tune on a small amount of data with language instructions, and finally test the generalization performance on CALVIN D. |
| Hardware Specification | No | The paper mentions conducting real-world tasks on a 'robotic arm UR5', which is a robot platform, but does not provide specific details about the computational hardware (e.g., GPU, CPU models) used for training or running the experiments. |
| Software Dependencies | No | The paper mentions using 'GPT2 (Radford et al., 2019) structure' and 'Optimizer Adam (Kingma, 2014)', but does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 3 (network hyperparameters) and Table 4 (training hyperparameters) list: Return-to-Go loss weight λ = 0.001; expectile regression parameter m = 0.9; gradient clip 1.0; epochs 50; warm-up epochs 1; batch size 32; learning rate 0.001; weight decay 0.01; dropout rate 0.1; reward weights w_i (i = 1..4) = 0.1, 0.1, 0.01, 0.1; optimizer Adam (β1 = 0.9, β2 = 0.999). |
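The reported training hyperparameters can be collected into a single configuration, sketched below in Python. The variable names are illustrative assumptions, not the authors' code, and the `expectile_loss` helper shows only the standard asymmetric expectile weighting that the parameter m = 0.9 conventionally refers to (e.g. in IQL-style value learning), not necessarily ReinboT's exact formulation.

```python
# Training hyperparameters as reported in Table 4 of the paper.
# Names below are illustrative assumptions, not the authors' code.
train_config = {
    "rtg_loss_weight": 0.001,        # Return-to-Go loss weight λ
    "expectile_m": 0.9,              # expectile regression parameter m
    "grad_clip": 1.0,
    "epochs": 50,
    "warmup_epochs": 1,
    "batch_size": 32,
    "learning_rate": 0.001,
    "weight_decay": 0.01,
    "dropout": 0.1,
    "reward_weights": [0.1, 0.1, 0.01, 0.1],  # w_i for i = 1..4
    "adam_betas": (0.9, 0.999),      # Adam β1, β2
}

def expectile_loss(u: float, m: float = 0.9) -> float:
    """Standard asymmetric expectile loss |m - 1(u < 0)| * u^2.

    With m = 0.9, positive errors (u >= 0) are weighted by 0.9 and
    negative errors by 0.1, biasing the estimate toward an upper
    expectile of the target distribution.
    """
    weight = m if u >= 0 else (1.0 - m)
    return weight * u * u
```

With m = 0.9, `expectile_loss(1.0)` yields 0.9 while `expectile_loss(-1.0)` yields roughly 0.1, illustrating the asymmetry that makes the value estimate optimistic.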