ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

Authors: Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, Donglin Wang

ICML 2025

Reproducibility assessment: each entry below gives the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. "Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks."
Researcher Affiliation: Academia. "(1) Zhejiang University, Hangzhou, China; (2) Westlake University, Hangzhou, China. Correspondence to: Donglin Wang <EMAIL>."
Pseudocode: Yes. "Algorithm 1 ReinboT: Test-time Execution"
Open Source Code: No. The paper does not contain an unambiguous statement of code release or a link to a code repository for the described methodology.
Open Datasets: Yes. "We first construct a mixed-quality dataset based on CALVIN (Mees et al., 2022)... We initialize its weights with the pre-trained model weights, which are derived from the generated video pre-training on the Ego4d (Grauman et al., 2022) dataset consistent with GR-1."
Dataset Splits: Yes. "This dataset contains a small amount of data with language instructions in CALVIN ABC (about 50 trajectories per task) and a large amount of autonomous data without language instructions. In addition to the original data collected by human teleoperation without language instructions in CALVIN (more than 20,000 trajectories), the autonomous data also contains failure data generated by the interaction between the trained VLA behavioral policy RoboFlamingo (Li et al., b) and the environment CALVIN D (more than 10,000 trajectories). We study training on this mixed-quality data, then fine-tune on a small amount of data with language instructions, and finally test the generalization performance on CALVIN D."
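The split described above can be summarized in a short composition sketch. This is a hypothetical data structure for illustration, not the paper's code; the field names are invented, and the counts are the approximate figures and lower bounds quoted in the passage:

```python
# Hypothetical summary of the CALVIN mixed-quality dataset composition
# described in the quoted passage; all field names are illustrative.
mixed_quality_data = {
    "language_annotated": {           # CALVIN ABC, used for fine-tuning
        "trajectories_per_task": 50,  # approximate, per the paper
    },
    "teleop_unlabeled": {             # human teleoperation, no language
        "min_trajectories": 20_000,
    },
    "autonomous_failures": {          # RoboFlamingo rollouts in CALVIN D
        "min_trajectories": 10_000,
    },
}

# The unlabeled autonomous data dominates the mix over the small
# language-annotated portion.
unlabeled_total = (
    mixed_quality_data["teleop_unlabeled"]["min_trajectories"]
    + mixed_quality_data["autonomous_failures"]["min_trajectories"]
)
print(unlabeled_total)  # 30000
```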
Hardware Specification: No. The paper mentions conducting real-world tasks on a UR5 robotic arm, which is a robot platform, but does not provide specific details about the computational hardware (e.g., GPU or CPU models) used for training or running the experiments.
Software Dependencies: No. The paper mentions using the "GPT2 (Radford et al., 2019) structure" and "Optimizer Adam (Kingma, 2014)", but does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup: Yes. "Table 3. Network hyperparameters configuration. Table 4. Training hyperparameters configuration." Reported training hyperparameters:
  Return-to-go loss weight λ: 0.001
  Expectile regression parameter m: 0.9
  Gradient clip: 1.0
  Epochs: 50
  Warm-up epochs: 1
  Batch size: 32
  Learning rate: 0.001
  Weight decay: 0.01
  Dropout rate: 0.1
  Reward weights {w_i}, i = 1..4: 0.1, 0.1, 0.01, 0.1
  Optimizer: Adam (β1 = 0.9, β2 = 0.999)
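The expectile regression parameter m = 0.9 above points to an asymmetric (IQL-style) regression loss for return estimation. The paper's exact loss is not quoted in this report, so the following is a minimal sketch of the standard expectile loss only, with m as in the table; the function name and signature are illustrative:

```python
def expectile_loss(pred, target, m=0.9):
    """Standard scalar expectile regression loss.

    Errors where the target exceeds the prediction are weighted by m,
    and errors where the prediction exceeds the target by (1 - m), so
    with m = 0.9 the fit tracks an upper expectile of the targets.
    """
    u = target - pred
    weight = m if u > 0 else (1.0 - m)
    return weight * u * u


# With m = 0.9, under-predicting by 1 costs 0.9, while over-predicting
# by 1 costs only about 0.1.
print(expectile_loss(0.0, 1.0))  # 0.9
print(expectile_loss(1.0, 0.0))  # ≈ 0.1
```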