Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Authors: Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on both simulation and real-world benchmarks. On two widely adopted simulation benchmarks, LIBERO-LONG (Liu et al., 2024) (10 tasks) and CALVIN ABC-D (Mees et al., 2022) (34 tasks), our method demonstrates a 10.4% improvement in success rate and a 0.75 increase in average task completion length compared to state-of-the-art baselines.
Researcher Affiliation | Collaboration | 1 Shanghai AI Laboratory; 2 CFCS, School of CS, Peking University; 3 National Engineering Research Center for Software Engineering, Peking University; 4 School of Software & Microelectronics, Peking University; 5 Key Laboratory of High Confidence Software Technologies (PKU), Ministry of Education; 6 The Chinese University of Hong Kong
Pseudocode | No | The paper describes its methodology using prose and architectural diagrams (Figure 2, Figure A-1) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are publicly available at https://github.com/OpenRobotLab/Seer/
Open Datasets | Yes | It is initially pretrained on large-scale robotic datasets, such as DROID, and can be adapted to real-world scenarios with a little fine-tuning data. We conduct experiments on two simulation benchmarks, LIBERO-LONG (Liu et al., 2024) and CALVIN ABC-D (Mees et al., 2022).
Dataset Splits | Yes | For pre-training, we utilize the official robot play data with no language instructions, while the remaining data with full annotations is used for fine-tuning. Evaluation is conducted in Environment D, which differs visually from Environments A, B, and C where the training data was collected. In the fine-tuning phase, we capture RGB images, robot states, and actions at 15 Hz, collecting 100 demonstrations per task. The results, shown in Figure 3, demonstrate that our method consistently enhances policy performance across varying data sizes. Notably, under data-scarce conditions with only 10% of the training data, the pre-trained policy achieves a 187% relative improvement in success rate on LIBERO-LONG and a 150% relative improvement in average task length on CALVIN ABC-D compared to training from scratch.
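The "relative improvement" figures quoted above follow the standard definition (new − old) / old. A minimal sketch of that arithmetic, using hypothetical success rates chosen only to illustrate the computation (they are not taken from the paper):

```python
def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Hypothetical illustration: a from-scratch policy succeeding 16.0% of the
# time vs. a pre-trained policy at 45.9% yields roughly a 187% relative gain.
print(round(relative_improvement(45.9, 16.0)))  # prints 187
```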
Hardware Specification | Yes | For all simulation results, we use eight 4090 GPUs to pre-train and fine-tune.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) are mentioned in the paper.
Experiment Setup | Yes | Table A-I: Training hyperparameters (pre-training → fine-tuning): Batch Size 640 (LIBERO & CALVIN) / 2048 (Real) → 512; Learning Rate 1e-4 → 1e-3; Optimizer AdamW → AdamW; Learning Rate Schedule cosine decay → cosine decay; Training Epochs 30 (LIBERO & Real) / 20 (CALVIN) → 40 (LIBERO & Real) / 20 (CALVIN); History Length 7 (LIBERO & Real) / 10 (CALVIN) → 7 (LIBERO & Real) / 10 (CALVIN); Action Chunk Length 3 → 3.
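The hyperparameters in Table A-I amount to a small training configuration. A minimal sketch of the fine-tuning values plus a standard cosine-decay learning-rate schedule; the exact decay variant, minimum LR, and per-epoch (rather than per-step) granularity are assumptions, since the paper only names "cosine decay":

```python
import math

def cosine_decay_lr(epoch, total_epochs, peak_lr, min_lr=0.0):
    """Standard cosine decay from peak_lr at epoch 0 down to min_lr."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Fine-tuning values as reported in Table A-I (dict keys are illustrative names).
FINETUNE_HPARAMS = {
    "batch_size": 512,
    "learning_rate": 1e-3,
    "optimizer": "AdamW",
    "epochs": {"libero": 40, "real": 40, "calvin": 20},
    "history_length": {"libero": 7, "real": 7, "calvin": 10},
    "action_chunk_length": 3,
}

print(cosine_decay_lr(0, 40, FINETUNE_HPARAMS["learning_rate"]))   # peak LR at epoch 0
print(cosine_decay_lr(40, 40, FINETUNE_HPARAMS["learning_rate"]))  # decayed to ~0 at the end
```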