Stem-OB: Generalizable Visual Imitation Learning with Stem-Like Convergent Observation through Diffusion Inversion
Authors: Kaizhe Hu, Zihang Rui, Yao He, Yuyao Liu, Pu Hua, Huazhe Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical study demonstrates the effectiveness of the approach across a variety of simulated and real-world tasks and a range of different perturbations. Stem-OB proves particularly effective in real-world tasks where appearance and lighting changes hamper the other baselines, yielding an overall success-rate improvement of 22.2%. |
| Researcher Affiliation | Academia | Kaizhe Hu (1,2,3), Zihang Rui (1), Yao He (4), Yuyao Liu (1), Pu Hua (1,2,3), Huazhe Xu (1,2,3); 1 Tsinghua University, 2 Shanghai Qi Zhi Institute, 3 Shanghai AI Lab, 4 Stanford University. EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods in prose and equations, such as in Section 4 and Section 5.3, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Reproducibility: The main algorithm of our method is as simple as applying the open-sourced DDPM inversion method to the dataset before training. We’ve provided the code for our method in the supplementary material. |
| Open Datasets | Yes | Our simulation experiments consider different tasks within two frameworks: a photorealistic simulation platform, SAPIEN 3 (Xiang et al., 2020), and a less realistic framework, MimicGen (Mandlekar et al., 2023). We leverage the ManiSkill 3 dataset (Gu et al., 2023; Tao et al., 2024), collected on SAPIEN 3, for benchmarking. |
| Dataset Splits | Yes | The object locations in the training set are randomly initialized within a specified area, and 100 demonstrations are collected per task. For testing, nine predefined target positions are used. [...] 50 episodes are tested for each setting. [...] For evaluation, we employ a single image as the input to the policy, using 500 samples out of a total of 1000 demos for training. [...] 300 episodes are tested for each setting of all the tasks. |
| Hardware Specification | No | The paper does not provide specific details about the computing hardware (e.g., GPU models, CPU models, memory) used for training or inference, only mentioning the robot arm and cameras for real-world experiments. |
| Software Dependencies | No | The paper mentions using Diffusion Policy (DP) and Stable Diffusion models but does not provide specific version numbers for software dependencies like PyTorch, TensorFlow, CUDA, or other libraries. |
| Experiment Setup | Yes | The hyperparameters for Diffusion Policy, shared across all experiments, are listed in Tab. 6, which provides specific values such as 'batch size 128', 'num epochs 1500', 'learning rate initial 0.0001', and architecture details like 'unet down dims [256, 512, 1024]'. |
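The method quoted above amounts to noising each observation to an intermediate diffusion step before policy training, so that fine-grained appearance differences wash out while coarse scene structure survives. Below is a minimal NumPy sketch of that intuition; it is an illustrative toy, not the authors' DDPM-inversion code, and the schedule constants and array shapes are assumptions. Under a shared noise map, two appearance-perturbed views of the same scene move strictly closer as the inversion step `t` grows.

```python
import numpy as np

# Standard linear beta schedule from DDPM (assumed values, not from the paper).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

def invert_to_step(x0, t, eps):
    """Map a clean observation x0 to its step-t noisy latent:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
scene = rng.standard_normal((8, 8))                    # toy "observation"
obs_a = scene
obs_b = scene + 0.3 * rng.standard_normal((8, 8))      # appearance perturbation
eps = rng.standard_normal((8, 8))                      # shared noise map

# Distance between the two views at an early vs. a late inversion step.
d_early = np.linalg.norm(invert_to_step(obs_a, 10, eps) - invert_to_step(obs_b, 10, eps))
d_late = np.linalg.norm(invert_to_step(obs_a, 900, eps) - invert_to_step(obs_b, 900, eps))
# The gap scales as sqrt(abar_t) * ||obs_a - obs_b||, so it shrinks with t.
```

Because the perturbation is attenuated by sqrt(alpha_bar[t]), observations "converge" toward a common stem as `t` increases, which is the property the paper exploits for robustness to appearance and lighting shifts.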