Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling

Authors: Yuejiang Liu, Jubayer Hamid, Annie Xie, Yoonho Lee, Max Du, Chelsea Finn

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "Experimental results show that our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks. Videos and code are available at https://bid-robot.github.io." "Empirically, we validate our theoretical analysis through a one-dimensional diagnostic simulation and evaluate our decoding method on two state-of-the-art generative policies across seven simulations and two real-world tasks (Section 5)."
Researcher Affiliation | Academia | "Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn. Department of Computer Science, Stanford University"
Pseudocode | Yes | "Algorithm 1: Bidirectional Decoding"
Open Source Code | Yes | "Our code for the experiments with Diffusion Policy is available at https://github.com/YuejiangLIU/bid_diffusion and our code for the experiments with VQ-BET is available at https://github.com/Jubayer-Hamid/bid_lerobot."
Open Datasets | Yes | "We will then evaluate BID on seven tasks across three simulation benchmarks, including Push-T (Chi et al., 2023), RoboMimic (Mandlekar et al., 2022), and Franka Kitchen (Gupta et al., 2020). We use the training data collected from human demonstrations in each benchmark. Push-T: We adopt the Push-T environment introduced in (Chi et al., 2023)... RoboMimic: We use five tasks in the RoboMimic suite (Mandlekar et al., 2022)... Franka Kitchen: We use the Franka Kitchen environment from (Gupta et al., 2020)..."
Dataset Splits | No | The paper reports the number of human demonstrations used for training in each benchmark (Push-T, RoboMimic, Franka Kitchen) and describes evaluation settings such as "Each method-setting pair is tested over 20 episodes". However, it does not specify how the data were partitioned into training, validation, and test sets: no split percentages, absolute sample counts, or partitioning methodology are given.
Hardware Specification | Yes | "We measure the computational time on a desktop equipped with an NVIDIA A5000 GPU."
Software Dependencies | No | The paper states that its implementation builds on the official code of Chi et al. (2023) for Diffusion Policy and on LeRobot (Cadene et al., 2024) for VQ-BET, but it does not give version numbers for any supporting software, such as the Python interpreter, deep learning libraries (e.g., PyTorch), or CUDA.
Experiment Setup | Yes | "We evaluate BID with a batch size of N = 16 and a mode size of K = 3. The core hyperparameters are summarized in Table 6." Table 6:
  batch size N: 16
  mode size K: 3
  prediction length l: 16
  temporal coherence decay ρ: 0.5
  moving average decay λ: 0.5
"For each simulation task, we train the model for 100-1000 epochs to reach near-optimal performance. We evaluate it in closed-loop operations, i.e., action horizon is set to 1. For forward contrast, we train the weak policy for 10-100 epochs, resulting in a suboptimal policy for each task."
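Algorithm 1 itself is not reproduced in this report, but the hyperparameters above (batch size N, prediction length l, temporal coherence decay ρ) indicate how candidate action chunks are scored. The sketch below illustrates only the backward-coherence part of a BID-style selection step: sample N chunks, then keep the one whose overlap with the previously committed plan is closest under an exponentially decaying weight ρ. The function name `bid_select`, the squared-error loss, and all array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bid_select(candidates, prev_plan, rho=0.5):
    """Toy backward-coherence selection (illustrative, not the paper's code).

    candidates: (N, l, d) array of N sampled action chunks of length l.
    prev_plan:  (l, d) chunk committed at the previous step; after one
                executed action, its last l-1 steps overlap the first
                l-1 steps of any new chunk.
    Returns the index of the candidate whose overlapping prefix best
    matches the previous plan under decay weights rho**t.
    """
    n, l, d = candidates.shape
    weights = rho ** np.arange(l - 1)                 # decay over the overlap
    diff = candidates[:, : l - 1] - prev_plan[1:]     # one-step-shifted overlap
    loss = (weights[None, :, None] * diff ** 2).sum(axis=(1, 2))
    return int(np.argmin(loss))

# usage with the paper's sizes: N = 16 candidates, chunk length l = 16, 2-D actions
rng = np.random.default_rng(0)
cands = rng.normal(size=(16, 16, 2))
prev = rng.normal(size=(16, 2))
cands[5, :15] = prev[1:]          # candidate 5 exactly continues the old plan
best = bid_select(cands, prev)    # selects index 5 (zero overlap loss)
```

The full method also includes a forward-contrast term against a deliberately undertrained weak policy; the backward term alone is shown here because it depends only on the decay ρ listed in Table 6.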