Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling

Authors: Yuejiang Liu, Jubayer Hamid, Annie Xie, Yoonho Lee, Max Du, Chelsea Finn

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | "Experimental results show that our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks. Videos and code are available at https://bid-robot.github.io." "Empirically, we validate our theoretical analysis through a one-dimensional diagnostic simulation and evaluate our decoding method on two state-of-the-art generative policies across seven simulations and two real-world tasks (Section 5)."
Researcher Affiliation | Academia | "Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn. Department of Computer Science, Stanford University"
Pseudocode | Yes | "Algorithm 1: Bidirectional Decoding"
Open Source Code | Yes | "Our code for the experiments with Diffusion Policy is available at https://github.com/YuejiangLIU/bid_diffusion and our code for the experiments with VQ-BET is available at https://github.com/Jubayer-Hamid/bid_lerobot."
Open Datasets | Yes | "We will then evaluate BID on seven tasks across three simulation benchmarks, including Push-T (Chi et al., 2023), RoboMimic (Mandlekar et al., 2022), and Franka Kitchen (Gupta et al., 2020). We use the training data collected from human demonstrations in each benchmark. Push-T: We adopt the Push-T environment introduced in (Chi et al., 2023)... RoboMimic: We use five tasks in the RoboMimic suite (Mandlekar et al., 2022)... Franka Kitchen: We use the Franka Kitchen environment from (Gupta et al., 2020)..."
Dataset Splits | No | The paper reports the number of human demonstrations used for training in each benchmark (Push-T, RoboMimic, Franka Kitchen) and describes evaluation settings such as "Each method-setting pair is tested over 20 episodes". However, it does not specify how the data were partitioned into training, validation, and test sets: no split percentages, absolute sample counts, or partitioning methodology are given.
Hardware Specification | Yes | "We measure the computational time on a desktop equipped with an NVIDIA A5000 GPU."
Software Dependencies | No | The paper states that its implementation builds on the official code of Chi et al. (2023) for Diffusion Policy and on LeRobot (Cadene et al., 2024) for VQ-BET, but it does not give version numbers for any supporting software, such as the Python interpreter, deep learning libraries (e.g., PyTorch), or CUDA.
Experiment Setup | Yes | "We evaluate BID with a batch size of N = 16 and a mode size of K = 3. The core hyperparameters are summarized in Table 6." Table 6:
  batch size N: 16
  mode size K: 3
  prediction length l: 16
  temporal coherence decay ρ: 0.5
  moving average decay λ: 0.5
"For each simulation task, we train the model for 100-1000 epochs to reach near-optimal performance. We evaluate it in closed-loop operations, i.e., action horizon is set to 1. For forward contrast, we train the weak policy for 10-100 epochs, resulting in a suboptimal policy for each task."
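Algorithm 1 itself is not reproduced in this report, but the hyperparameters above (batch size N, prediction length l, temporal coherence decay ρ) indicate how candidate action chunks are scored. The sketch below illustrates only the backward-coherence part of a BID-style selection step: sample N chunks, then keep the one whose overlap with the previously committed plan is closest under an exponentially decaying weight ρ. The function name `bid_select`, the squared-error loss, and all array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bid_select(candidates, prev_plan, rho=0.5):
    """Toy backward-coherence selection (illustrative, not the paper's code).

    candidates: (N, l, d) array of N sampled action chunks of length l.
    prev_plan:  (l, d) chunk committed at the previous step; after one
                executed action, its last l-1 steps overlap the first
                l-1 steps of any new chunk.
    Returns the index of the candidate whose overlapping prefix best
    matches the previous plan under decay weights rho**t.
    """
    n, l, d = candidates.shape
    weights = rho ** np.arange(l - 1)                 # decay over the overlap
    diff = candidates[:, : l - 1] - prev_plan[1:]     # one-step-shifted overlap
    loss = (weights[None, :, None] * diff ** 2).sum(axis=(1, 2))
    return int(np.argmin(loss))

# usage with the paper's sizes: N = 16 candidates, chunk length l = 16, 2-D actions
rng = np.random.default_rng(0)
cands = rng.normal(size=(16, 16, 2))
prev = rng.normal(size=(16, 2))
cands[5, :15] = prev[1:]          # candidate 5 exactly continues the old plan
best = bid_select(cands, prev)    # selects index 5 (zero overlap loss)
```

The full method also includes a forward-contrast term against a deliberately undertrained weak policy; the backward term alone is shown here because it depends only on the decay ρ listed in Table 6.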