FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, Lin Shao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first demonstrate that FLIP can: 1) perform model-based planning for different manipulation tasks; 2) synthesize long-horizon videos (200 frames); and 3) guide the low-level policy in executing the plan for both simulation and real-world tasks. We also evaluate the action, dynamics, and value modules separately against corresponding baselines and show the interactive, zero-shot, and scalability properties of FLIP. |
| Researcher Affiliation | Academia | Chongkai Gao National University of Singapore EMAIL Haozhuo Zhang Peking University EMAIL Zhixuan Xu, Zhehao Cai National University of Singapore EMAIL Lin Shao National University of Singapore EMAIL |
| Pseudocode | Yes | Algorithm 1 Flow-Centric Generative Planning |
| Open Source Code | No | The text does not contain an explicit statement of code release for the methodology described in the paper, nor a direct link to a code repository. It only mentions a website for video demos ('Video demos are on our website: https://nus-lins-lab.github.io/flipweb/'). |
| Open Datasets | Yes | The first one is LIBERO-LONG (Liu et al., 2024a), a long-horizon table-top manipulation benchmark... The second one is the FMB benchmark (Luo et al., 2023)... as well as Bridge-V2 (Walke et al., 2023) as the evaluation benchmarks. We use LIBERO-10, Language-Table (Lynch et al., 2023), and Bridge-V2 (Walke et al., 2023) as the evaluation datasets. Finally, we train FLIP on LIBERO-90, a large-scale simulation manipulation dataset... |
| Dataset Splits | Yes | We train FLIP on 50 × 10 long-horizon videos with a resolution of 128 × 128 × 3 and test on 50 × 10 new random initializations. (from Section 5.1) and For Bridge-V2, we train on 10k videos and test on 256 videos with a resolution of 96 × 128 × 3. (from Section 5.2) |
| Hardware Specification | No | The paper mentions 'a 6-DOF X-arm as the robot arm, and use two Real Sense D435i cameras' for real-world experiments, but does not provide specific hardware details (like GPU/CPU models, memory, etc.) for training or simulation. |
| Software Dependencies | No | The paper mentions specific models like 'Llama 3.1 8B' and 'SDXL', but it does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We report the hyperparameters of the models we trained in Table 5 and Table 6. We train all data with observation history equal to 16 and future flow horizon equal to 16. (from Section B.1) and Tables 5 and 6 detail hyperparameters like 'Encoder Layer', 'Decoder Layer', 'Hidden Size', 'Learning Rate', 'Image Patch Size', 'Head Number', and 'Layers'. |