FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, Lin Shao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first demonstrate that FLIP can: 1) perform model-based planning for different manipulation tasks; 2) synthesize long-horizon videos (200 frames); and 3) guide the low-level policy in executing the plan for both simulation and real-world tasks. We also evaluate the action, dynamics, and value modules separately against corresponding baselines and show the interactive, zero-shot, and scalability properties of FLIP. |
| Researcher Affiliation | Academia | Chongkai Gao National University of Singapore EMAIL Haozhuo Zhang Peking University EMAIL Zhixuan Xu, Zhehao Cai National University of Singapore EMAIL Lin Shao National University of Singapore EMAIL |
| Pseudocode | Yes | Algorithm 1 Flow-Centric Generative Planning |
| Open Source Code | No | The text does not contain an explicit statement of code release for the methodology described in the paper, nor a direct link to a code repository. It only mentions a website for video demos ('Video demos are on our website: https://nus-lins-lab.github.io/flipweb/'). |
| Open Datasets | Yes | The first one is LIBERO-LONG (Liu et al., 2024a), a long-horizon table-top manipulation benchmark... The second one is the FMB benchmark (Luo et al., 2023)... as well as Bridge-V2 (Walke et al., 2023) as the evaluation benchmarks. We use LIBERO-10, Language-Table (Lynch et al., 2023), and Bridge-V2 (Walke et al., 2023) as the evaluation datasets. Finally, we train FLIP on LIBERO-90, a large-scale simulation manipulation dataset... |
| Dataset Splits | Yes | We train FLIP on 50 × 10 long-horizon videos with a resolution of 128 × 128 × 3 and test on 50 × 10 new random initializations. (from Section 5.1) and For Bridge-V2, we train on 10k videos and test on 256 videos with a resolution of 96 × 128 × 3. (from Section 5.2) |
| Hardware Specification | No | The paper mentions 'a 6-DOF X-arm as the robot arm, and use two Real Sense D435i cameras' for real-world experiments, but does not provide specific hardware details (like GPU/CPU models, memory, etc.) for training or simulation. |
| Software Dependencies | No | The paper mentions specific models like 'Llama 3.1 8B' and 'SDXL', but it does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We report the hyperparameters of the models we trained in Table 5 and Table 6. We train all data with observation history equal to 16 and future flow horizon equal to 16. (from Section B.1) and Tables 5 and 6 detail hyperparameters like 'Encoder Layer', 'Decoder Layer', 'Hidden Size', 'Learning Rate', 'Image Patch Size', 'Head Number', and 'Layers'. |