Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models
Authors: Shen Tan, Dong Zhou, Xiangyu Shao, Junqiao Wang, Guanghui Sun
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in simulated complex household environments show strong zero-shot generalization and multi-task learning abilities of LOVMM. Moreover, our approach can also generalize to multiple tabletop manipulation tasks and achieve better success rates compared to other state-of-the-art methods. |
| Researcher Affiliation | Academia | Shen Tan, Dong Zhou, Xiangyu Shao, Junqiao Wang, Guanghui Sun. Harbin Institute of Technology. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed method and architecture using text, mathematical equations, and figures (Figure 2 and Figure 3), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any code-like formatted steps. |
| Open Source Code | Yes | The source code, dataset, and supplementary material are available at: https://github.com/shentan-shiina/LOVMM. |
| Open Datasets | Yes | The source code, dataset, and supplementary material are available at: https://github.com/shentan-shiina/LOVMM. |
| Dataset Splits | Yes | The models are trained for 600K steps across all seen tasks using n = 1, 10, 100 expert demonstrations in multi-task settings following CLIPort benchmark [Shridhar et al., 2022]. Then we evaluate the models on 100 seen tasks and use the best validation model to test on 100 unseen tasks. |
| Hardware Specification | Yes | All models are trained on 4 NVIDIA RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions several models and frameworks used (e.g., GPT-4, VLMaps, CLIP, Transporter network, LSeg) but does not specify the version numbers of any underlying software libraries, programming languages, or specific ancillary tools used for implementation. |
| Experiment Setup | Yes | The models are trained for 600K steps across all seen tasks using n = 1, 10, 100 expert demonstrations in multi-task settings following the CLIPort benchmark [Shridhar et al., 2022]. The model is trained with a cross-entropy loss for 2D manipulation and a Huber loss for 3D manipulation. We use c = 64, k = 36, d = 3, and d = 24 for feature channel dimensions. |
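The experiment-setup row states that training uses a cross-entropy loss for the 2D manipulation head and a Huber loss for the 3D manipulation head. As a minimal sketch of what those two objectives compute, the functions below implement standard cross-entropy over flattened heatmap logits and the standard Huber (smooth L1) loss; the array shapes, the `delta` threshold, and the function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cross_entropy_loss(logits, target_idx):
    """Cross-entropy over a flattened heatmap of action logits
    (illustrative of a 2D pick/place head; shapes are assumed)."""
    z = logits - logits.max()                      # stabilize softmax
    log_probs = z - np.log(np.exp(z).sum())        # log-softmax
    return -log_probs[target_idx]

def huber_loss(pred, target, delta=1.0):
    """Huber (smooth L1) loss for continuous regression targets
    (illustrative of a 3D manipulation head; delta is assumed)."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)                  # quadratic region
    lin = err - quad                               # linear region
    return np.mean(0.5 * quad**2 + delta * lin)

# Uniform logits over 4 cells: loss is log(4) regardless of target.
print(cross_entropy_loss(np.zeros(4), target_idx=0))   # ~1.3863
# Perfect 3D prediction gives zero Huber loss.
print(huber_loss(np.array([0.1, 0.2, 0.3]), np.array([0.1, 0.2, 0.3])))  # 0.0
```

The Huber loss is commonly preferred over plain L2 for pose regression because it caps the gradient on large errors, which makes training less sensitive to outlier demonstrations.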