Instruction-Augmented Long-Horizon Planning: Embedding Grounding Mechanisms in Embodied Mobile Manipulation
Authors: Fangyuan Wang, Shipeng Lyu, Peng Zhou, Anqing Duan, Guodong Guo, David Navarro-Alarcon
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By conducting various real-world long-horizon tasks, each consisting of seven distinct manipulatory skills, our results demonstrate that the IALP system can efficiently solve these tasks with an average success rate exceeding 80%. Our proposed method can operate as a high-level planner, equipping robots with substantial autonomy in unstructured environments through the utilization of multi-modal sensor inputs. |
| Researcher Affiliation | Academia | 1 The Hong Kong Polytechnic University; 2 Ningbo Institute of Digital Twin, Eastern Institute of Technology; 3 Great Bay University; 4 Mohamed Bin Zayed University of Artificial Intelligence |
| Pseudocode | No | The paper describes the system and methods using textual descriptions and tables (Table 2 lists actions and their preconditions/effects) but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper links a project website (https://nicehiro.github.io/IALP), but this is a general project page rather than a direct link to a source code repository for the method described in the paper. |
| Open Datasets | No | The paper describes experiments conducted in real-world environments with a robot and mentions foundation models such as SAM, CLIP, Lang SAM, and Grasp Net. It refers to a 'pre-built voxel map' and a 'pre-computed reachability map' that the system constructs internally, but it does not state that any publicly available dataset is used or released. |
| Dataset Splits | No | The paper does not provide dataset split information (e.g., percentages or sample counts for training/validation/test sets). It describes conducting 'five long-horizon tasks' in real-world settings and, for an ablation study, states: 'We generated 20 sets of observations for each task in different settings. The predicates in each set of observations were randomly assigned, i.e., set to True with a probability of 50%.' This describes how observations were generated for the ablation, not a standard dataset split. |
| Hardware Specification | Yes | The experiments were carried out using the Tiago++ (Pages, Marchionni, and Ferro 2016) mobile robot, which is equipped with wheels and two 7-degree-of-freedom arms. The robot features an integrated RGB-D camera on its head, which is utilized for perception throughout the tasks. For motion planning of the robotic arms, we employed Pinocchio (Carpentier et al. 2019). We used three computers: one for controlling the robot with the ROS Noetic system and others for generating navigation manipulation feedback. |
| Software Dependencies | Yes | High-level planning and promptable predicate checking were performed using the gpt-4o (Achiam et al. 2023) model. For motion planning of the robotic arms, we employed Pinocchio (Carpentier et al. 2019). We used three computers: one for controlling the robot with the ROS Noetic system. We utilize the Lang SAM (Medeiros 2023) model, which extracts the desired object's mask based on the object language description l_o and the image captured by the robot. A pre-trained Grasp Net (Fang et al. 2023) is used to generate several potential grasps for the robot, with Lang SAM masks filtering the object-related grasps. |
| Experiment Setup | Yes | High-level planning and promptable predicate checking were performed using the gpt-4o (Achiam et al. 2023) model. We generated 20 sets of observations for each task in different settings. The predicates in each set of observations were randomly assigned, i.e., set to True with a probability of 50%. Seven prompts are used for promptable predicate identification, PDDL state and goal generation, and task planning. If certain manipulation feasibility predicates were not met, the robot used the adjust action several times to modify its position, head tilt, and pan orientation until the manipulated object became feasible to grasp. |
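The ablation protocol quoted in the Dataset Splits and Experiment Setup rows (20 observation sets per task, each predicate independently set to True with probability 0.5) is simple enough to reproduce directly. The sketch below is a minimal illustration under stated assumptions: the predicate names, the seed, and the function name `generate_observation_sets` are hypothetical and do not come from the paper.

```python
import random

# Hypothetical predicate names for illustration; the paper's actual
# predicates are not listed in this report.
PREDICATES = ["object_visible", "object_reachable", "gripper_empty", "drawer_open"]


def generate_observation_sets(predicates, num_sets=20, p_true=0.5, seed=0):
    """Return `num_sets` observation sets, each mapping every predicate to a
    Boolean that is True with probability `p_true`, drawn independently."""
    rng = random.Random(seed)  # local RNG so a given seed is reproducible
    return [
        {name: rng.random() < p_true for name in predicates}
        for _ in range(num_sets)
    ]


if __name__ == "__main__":
    observations = generate_observation_sets(PREDICATES)
    print(len(observations))  # 20
    print(observations[0])
```

Fixing the seed per task would let another group regenerate the same observation sets, which the paper's description alone does not guarantee.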
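The Pseudocode row notes that the paper's Table 2 lists actions with their preconditions and effects, and the Experiment Setup row mentions PDDL state and goal generation. One common way to encode such an action model is STRIPS-style, with states as sets of true predicates. The sketch below assumes that encoding; the `Action` class and the actions and predicates (`pick_cup`, `go_to_shelf`, `place_cup_in_shelf`, `at_table`, and so on) are hypothetical examples, not the contents of the paper's Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """STRIPS-style action: applicable when every precondition is in the state;
    applying it removes the delete effects, then adds the add effects."""
    name: str
    preconditions: frozenset
    add_effects: frozenset = frozenset()
    del_effects: frozenset = frozenset()

    def applicable(self, state: frozenset) -> bool:
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        if not self.applicable(state):
            raise ValueError(f"preconditions of {self.name!r} are not satisfied")
        return (state - self.del_effects) | self.add_effects


def execute_plan(state: frozenset, plan) -> frozenset:
    """Apply the actions in order, failing fast if a precondition is unmet."""
    for action in plan:
        state = action.apply(state)
    return state


# Hypothetical actions and predicates, for illustration only.
PICK_CUP = Action(
    "pick_cup",
    preconditions=frozenset({"at_table", "hand_empty", "cup_on_table"}),
    add_effects=frozenset({"holding_cup"}),
    del_effects=frozenset({"hand_empty", "cup_on_table"}),
)
GO_TO_SHELF = Action(
    "go_to_shelf",
    preconditions=frozenset({"at_table"}),
    add_effects=frozenset({"at_shelf"}),
    del_effects=frozenset({"at_table"}),
)
PLACE_CUP = Action(
    "place_cup_in_shelf",
    preconditions=frozenset({"at_shelf", "holding_cup"}),
    add_effects=frozenset({"cup_in_shelf", "hand_empty"}),
    del_effects=frozenset({"holding_cup"}),
)

if __name__ == "__main__":
    initial = frozenset({"at_table", "hand_empty", "cup_on_table"})
    final = execute_plan(initial, [PICK_CUP, GO_TO_SHELF, PLACE_CUP])
    print("cup_in_shelf" in final)  # True
```

Raising on an unmet precondition mirrors the point where a system like the one described would need to re-check predicates or insert a corrective action before continuing.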
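The Software Dependencies row states that a pre-trained Grasp Net proposes candidate grasps and that Lang SAM masks filter out the ones not related to the target object, without giving the filtering rule. The sketch below assumes one plausible rule: each grasp centre has already been projected into the camera image, and a grasp is kept only if that pixel lies on the object mask. The `Grasp` type, its fields, and `filter_grasps_by_mask` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Grasp:
    u: int        # assumed: column of the grasp centre projected into the image
    v: int        # assumed: row of the grasp centre projected into the image
    score: float  # grasp quality score reported by the grasp detector


def filter_grasps_by_mask(
    grasps: Sequence[Grasp], mask: Sequence[Sequence[bool]]
) -> List[Grasp]:
    """Keep grasps whose projected centre falls on a True mask pixel,
    ordered from highest to lowest score."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    kept = [
        g for g in grasps
        if 0 <= g.v < height and 0 <= g.u < width and mask[g.v][g.u]
    ]
    return sorted(kept, key=lambda g: g.score, reverse=True)


if __name__ == "__main__":
    # 4x4 toy mask whose object occupies rows 1-2 and columns 1-2.
    mask = [[r in (1, 2) and c in (1, 2) for c in range(4)] for r in range(4)]
    grasps = [Grasp(1, 1, 0.4), Grasp(0, 0, 0.9), Grasp(2, 2, 0.7), Grasp(5, 1, 0.99)]
    print(filter_grasps_by_mask(grasps, mask))
```

Testing only the grasp centre is a simplification; requiring a minimum overlap between the gripper footprint and the mask would be a stricter alternative.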