Affordances-Oriented Planning Using Foundation Models for Continuous Vision-Language Navigation
Authors: Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, Kwan-Yee K. Wong
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (8.8% improvement on SPL). Our method can also serve as a data annotator to obtain pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor does not require any waypoint data from the simulator and achieves 47% SR, competitive with supervised methods. |
| Researcher Affiliation | Collaboration | 1The University of Hong Kong 2Shenzhen Campus of Sun Yat-sen University 3Meituan |
| Pseudocode | No | The paper describes methods and processes verbally and through figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code. It mentions "For implementation details and more results, please refer to the appendices in our arXiv version paper (Chen et al. 2024a)", which refers to details and results, not code availability. |
| Open Datasets | Yes | We conduct experiments on the challenging R2R-CE (Krantz et al. 2020) and RxR-CE (Ku et al. 2020) datasets. R2R-CE is derived from the discrete path annotations from the R2R dataset (Anderson et al. 2018) and is converted into continuous environments with the Habitat simulator (Savva et al. 2019). |
| Dataset Splits | Yes | evaluating AO-Planner on the entire validation unseen set of R2R-CE and a randomly sampled subset of 500 cases from the validation unseen set of RxR-CE. To save API costs, we also additionally sample a subset containing 100 cases from the validation unseen set of R2R-CE for ablation study. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using GPT-4, Gemini-1.5-Pro, Grounded SAM, Grounding DINO, and the Segment Anything Model. However, it does not provide specific version numbers for these or for any other software libraries or programming languages used in the implementation. |
| Experiment Setup | Yes | In our framework, we set N = 4 and collect non-overlapping views from the front, back, left, and right directions as the observation, i.e., O_t = {V_t^i}, i = 1..4. For the action space, the VLN-CE task defines four parameterized low-level actions, namely FORWARD (0.25m), ROTATE LEFT/RIGHT (15°), and STOP. In the environment, we set the FOV of the agent's camera to 90 degrees and collect observations from four directions in counterclockwise order, namely front, left, back, and right views. |
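The Experiment Setup row describes the agent's parameterized action space and its four-view panoramic observation. The following is a minimal, hypothetical sketch (not the authors' code) of those two pieces: an enum for the four VLN-CE low-level actions with the stated step sizes (FORWARD 0.25m, ROTATE 15°), and helpers that compute the headings of four non-overlapping 90° views collected counterclockwise, plus how many 15° rotations realize a given turn. All function and constant names here are illustrative assumptions.

```python
from enum import Enum

class Action(Enum):
    """The four parameterized low-level actions defined by VLN-CE."""
    STOP = 0
    FORWARD = 1       # move forward 0.25 m
    ROTATE_LEFT = 2   # turn left 15 degrees
    ROTATE_RIGHT = 3  # turn right 15 degrees

FORWARD_STEP_M = 0.25   # FORWARD step size from the paper
ROTATE_STEP_DEG = 15    # ROTATE step size from the paper
CAMERA_FOV_DEG = 90     # camera field of view from the paper

def view_headings(agent_heading_deg: float, n_views: int = 4):
    """Headings (degrees) of N non-overlapping views collected
    counterclockwise starting from the front (front, left, back, right),
    assuming heading increases counterclockwise."""
    assert n_views * CAMERA_FOV_DEG == 360, "views must tile the panorama"
    return [(agent_heading_deg + i * CAMERA_FOV_DEG) % 360
            for i in range(n_views)]

def rotations_to_face(delta_deg: float) -> int:
    """Number of 15-degree ROTATE actions needed to turn by delta_deg."""
    return round((delta_deg % 360) / ROTATE_STEP_DEG)
```

For example, `view_headings(0.0)` yields the four view headings 0°, 90°, 180°, and 270°, and turning 90° to face the left view takes six 15° rotations.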