General Scene Adaptation for Vision-and-Language Navigation
Authors: Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various methods, revealing key factors enabling agents to adapt to specific environments. Based on our findings, we propose a novel method, Graph-Retained DUET (GR-DUET), which incorporates memory-based navigation graphs with an environment-specific training strategy, achieving state-of-the-art results on all GSA-R2R splits. |
| Researcher Affiliation | Collaboration | 1The University of Queensland, 2CSIRO Data61, 3The University of Adelaide |
| Pseudocode | No | The paper describes methods and pipelines in prose and uses flowcharts (e.g., Fig. 3 for instruction orchestration) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The dataset and code are available at https://github.com/honghd16/GSA-VLN. |
| Open Datasets | Yes | To this end, we introduce the GSA-R2R dataset, which provides a comprehensive collection of environments and instructions for evaluating agent performance in both ID and OOD contexts within a single scene... The dataset and code are available at https://github.com/honghd16/GSA-VLN. We incorporate buildings from the Habitat-Matterport3D (HM3D) dataset (Ramakrishnan et al., 2021), which offers a broader range and greater number of photorealistic environments compared to the MP3D dataset... |
| Dataset Splits | Yes | Since our focus is on scene adaptation after the training phase, we include only evaluation splits and use the training split of R2R for the GSA-R2R dataset. Given the two general building types (residential and non-residential) and three types of instructions, we design five splits for both validation and testing, with their details in the appendix. The splits are named using the format Val/Test R/N-Basic/Scene/User . |
| Hardware Specification | Yes | GR-DUET requires a peak of 4.3 GB of GPU memory during inference, which is well within the capacity of modern GPUs. For instance, it can be deployed on terminal servers equipped with hardware like the NVIDIA Jetson AGX Orin or similar devices. |
| Software Dependencies | No | The paper mentions using GPT-4 and GPT-4o for generating text and CLIP-ViT/B-16 as a visual feature extractor. However, it does not provide specific version numbers for software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used for implementation. |
| Experiment Setup | Yes | In GR-DUET, we set the maximum number of episodes α = 50. The best model is selected based on the average SPL across all validation splits. For each adaptation method, we conduct the evaluation three times with randomly sequenced instructions and report the mean and standard error for each metric. ... we use CLIP-ViT/B-16 (Radford et al., 2021) as the visual feature extractor for both the navigation and speaker models for fair comparison. Second, all models are evaluated using a batch size of 1 in an online manner during evaluation. |
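The reported protocol of averaging three evaluation runs with randomly sequenced instructions can be sketched as follows. This is a minimal illustration, not the authors' code; the SPL values in `spl_runs` are hypothetical placeholders, not results from the paper.

```python
import statistics


def mean_and_stderr(values):
    """Return the mean and standard error (sample std / sqrt(n)) of repeated runs."""
    n = len(values)
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / n ** 0.5
    return mean, stderr


# Hypothetical SPL scores from three evaluation runs with shuffled instruction order
spl_runs = [57.2, 56.8, 57.5]
mean, se = mean_and_stderr(spl_runs)
print(f"SPL: {mean:.2f} ± {se:.2f}")
```

Reporting the standard error rather than raw run-to-run spread makes it easier to judge whether differences between adaptation methods exceed the noise introduced by instruction ordering.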