General Scene Adaptation for Vision-and-Language Navigation
Authors: Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various methods, revealing key factors enabling agents to adapt to specific environments. Based on our findings, we propose a novel method, Graph-Retained DUET (GR-DUET), which incorporates memory-based navigation graphs with an environment-specific training strategy, achieving state-of-the-art results on all GSA-R2R splits. |
| Researcher Affiliation | Collaboration | 1The University of Queensland, 2CSIRO Data61, 3The University of Adelaide |
| Pseudocode | No | The paper describes methods and pipelines in prose and uses flowcharts (e.g., Fig. 3 for instruction orchestration) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The dataset and code are available at https://github.com/honghd16/GSA-VLN. |
| Open Datasets | Yes | To this end, we introduce the GSA-R2R dataset, which provides a comprehensive collection of environments and instructions for evaluating agent performance in both ID and OOD contexts within a single scene... The dataset and code are available at https://github.com/honghd16/GSA-VLN. We incorporate buildings from the Habitat-Matterport3D (HM3D) dataset (Ramakrishnan et al., 2021), which offers a broader range and greater number of photorealistic environments compared to the MP3D dataset... |
| Dataset Splits | Yes | Since our focus is on scene adaptation after the training phase, we include only evaluation splits and use the training split of R2R for the GSA-R2R dataset. Given the two general building types (residential and non-residential) and three types of instructions, we design five splits for both validation and testing, with their details in the appendix. The splits are named using the format Val/Test R/N-Basic/Scene/User . |
| Hardware Specification | Yes | GR-DUET requires a peak of 4.3 GB of GPU memory during inference, which is well within the capacity of modern GPUs. For instance, it can be deployed on terminal servers equipped with hardware like the NVIDIA Jetson AGX Orin or similar devices. |
| Software Dependencies | No | The paper mentions using GPT-4 and GPT-4o for generating text and CLIP-ViT/B-16 as a visual feature extractor. However, it does not provide specific version numbers for software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used for implementation. |
| Experiment Setup | Yes | In GR-DUET, we set the maximum number of episodes α = 50. The best model is selected based on the average SPL across all validation splits. For each adaptation method, we conduct the evaluation three times with randomly sequenced instructions and report the mean and standard error for each metric. ... we use CLIP-ViT/B-16 (Radford et al., 2021) as the visual feature extractor for both the navigation and speaker models for fair comparison. Second, all models are evaluated using a batch size of 1 in an online manner during evaluation. |
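The reported protocol of averaging three evaluation runs with randomly sequenced instructions can be sketched as follows. This is a minimal illustration, not the authors' code; the SPL values in `spl_runs` are hypothetical placeholders, not results from the paper.

```python
import statistics


def mean_and_stderr(values):
    """Return the mean and standard error (sample std / sqrt(n)) of repeated runs."""
    n = len(values)
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / n ** 0.5
    return mean, stderr


# Hypothetical SPL scores from three evaluation runs with shuffled instruction order
spl_runs = [57.2, 56.8, 57.5]
mean, se = mean_and_stderr(spl_runs)
print(f"SPL: {mean:.2f} ± {se:.2f}")
```

Reporting the standard error rather than raw run-to-run spread makes it easier to judge whether differences between adaptation methods exceed the noise introduced by instruction ordering.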