Test-Time Adaptation for Online Vision-Language Navigation with Feedback-based Reinforcement Learning

Authors: Sungjune Kim, Gyeongrok Oh, Heeju Ko, Daehyun Ji, Dongwook Lee, Byung-Jun Lee, Sujin Jang, Sangpil Kim

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments on challenging VLN benchmarks demonstrate the superior adaptability of FEEDTTA, even outperforming the state-of-the-art offline training methods on the REVERIE benchmark with a single stream of learning.
Researcher Affiliation | Collaboration | 1) Department of AI, Korea University, Seoul, S. Korea; 2) Samsung AI Center, DS Division, Suwon, S. Korea.
Pseudocode | Yes | Algorithm 1: Online Learning Process of FEEDTTA
Open Source Code | No | The paper mentions re-implementing a baseline method (FSTTA) due to issues with its official code, and provides a link to that third-party code's issue tracker. However, there is no explicit statement or link for the source code of the authors' own proposed method, FEEDTTA.
Open Datasets | Yes | We empirically demonstrate the effectiveness of the proposed method through extensive experiments on the REVERIE (Qi et al., 2020), R2R (Anderson et al., 2018), and R2R-CE (Krantz et al., 2020) benchmarks.
Dataset Splits | Yes | We empirically demonstrate the effectiveness of the proposed method through extensive experiments on the REVERIE (Qi et al., 2020), R2R (Anderson et al., 2018), and R2R-CE (Krantz et al., 2020) benchmarks. Specifically, for the REVERIE dataset, the results in the paper are obtained with p = 0.01 and α = 0.2 for the validation seen split, and p = 0.05 and α = 0.2 for the validation unseen split. For R2R and R2R-CE, we use p = 0.05 and α = 0.1 for both splits.
Hardware Specification | Yes | Lastly, all experiments are conducted on a single NVIDIA Tesla A100 GPU.
Software Dependencies | No | The paper mentions using the 'GPT-4 model' as an LLM oracle, but does not specify any software libraries with version numbers (e.g., PyTorch, TensorFlow, Python, CUDA versions).
Experiment Setup | Yes | We use a batch size of 1 to properly simulate the online environment. Then, we search the best-performing values for the reversion rate p and the reversion magnitude α within {0.01, 0.05, 0.1, 0.2, 0.3} and {-0.01, -0.025, -0.05, -0.075, -0.1, -0.2, 0.3}, respectively. For the REVERIE dataset, the results in the paper are obtained with p = 0.01 and α = 0.2 for the validation seen split, and p = 0.05 and α = 0.2 for the validation unseen split. For R2R and R2R-CE, we use p = 0.05 and α = 0.1 for both splits. The learning rate η is set as 5e-6. All other hyperparameters adhere to the default configuration of the target policy.
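The search procedure quoted in the Experiment Setup row can be sketched as a plain grid search over the stated (p, α) values. This is a minimal illustration, not the paper's code: `run_online_adaptation` is a hypothetical stand-in for a full single-stream online TTA run and here returns a deterministic toy score instead of real benchmark results.

```python
import itertools

# Grids quoted in the Experiment Setup row.
P_GRID = [0.01, 0.05, 0.1, 0.2, 0.3]                          # reversion rate p
ALPHA_GRID = [-0.01, -0.025, -0.05, -0.075, -0.1, -0.2, 0.3]  # reversion magnitude α (as listed)
LEARNING_RATE = 5e-6  # learning rate η from the paper
BATCH_SIZE = 1        # batch size 1 to simulate the online setting

def run_online_adaptation(p, alpha, lr=LEARNING_RATE, batch_size=BATCH_SIZE):
    """Placeholder for one online adaptation run; returns a toy score.

    A real implementation would stream validation episodes one at a time,
    adapt the policy with feedback-based RL, and report e.g. success rate.
    """
    return 1.0 / (1.0 + abs(p - 0.05) + abs(alpha - 0.2))

def grid_search():
    """Return the (p, alpha) pair with the highest (toy) score."""
    return max(itertools.product(P_GRID, ALPHA_GRID),
               key=lambda pa: run_online_adaptation(*pa))

print(grid_search())
```

With the toy score above, the search simply picks the grid point closest to (0.05, 0.2); swapping in a real evaluation function recovers the per-split settings reported in the table.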