Scaling Vision-and-Language Navigation With Offline RL

Authors: Valay Bundele, Mahesh Bhupati, Biplab Banerjee, Aditya Grover

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that the proposed reward-conditioned approach leads to significant performance improvements, even in complex environments.
Researcher Affiliation | Academia | Valay Bundele (University of Tübingen), Mahesh Bhupati (Indian Institute of Technology Bombay), Biplab Banerjee (Indian Institute of Technology Bombay), Aditya Grover (University of California, Los Angeles)
Pseudocode | Yes | Algorithm 1 describes reward-token conditioning in detail. Algorithm 1 (Reward token conditioning), Training Phase. Input: instruction I, visual features V_t, state token q_{t-1}, ground-truth action a_t, current state s_t, next state s_{t+1}, goal location G. Output: trained policy model parameters M.
Open Source Code | Yes | Code and datasets available at https://github.com/Valaybundele/RewardC-VLN-ORL
Open Datasets | Yes | We will open-source our datasets for wider use by the community. We have created two versions of each dataset D, generated by rolling out HAMT: 1) D-R2R, generated using the train set of R2R, and 2) D-RxR, generated using the train set of RxR.
Dataset Splits | Yes | The R2R dataset has 14,025 instructions in the train set and 4,173 instructions in the test set. The validation set is divided into val-seen and val-unseen, with 1,020 and 2,349 instructions respectively. We use the English subset of RxR, which includes 26,464 path-instruction pairs in the train set, 2,939 pairs in val-seen and 4,551 pairs in val-unseen.
Hardware Specification | Yes | The experiments were performed on an NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions the Adam optimizer and ResNet-152, but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | We used the Adam optimizer with a learning rate of 1e-5 to train the models. The batch size was set to 64 and the models were trained for 500K iterations. We trained all models from scratch in the offline RL setup.
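The reward-token conditioning summarized in Algorithm 1 can be sketched as follows. This is an illustrative reconstruction, not the paper's exact scheme: the binning threshold, the token vocabulary, and the `reward_token` helper are all assumptions; the paper conditions the policy on a token derived from each transition's reward, and at inference feeds the high-reward token.

```python
import math

def reward_token(s_t, s_next, goal, threshold=0.25):
    """Hypothetical discretization: token 1 if the step from s_t to s_next
    reduces the Euclidean distance to the goal by at least `threshold`
    meters, else token 0. The paper's exact binning may differ."""
    progress = math.dist(s_t, goal) - math.dist(s_next, goal)
    return 1 if progress >= threshold else 0

# Training: label each logged transition with its reward token and feed
# (instruction, visual features, token) to the policy.
# Inference: always condition on token 1 to elicit goal-directed actions.
print(reward_token((0.0, 0.0), (1.0, 0.0), (3.0, 0.0)))  # 1 (moved 1 m closer)
print(reward_token((0.0, 0.0), (0.0, 0.0), (3.0, 0.0)))  # 0 (no progress)
```

Conditioning on the token at training time lets a single policy model both good and bad trajectories in the offline dataset, while the inference-time token selects the goal-directed behavior mode.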
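The split counts reported in the Dataset Splits row can be sanity-checked with a small script; the dictionary names are my own, and the totals follow only from the figures quoted above.

```python
# Instruction counts per split, as reported in the paper.
r2r_splits = {"train": 14025, "val_seen": 1020, "val_unseen": 2349, "test": 4173}
rxr_en_splits = {"train": 26464, "val_seen": 2939, "val_unseen": 4551}

def total(splits):
    """Sum instruction (or path-instruction pair) counts across splits."""
    return sum(splits.values())

print(total(r2r_splits))      # 21567
print(total(rxr_en_splits))   # 33954
```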
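The Experiment Setup row names standard Adam with a learning rate of 1e-5. As a reference for what that setting implies per update, here is a textbook scalar Adam step (standard formulation with default betas; only the learning rate comes from the paper, the rest is the usual Adam recipe):

```python
def adam_step(theta, grad, m, v, t, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# On the first step the bias-corrected update has magnitude ~lr, so with
# lr = 1e-5 each parameter moves by at most about 1e-5 per iteration.
theta, m, v = adam_step(theta=0.0, grad=1.0, m=0.0, v=0.0, t=1)
```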