MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Authors: Junpeng Yue, Xinrun Xu, Börje F. Karlsson, Zongqing Lu

ICLR 2025

Reproducibility assessment. Each entry below lists a variable, its result, and the supporting LLM response excerpt.
Research Type — Experimental: "Experimental results across various environments demonstrate our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents, by fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness."
Researcher Affiliation — Collaboration: Junpeng Yue (1), Xinrun Xu (2), Börje F. Karlsson (3), and Zongqing Lu (1). (1) School of Computer Science, Peking University; (2) Institute of Software, Chinese Academy of Sciences; (3) Beijing Academy of Artificial Intelligence.
Pseudocode — Yes: "MART's Agent Execution Pseudocode is shown in Algorithm 1." (Appendix H, Algorithm 1: MART Agent Execution Pseudocode.)
Open Source Code — Yes: "All the code for benchmark tasks, simulator modifications and the MLLM retriever is available at https://github.com/PKU-RL/MART."
Open Datasets — Yes: "To validate the effectiveness of our method in various environments, we perform evaluations on multiple scenarios in two environments, AI2-THOR (Kolve et al., 2017) and LEGENT (Cheng et al., 2024)."
Dataset Splits — Yes: "There are 45 tasks comprising a total of 260 sub-tasks in the training set, and 28 tasks including 158 sub-tasks in the testing set. ... To train the retriever, we use 40 tasks (10 tasks for each task type), and we use 32 tasks, also covering all task types, as the test set."
Hardware Specification — No: No specific hardware details (such as GPU or CPU models, memory, or cloud instance specifications) are mentioned in the paper for running the experiments.
Software Dependencies — Yes: "LLaVA version llava-v1.6-mistral-7b"
Experiment Setup — Yes: Table 6 (Hyperparameters of LLaVA fine-tuned by LoRA):
  LLaVA version: llava-v1.6-mistral-7b
  train batch size: 32
  eval batch size: 8
  gradient accumulation steps: 8
  learning rate (AI2-THOR): 2e-5
  mm projector lr (AI2-THOR): 2e-5
  learning rate (LEGENT): 3e-6
  mm projector lr (LEGENT): 3e-6
  lora r: 16
  lora alpha: 32
  warmup ratio: 0.05
  model max length: 32768
  lr scheduler type: cosine
  vision tower: clip-vit-large-patch14-336
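For reference, the Table 6 hyperparameters can be collected into a single configuration sketch. The dictionary below simply transcribes the reported values; the key names and the `config_for_env` helper are our own illustrative labels, not from the paper or its codebase.

```python
# Hyperparameters for LoRA fine-tuning of LLaVA, transcribed from Table 6.
# Key names are illustrative; only the values come from the paper.
LORA_FINETUNE_CONFIG = {
    "llava_version": "llava-v1.6-mistral-7b",
    "train_batch_size": 32,
    "eval_batch_size": 8,
    "gradient_accumulation_steps": 8,
    # Learning rates differ per environment (AI2-THOR vs. LEGENT).
    "learning_rate": {"AI2-THOR": 2e-5, "LEGENT": 3e-6},
    "mm_projector_lr": {"AI2-THOR": 2e-5, "LEGENT": 3e-6},
    "lora_r": 16,
    "lora_alpha": 32,
    "warmup_ratio": 0.05,
    "model_max_length": 32768,
    "lr_scheduler_type": "cosine",
    "vision_tower": "clip-vit-large-patch14-336",
}

def config_for_env(env: str) -> dict:
    """Return a flat config with the learning rates for one environment."""
    cfg = {k: v for k, v in LORA_FINETUNE_CONFIG.items()
           if k not in ("learning_rate", "mm_projector_lr")}
    cfg["learning_rate"] = LORA_FINETUNE_CONFIG["learning_rate"][env]
    cfg["mm_projector_lr"] = LORA_FINETUNE_CONFIG["mm_projector_lr"][env]
    return cfg
```

Such a flat per-environment dictionary could then be passed to a fine-tuning script's argument parser or a training-arguments object.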