MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
Authors: Junpeng Yue, Xinrun Xu, Börje F. Karlsson, Zongqing Lu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across various environments demonstrate our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents, by fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness. |
| Researcher Affiliation | Collaboration | Junpeng Yue (1), Xinrun Xu (2), Börje F. Karlsson (3), and Zongqing Lu (1). (1) School of Computer Science, Peking University; (2) Institute of Software, Chinese Academy of Sciences; (3) Beijing Academy of Artificial Intelligence |
| Pseudocode | Yes | MART's Agent Execution Pseudocode is shown in Algorithm 1 ("Algorithm 1: MART Agent Execution Pseudocode", Appendix H). |
| Open Source Code | Yes | All the code for benchmark tasks, simulator modifications and the MLLM retriever is available at https://github.com/PKU-RL/MART. |
| Open Datasets | Yes | To validate the effectiveness of our method in various environments, we perform evaluations on multiple scenarios in two environments, AI2-THOR (Kolve et al., 2017) and LEGENT (Cheng et al., 2024). |
| Dataset Splits | Yes | There are 45 tasks comprising a total of 260 sub-tasks in the training set, and 28 tasks comprising 158 sub-tasks in the testing set. ... To train the retriever, we use 40 tasks (10 tasks for each task type), and we use 32 tasks, also covering all task types, as the test set. |
| Hardware Specification | No | No specific hardware details (like GPU or CPU models, memory, or cloud instances with specs) are mentioned in the paper for running the experiments. |
| Software Dependencies | Yes | LLaVA version llava-v1.6-mistral-7b |
| Experiment Setup | Yes | Table 6: Hyperparameters of LLaVA fine-tuned by LoRA — LLaVA version: llava-v1.6-mistral-7b; train batch size: 32; eval batch size: 8; gradient accumulation steps: 8; learning rate (AI2-THOR): 2e-5; mm projector lr (AI2-THOR): 2e-5; learning rate (LEGENT): 3e-6; mm projector lr (LEGENT): 3e-6; lora r: 16; lora alpha: 32; warmup ratio: 0.05; model max length: 32768; lr scheduler type: cosine; vision tower: clip-vit-large-patch14-336 |
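The Table 6 hyperparameters above can be collected into a plain configuration for reproduction attempts. This is a minimal sketch: the dict keys, the `ENV_LEARNING_RATES` split, and the `effective_batch_size` helper are assumptions for illustration, not the authors' actual training script.

```python
# Hypothetical assembly of the Table 6 LoRA fine-tuning hyperparameters.
LORA_FINETUNE_CONFIG = {
    "model_name": "llava-v1.6-mistral-7b",
    "vision_tower": "clip-vit-large-patch14-336",
    "train_batch_size": 32,
    "eval_batch_size": 8,
    "gradient_accumulation_steps": 8,
    "lora_r": 16,
    "lora_alpha": 32,
    "warmup_ratio": 0.05,
    "model_max_length": 32768,
    "lr_scheduler_type": "cosine",
}

# Per-environment learning rates; the LM and mm-projector rates match in each case.
ENV_LEARNING_RATES = {
    "AI2-THOR": {"learning_rate": 2e-5, "mm_projector_lr": 2e-5},
    "LEGENT":   {"learning_rate": 3e-6, "mm_projector_lr": 3e-6},
}

def effective_batch_size(cfg: dict) -> int:
    """Effective optimizer batch size = per-step batch x accumulation steps."""
    return cfg["train_batch_size"] * cfg["gradient_accumulation_steps"]
```

With these values, each optimizer step sees an effective batch of 32 × 8 = 256 samples.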