MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Authors: Junpeng Yue, Xinrun Xu, Börje F. Karlsson, Zongqing Lu

ICLR 2025

Reproducibility assessment. Each entry below lists a variable, its result, and the supporting LLM response excerpt.
Research Type — Experimental: "Experimental results across various environments demonstrate our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents, by fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness."
Researcher Affiliation — Collaboration: Junpeng Yue (1), Xinrun Xu (2), Börje F. Karlsson (3), and Zongqing Lu (1). (1) School of Computer Science, Peking University; (2) Institute of Software, Chinese Academy of Sciences; (3) Beijing Academy of Artificial Intelligence.
Pseudocode — Yes: "MART's Agent Execution Pseudocode is shown in Algorithm 1." (Appendix H, Algorithm 1: MART Agent Execution Pseudocode.)
Open Source Code — Yes: "All the code for benchmark tasks, simulator modifications and the MLLM retriever is available at https://github.com/PKU-RL/MART."
Open Datasets — Yes: "To validate the effectiveness of our method in various environments, we perform evaluations on multiple scenarios in two environments, AI2-THOR (Kolve et al., 2017) and LEGENT (Cheng et al., 2024)."
Dataset Splits — Yes: "There are 45 tasks comprising a total of 260 sub-tasks in the training set, and 28 tasks including 158 sub-tasks in the testing set. ... To train the retriever, we use 40 tasks (10 tasks for each task type), and we use 32 tasks, also covering all task types, as the test set."
Hardware Specification — No: No specific hardware details (such as GPU or CPU models, memory, or cloud instance specifications) are mentioned in the paper for running the experiments.
Software Dependencies — Yes: "LLaVA version llava-v1.6-mistral-7b"
Experiment Setup — Yes: Table 6 (Hyperparameters of LLaVA fine-tuned by LoRA):
  LLaVA version: llava-v1.6-mistral-7b
  train batch size: 32
  eval batch size: 8
  gradient accumulation steps: 8
  learning rate (AI2-THOR): 2e-5
  mm projector lr (AI2-THOR): 2e-5
  learning rate (LEGENT): 3e-6
  mm projector lr (LEGENT): 3e-6
  lora r: 16
  lora alpha: 32
  warmup ratio: 0.05
  model max length: 32768
  lr scheduler type: cosine
  vision tower: clip-vit-large-patch14-336
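For reference, the Table 6 hyperparameters can be collected into a single configuration sketch. The dictionary below simply transcribes the reported values; the key names and the `config_for_env` helper are our own illustrative labels, not from the paper or its codebase.

```python
# Hyperparameters for LoRA fine-tuning of LLaVA, transcribed from Table 6.
# Key names are illustrative; only the values come from the paper.
LORA_FINETUNE_CONFIG = {
    "llava_version": "llava-v1.6-mistral-7b",
    "train_batch_size": 32,
    "eval_batch_size": 8,
    "gradient_accumulation_steps": 8,
    # Learning rates differ per environment (AI2-THOR vs. LEGENT).
    "learning_rate": {"AI2-THOR": 2e-5, "LEGENT": 3e-6},
    "mm_projector_lr": {"AI2-THOR": 2e-5, "LEGENT": 3e-6},
    "lora_r": 16,
    "lora_alpha": 32,
    "warmup_ratio": 0.05,
    "model_max_length": 32768,
    "lr_scheduler_type": "cosine",
    "vision_tower": "clip-vit-large-patch14-336",
}

def config_for_env(env: str) -> dict:
    """Return a flat config with the learning rates for one environment."""
    cfg = {k: v for k, v in LORA_FINETUNE_CONFIG.items()
           if k not in ("learning_rate", "mm_projector_lr")}
    cfg["learning_rate"] = LORA_FINETUNE_CONFIG["learning_rate"][env]
    cfg["mm_projector_lr"] = LORA_FINETUNE_CONFIG["mm_projector_lr"][env]
    return cfg
```

Such a flat per-environment dictionary could then be passed to a fine-tuning script's argument parser or a training-arguments object.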