SELU: Self-Learning Embodied Multimodal Large Language Models in Unknown Environments
Authors: Boyu Li, Haobin Jiang, Ziluo Ding, Xinrun Xu, Haoran Li, Dongbin Zhao, Zongqing Lu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper evaluates SELU in the AI2-THOR and Virtual Home environments, reporting critic improvements of approximately 28% and 30% and actor improvements of about 20% and 24%, respectively, via self-learning. Section 5, 'Experiments', together with multiple tables (e.g., Tables 1, 2, and 3) and figures (e.g., Figure 3), presents empirical results, comparisons to baselines, and ablation studies. |
| Researcher Affiliation | Academia | 1Institute of Automation, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3Beijing Academy of Artificial Intelligence 4School of Computer Science, Peking University 5Institute of Software, Chinese Academy of Sciences. Correspondence to Zongqing Lu <EMAIL>. All listed affiliations are academic or public research institutions, and the provided email address is from an academic domain. |
| Pseudocode | Yes | Appendix A.1, 'Pseudocode of SELU', presents Algorithm 1 (SELU). |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | Environments. In order to simulate embodied MLLM interactions in unknown environments, we select AI2-THOR (Kolve et al., 2022) and Virtual Home (Puig et al., 2018) for our experiments. |
| Dataset Splits | No | The paper describes an online self-learning process where data is collected through interaction with the environment and used for fine-tuning. While evaluation is performed, specific fixed training, validation, and test dataset splits with explicit percentages or sample counts for a static dataset are not provided. The statement 'we retain 30% of the last fine-tuning dataset each time and obtain the remaining 70% of the data through online interaction' refers to data usage during iterative fine-tuning, not a conventional train/test split. |
| Hardware Specification | Yes | A.6 Computational Resource Costs: We run all experiments in 8 x A100 GPUs with 40GB memory. |
| Software Dependencies | No | The paper mentions models like 'LLaVA-V1.6-Mistral-7B' and 'Qwen-VL' and fine-tuning with 'LoRA', but it does not specify versions for core software dependencies such as the programming language (e.g., Python) or deep learning frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | The specific MLLMs used are LLaVA-V1.6-Mistral-7B and Qwen-VL, fine-tuned with LoRA; Tables 7 and 8 detail the hyperparameters, such as Train_batch_size, Learning_rate_actor, Learning_rate_critic, Warmup_ratio, Weight_decay, and Model_max_length. Both models are configured with a temperature of 0 and a maximum token length of 2048 for response generation, and the maximum number of environment steps is set to 10. |
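For readers unfamiliar with the LoRA fine-tuning referenced above, the following is a minimal sketch of the core low-rank update it applies to a frozen linear layer. All sizes and the `alpha`/rank values here are illustrative placeholders, not the hyperparameters from the paper's Tables 7 and 8 (which are not reproduced in this report).

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base weight of one linear layer (sizes are illustrative only).
d_out, d_in, r = 8, 8, 2      # r is the LoRA rank, typically r << d_in
alpha = 4                      # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))   # frozen during fine-tuning

# Trainable low-rank factors. B is initialized to zero, so the
# adapter contributes nothing before training begins.
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))

def lora_forward(x, W, A, B, alpha, r):
    """Adapted layer: y = W x + (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0, the adapted layer exactly matches the frozen base layer.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)
```

Only `A` and `B` receive gradient updates during fine-tuning, which is why LoRA keeps the memory cost of adapting a 7B-parameter MLLM tractable on the A100 GPUs reported in the hardware section.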