SELU: Self-Learning Embodied Multimodal Large Language Models in Unknown Environments
Authors: Boyu Li, Haobin Jiang, Ziluo Ding, Xinrun Xu, Haoran Li, Dongbin Zhao, Zongqing Lu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper evaluates SELU in the AI2-THOR and Virtual Home environments, reporting critic improvements of approximately 28% and 30% and actor improvements of about 20% and 24%, respectively, via self-learning. Section 5, 'Experiments', together with multiple tables (e.g., Tables 1, 2, and 3) and figures (e.g., Figure 3), presents empirical results, comparisons to baselines, and ablation studies. |
| Researcher Affiliation | Academia | 1Institute of Automation, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3Beijing Academy of Artificial Intelligence 4School of Computer Science, Peking University 5Institute of Software, Chinese Academy of Sciences. Correspondence to Zongqing Lu <EMAIL>. All listed affiliations are academic or public research institutions, and the provided email address is from an academic domain. |
| Pseudocode | Yes | Appendix A.1, 'Pseudocode of SELU', presents Algorithm 1 (SELU). |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | Environments. In order to simulate embodied MLLM interactions in unknown environments, we select AI2-THOR (Kolve et al., 2022) and Virtual Home (Puig et al., 2018) for our experiments. |
| Dataset Splits | No | The paper describes an online self-learning process where data is collected through interaction with the environment and used for fine-tuning. While evaluation is performed, specific fixed training, validation, and test dataset splits with explicit percentages or sample counts for a static dataset are not provided. The statement 'we retain 30% of the last fine-tuning dataset each time and obtain the remaining 70% of the data through online interaction' refers to data usage during iterative fine-tuning, not a conventional train/test split. |
| Hardware Specification | Yes | A.6 Computational Resource Costs: We run all experiments in 8 x A100 GPUs with 40GB memory. |
| Software Dependencies | No | The paper mentions models like 'LLaVA-V1.6-Mistral-7B' and 'Qwen-VL' and fine-tuning with 'LoRA', but it does not specify versions for core software dependencies such as the programming language (e.g., Python) or deep learning frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | The specific MLLMs used are LLaVA-V1.6-Mistral-7B and Qwen-VL, fine-tuned with LoRA; Tables 7 and 8 detail the hyperparameters, such as Train_batch_size, Learning_rate_actor, Learning_rate_critic, Warmup_ratio, Weight_decay, and Model_max_length. Both models are configured with a temperature of 0 and a maximum token length of 2048 for response generation, and the maximum number of environment steps is set to 10. |
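For readers unfamiliar with the LoRA fine-tuning referenced above, the following is a minimal sketch of the core low-rank update it applies to a frozen linear layer. All sizes and the `alpha`/rank values here are illustrative placeholders, not the hyperparameters from the paper's Tables 7 and 8 (which are not reproduced in this report).

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base weight of one linear layer (sizes are illustrative only).
d_out, d_in, r = 8, 8, 2      # r is the LoRA rank, typically r << d_in
alpha = 4                      # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))   # frozen during fine-tuning

# Trainable low-rank factors. B is initialized to zero, so the
# adapter contributes nothing before training begins.
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))

def lora_forward(x, W, A, B, alpha, r):
    """Adapted layer: y = W x + (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0, the adapted layer exactly matches the frozen base layer.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)
```

Only `A` and `B` receive gradient updates during fine-tuning, which is why LoRA keeps the memory cost of adapting a 7B-parameter MLLM tractable on the A100 GPUs reported in the hardware section.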