Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution

Authors: Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our framework is competitive compared with previous methods that utilize external information, paving the way for more efficient and scalable MLLMs. Experimental results confirm that our approach significantly enhances the model's performance across multiple benchmarks, encompassing both generative and discriminative tasks. We conduct a comprehensive ablation study using the LLaVA-1.5-7B (θ0) model. ... All results are listed in Table 1.
Researcher Affiliation | Collaboration | 1 South China University of Technology, 2 JD Explore Academy, Beijing, 3 Pazhou Lab, Guangzhou. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and processes using descriptive text and diagrams (e.g., Fig. 1 and Fig. 2), but does not contain any formally structured pseudocode or algorithm blocks with numbered steps and code-like formatting.
Open Source Code | Yes | Code: https://github.com/WentaoTan/SENA
Open Datasets | Yes | The dataset D is sourced from the LLaVA-665K SFT dataset (Liu et al. 2024a), which includes COCO (Lin et al. 2014), GQA (Hudson and Manning 2019), TextVQA (Singh et al. 2019), OCR-VQA (Mishra et al. 2019), and Visual Genome (Krishna et al. 2017), totaling 665K images.
Dataset Splits | No | The paper mentions sourcing data from various datasets and using a sample for its iterative process ('We plan to use approximately 1% of D per iteration, amounting to M = 6K images. Only images are used without annotation, resulting in a final random sample of 18K images.'). However, it does not explicitly provide the training, validation, or test splits for the *evaluation benchmarks* used to report results (LLaVAW, MM-Vet, MMHal-Bench, AMBER, MMBench).
Hardware Specification | No | The paper describes the initial model (LLaVA-1.5-Vicuna-7B) and training parameters but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using LLaVA-1.5-Vicuna-7B as the initial model, but it does not specify versions for other key software components, libraries, or programming languages required for reproducibility (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | Each iteration consists of 1 epoch with a batch size of 128 and a learning rate of 2e-6. The number of diffusion noise additions, T, is set to 600, and the scaling parameter β in DPO is fixed at 0.1.
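For readers checking the reported β = 0.1 setting, the standard per-pair DPO objective the paper scales with β can be sketched as follows. This is a minimal illustration of the generic DPO loss, not code from the paper; the function name and the per-pair formulation are assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair Direct Preference Optimization loss (generic form).

    logp_w / logp_l         -- policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l -- reference-model log-probs of the same responses
    beta                    -- DPO scaling parameter (the paper fixes beta = 0.1)
    """
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log sigmoid(margin): small when the policy already
    # prefers the chosen response, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probs the margin is zero and the loss is log 2; a larger β amplifies any existing preference gap.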