Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution

Authors: Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our framework is competitive compared with previous methods that utilize external information, paving the way for more efficient and scalable MLLMs. Experimental results confirm that our approach significantly enhances the model's performance across multiple benchmarks, encompassing both generative and discriminative tasks. We conduct a comprehensive ablation study using the LLaVA-1.5-7B (θ0) model. ... All results are listed in Table 1.
Researcher Affiliation | Collaboration | 1 South China University of Technology, 2 JD Explore Academy, Beijing, 3 Pazhou Lab, Guangzhou. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and processes using descriptive text and diagrams (e.g., Fig. 1 and Fig. 2), but does not contain any formally structured pseudocode or algorithm blocks with numbered steps and code-like formatting.
Open Source Code | Yes | Code: https://github.com/WentaoTan/SENA
Open Datasets | Yes | The dataset D is sourced from the LLaVA-665K SFT dataset (Liu et al. 2024a), which includes COCO (Lin et al. 2014), GQA (Hudson and Manning 2019), TextVQA (Singh et al. 2019), OCR-VQA (Mishra et al. 2019), and Visual Genome (Krishna et al. 2017), totaling 665K images.
Dataset Splits | No | The paper mentions sourcing data from various datasets and using a sample for its iterative process ('We plan to use approximately 1% of D per iteration, amounting to M = 6K images. Only images are used without annotation, resulting in a final random sample of 18K images.'). However, it does not explicitly provide the training, validation, or test splits for the *evaluation benchmarks* used to report results (LLaVAW, MM-Vet, MMHal-Bench, AMBER, MMBench).
Hardware Specification | No | The paper describes the initial model (LLaVA-1.5-Vicuna-7B) and training parameters but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using LLaVA-1.5-Vicuna-7B as the initial model, but it does not specify versions for other key software components, libraries, or programming languages required for reproducibility (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | Each iteration consists of 1 epoch with a batch size of 128 and a learning rate of 2e-6. The number of diffusion noise additions, T, is set to 600, and the scaling parameter β in DPO is fixed at 0.1.
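For readers checking the reported β = 0.1 setting, the standard per-pair DPO objective the paper scales with β can be sketched as follows. This is a minimal illustration of the generic DPO loss, not code from the paper; the function name and the per-pair formulation are assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair Direct Preference Optimization loss (generic form).

    logp_w / logp_l         -- policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l -- reference-model log-probs of the same responses
    beta                    -- DPO scaling parameter (the paper fixes beta = 0.1)
    """
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log sigmoid(margin): small when the policy already
    # prefers the chosen response, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probs the margin is zero and the loss is log 2; a larger β amplifies any existing preference gap.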