Parrot: Multilingual Visual Instruction Tuning
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate Parrot's state-of-the-art performance across two multilingual benchmarks, surpassing Qwen2-VL and LLaVA-OneVision in multiple languages. Additionally, we evaluate our model across a broad range of multimodal tasks (e.g., MME (Fu et al., 2023), ScienceQA (Lu et al., 2022), and SEED-Bench (Li et al., 2024a)), demonstrating its competitive performance in diverse tasks. |
| Researcher Affiliation | Collaboration | School of Artificial Intelligence, Nanjing University; AI Business, Alibaba Group. |
| Pseudocode | Yes | The entire training process of Parrot is outlined in pseudocode, as shown in Algorithm 1 in the Appendix. |
| Open Source Code | Yes | Code and dataset are available at: https://github.com/AIDC-AI/Parrot. |
| Open Datasets | Yes | To address the scarcity of current multilingual benchmarks, we introduce a new benchmark encompassing six languages: English, Chinese, Portuguese, Arabic, Turkish, and Russian. This includes an extension of the MMBench dataset to six languages and a Massive Multilingual Multimodal Benchmark (MMMB) featuring 2,000 evaluation questions per language, totaling 12,000 questions. To this end, we propose a novel method, Parrot, which uses textual guidance to align visual tokens at the language level. Table 3. Details on Parrot's training data, which is derived from publicly available datasets. Stage 1: LLaVA-1.5-pretrain (Liu et al., 2023b), Laion-Caption (Schuhmann et al., 2022), CC12M-Caption (Changpinyo et al., 2021). Stage 2: LLaVA-1.5-finetune (Liu et al., 2023b), ShareGPT4V-zh, ShareGPT4V-pt, ShareGPT4V-ar, ShareGPT4V-tr, ShareGPT4V-ru (all Chen et al., 2023b). |
| Dataset Splits | Yes | This includes an extension of the MMBench dataset to six languages and a Massive Multilingual Multimodal Benchmark (MMMB) featuring 2,000 evaluation questions per language, totaling 12,000 questions. To address data scarcity in non-English languages, we use a semi-automatic method similar to the one depicted in Figure 3 to acquire image-text data. We randomly partition the ShareGPT4V dataset (Chen et al., 2023b) for each language, extracting non-duplicate, non-parallel image-text pairs for training, ultimately obtaining nearly 10K samples per language. |
| Hardware Specification | Yes | The entire training process is optimized to 21 hours on the 16-A100-GPU setup, benefiting from the relatively small training datasets. |
| Software Dependencies | No | The paper mentions "DeepSpeed ZeRO-2/ZeRO-3" in Table 13 but does not provide specific version numbers for this or any other software component, such as programming languages or libraries. |
| Experiment Setup | Yes | The initial learning rates for the two stages are set at 1e-3 and 2e-5, respectively, with batch sizes of 256 and 128. The entire training process is optimized to 21 hours on the 16-A100-GPU setup, benefiting from the relatively small training datasets. Additionally, BF16 and TF32 precision formats are employed to balance speed and accuracy throughout the training process. Table 13. The detailed training hyperparameters (Stage 1 / Stage 2): Epochs: 1; Optimizer: AdamW; Learning rate: 1e-3 / 2e-5; Learning rate scheduler: Cosine; Weight decay: 0.0; Text max length: 2048; Batch size per GPU: 16 / 8; GPUs: 16× A100-80G; Precision: BF16; Gradient checkpointing: True. |
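The reported hyperparameters can be gathered into a config sketch; the dictionary keys and stage names below are illustrative and not taken from the released Parrot code. Note that the per-GPU batch sizes (16 and 8) multiplied by the 16 GPUs reproduce the stated global batch sizes of 256 and 128.

```python
# Sketch of Parrot's two-stage training hyperparameters as reported in Table 13.
# Key and stage names are hypothetical; values come from the paper.
STAGES = {
    "stage1_pretrain": {
        "epochs": 1,
        "optimizer": "AdamW",
        "learning_rate": 1e-3,
        "lr_scheduler": "cosine",
        "weight_decay": 0.0,
        "text_max_length": 2048,
        "batch_size_per_gpu": 16,
        "num_gpus": 16,              # A100-80G
        "precision": "bf16",
        "gradient_checkpointing": True,
    },
    "stage2_finetune": {
        "epochs": 1,
        "optimizer": "AdamW",
        "learning_rate": 2e-5,
        "lr_scheduler": "cosine",
        "weight_decay": 0.0,
        "text_max_length": 2048,
        "batch_size_per_gpu": 8,
        "num_gpus": 16,              # A100-80G
        "precision": "bf16",
        "gradient_checkpointing": True,
    },
}

def global_batch_size(cfg: dict) -> int:
    """Effective batch size across all GPUs (ignoring gradient accumulation)."""
    return cfg["batch_size_per_gpu"] * cfg["num_gpus"]
```

As a consistency check, `global_batch_size` yields 256 for stage 1 and 128 for stage 2, matching the batch sizes quoted in the experiment setup.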