Parrot: Multilingual Visual Instruction Tuning
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate Parrot's state-of-the-art performance across two multilingual benchmarks, surpassing Qwen2-VL and LLaVA-OneVision in multiple languages. Additionally, we evaluate our model across a broad range of multimodal tasks (e.g., MME (Fu et al., 2023), ScienceQA (Lu et al., 2022), and SEED-Bench (Li et al., 2024a)), demonstrating its competitive performance in diverse tasks. |
| Researcher Affiliation | Collaboration | School of Artificial Intelligence, Nanjing University; AI Business, Alibaba Group. |
| Pseudocode | Yes | The entire training process of Parrot is outlined in pseudocode, as shown in Algorithm 1 in the Appendix. |
| Open Source Code | Yes | Code and dataset are available at: https://github.com/AIDC-AI/Parrot. |
| Open Datasets | Yes | To address the scarcity of current multilingual benchmarks, we introduce a new benchmark encompassing six languages: English, Chinese, Portuguese, Arabic, Turkish, and Russian. This includes an extension of the MMBench dataset to six languages and a Massive Multilingual Multimodal Benchmark (MMMB) featuring 2,000 evaluation questions per language, totaling 12,000 questions. To this end, we propose a novel method, Parrot, which uses textual guidance to align visual tokens at the language level. Table 3. Details on Parrot's training data, which is derived from publicly available datasets. Stage 1: LLaVA-1.5-pretrain (Liu et al., 2023b), Laion-Caption (Schuhmann et al., 2022), CC12M-Caption (Changpinyo et al., 2021). Stage 2: LLaVA-1.5-finetune (Liu et al., 2023b), ShareGPT4V-zh, ShareGPT4V-pt, ShareGPT4V-ar, ShareGPT4V-tr, ShareGPT4V-ru (all Chen et al., 2023b). |
| Dataset Splits | Yes | This includes an extension of the MMBench dataset to six languages and a Massive Multilingual Multimodal Benchmark (MMMB) featuring 2,000 evaluation questions per language, totaling 12,000 questions. To address data scarcity in non-English languages, we use a semi-automatic method similar to the one depicted in Figure 3 to acquire image-text data. We randomly partition the ShareGPT4V dataset (Chen et al., 2023b) for each language, extracting non-duplicate, non-parallel image-text pairs for training, ultimately obtaining nearly 10K samples per language. |
| Hardware Specification | Yes | The entire training process is optimized to 21 hours on the 16-A100-GPU setup, benefiting from the relatively small training datasets. |
| Software Dependencies | No | The paper mentions "DeepSpeed ZeRO-2/ZeRO-3" in Table 13 but does not provide specific version numbers for this or any other software component, such as programming languages or libraries. |
| Experiment Setup | Yes | The initial learning rates for the two stages are set at 1e-3 and 2e-5, respectively, with batch sizes of 256 and 128. The entire training process is optimized to 21 hours on the 16-A100-GPU setup, benefiting from the relatively small training datasets. Additionally, BF16 and TF32 precision formats are employed to balance speed and accuracy throughout the training process. Table 13. The detailed training hyperparameters (Stage 1 / Stage 2): Epochs: 1; Optimizer: AdamW; Learning rate: 1e-3 / 2e-5; Learning rate scheduler: Cosine; Weight decay: 0.0; Text max length: 2048; Batch size per GPU: 16 / 8; GPUs: 16× A100-80G; Precision: BF16; Gradient checkpointing: True. |
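The reported hyperparameters can be gathered into a config sketch; the dictionary keys and stage names below are illustrative and not taken from the released Parrot code. Note that the per-GPU batch sizes (16 and 8) multiplied by the 16 GPUs reproduce the stated global batch sizes of 256 and 128.

```python
# Sketch of Parrot's two-stage training hyperparameters as reported in Table 13.
# Key and stage names are hypothetical; values come from the paper.
STAGES = {
    "stage1_pretrain": {
        "epochs": 1,
        "optimizer": "AdamW",
        "learning_rate": 1e-3,
        "lr_scheduler": "cosine",
        "weight_decay": 0.0,
        "text_max_length": 2048,
        "batch_size_per_gpu": 16,
        "num_gpus": 16,              # A100-80G
        "precision": "bf16",
        "gradient_checkpointing": True,
    },
    "stage2_finetune": {
        "epochs": 1,
        "optimizer": "AdamW",
        "learning_rate": 2e-5,
        "lr_scheduler": "cosine",
        "weight_decay": 0.0,
        "text_max_length": 2048,
        "batch_size_per_gpu": 8,
        "num_gpus": 16,              # A100-80G
        "precision": "bf16",
        "gradient_checkpointing": True,
    },
}

def global_batch_size(cfg: dict) -> int:
    """Effective batch size across all GPUs (ignoring gradient accumulation)."""
    return cfg["batch_size_per_gpu"] * cfg["num_gpus"]
```

As a consistency check, `global_batch_size` yields 256 for stage 1 and 128 for stage 2, matching the batch sizes quoted in the experiment setup.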