Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
Authors: Longrong Yang, Dong Shen, Chaoxiang Cai, Fan Yang, Tingting Gao, Di Zhang, Xi Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC. [...] We evaluate the performance of our method on five image question-answering benchmarks, as shown in Table 1 [...] For instruction-following benchmarks, POPE (Li et al., 2023b) evaluates the degree of hallucination in model responses on three sampled subsets of COCO (Lin et al., 2014)... |
| Researcher Affiliation | Collaboration | Longrong Yang1, Dong Shen2, Chaoxiang Cai3, Fan Yang2, Tingting Gao2, Di Zhang2, Xi Li1, 1College of Computer Science and Technology, Zhejiang University 2Kuaishou Technology 3School of Software Technology, Zhejiang University |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the main text of the paper. The methodology is described in Section 3, 'METHODOLOGY', and illustrated in Figure 2. |
| Open Source Code | Yes | The code will be publicly available at https://github.com/longrongyang/STGC. |
| Open Datasets | Yes | Benchmark: Some academic-task-oriented and instruction-following benchmarks are collected for evaluating the LVLM. For academic-task-oriented benchmarks, VQA-v2 (Goyal et al., 2017b) and GQA (Hudson & Manning, 2019) assess the visual perception capabilities of models through open-ended short answers. VizWiz (Gurari et al., 2018) evaluates the zero-shot generalization of models on visual questions asked by visually impaired people. ScienceQA (Lu et al., 2022), a multiple-choice benchmark, evaluates the zero-shot generalization of models on scientific question answering. TextVQA (Singh et al., 2019a) focuses on text-rich visual question answering tasks. ChartQA (Masry et al., 2022) focuses on visual and logical reasoning tasks over charts. DocVQA (Mathew et al., 2021) focuses on reading comprehension tasks over document images. [...] Training Datasets: We use LLaVA-mix-665k (Liu et al., 2023b) as instruction tuning training data to conduct most experiments. [...] To verify the scalability of the STGC model, we conducted experiments using 1021K data from the Open-LLaVA-NeXT dataset (Lin & Long, 2024). |
| Dataset Splits | Yes | Benchmark: Some academic-task-oriented and instruction-following benchmarks are collected for evaluating the LVLM. For academic-task-oriented benchmarks, VQA-v2 (Goyal et al., 2017b) and GQA (Hudson & Manning, 2019) assess the visual perception capabilities of models through open-ended short answers. [...] Training Datasets: We use LLaVA-mix-665k (Liu et al., 2023b) as instruction tuning training data to conduct most experiments. [...] As shown in Figure 5, we select three validation datasets, i.e., SQA (Lu et al., 2022), TextVQA (Singh et al., 2019a), and MMBench (Liu et al., 2023d), to analyze expert loading and activated pathways. |
| Hardware Specification | Yes | Instruction Tuning: 8 A800-80G GPUs, Bf16 precision (Table 6). [...] As shown in Table 11, we have reduced the additional time overhead to about 20% through some engineering tricks (e.g., only computing the token-level gradient on the bias and freezing parameters that do not require gradient computation). |
| Software Dependencies | No | First, during the forward pass of a batch, we freeze the parameters except for the biases and compute the main loss. Then, we perform a backward pass, using the per-sample-grads operator call provided by PyTorch to capture the gradients g_n^1 ∈ ℝ^D and g_n^2 ∈ ℝ^D of the token t_n on the biases. [...] Similar to MoCLE (Gou et al., 2023), we encode all the instructions of different datasets using the all-MiniLM-L6-v2 variant of the Sentence Transformer model (Reimers, 2019) and cluster their embeddings via the K-means clustering algorithm. |
| Experiment Setup | Yes | Our training scheme follows MoE-LLaVA (Lin et al., 2024). The details are presented in Table 6. During instruction fine-tuning, we use a batch size of 128 and a learning rate of 2e-5. We directly use the pre-trained models from MoE-LLaVA (Lin et al., 2024) to conduct instruction tuning. [...] α=0.01, following MoE-LLaVA. [...] Table 6: Hyper-parameters in training. Epoch: 1; Learning rate: 2e-5; Learning rate schedule: Cosine; Weight decay: 0.0; Text max length: 2048; Batch size per GPU: 16; GPU Precision: Bf16. |
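The Software Dependencies excerpt describes capturing token-level gradients on the biases via PyTorch's per-sample-grads machinery. The paper does not give code, so the following is a minimal sketch of that mechanism using the standard `torch.func` pattern (`vmap` over `grad`), with a toy linear "router" standing in for the model; only the bias is differentiated, mirroring the paper's trick of freezing all other parameters.

```python
# Hedged sketch: per-sample gradients on a bias via torch.func.
# The Linear layer and data here are illustrative stand-ins, not the
# paper's actual MoE router.
import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(4, 2)  # toy router; bias is the only grad target

def loss_fn(bias, x, y):
    # Rebuild the parameter dict functionally; weight is held fixed.
    params = {"weight": model.weight, "bias": bias}
    logits = functional_call(model, params, (x,))
    return torch.nn.functional.cross_entropy(logits, y)

x = torch.randn(8, 4)           # a batch of 8 "tokens"
y = torch.randint(0, 2, (8,))

# grad() differentiates w.r.t. the bias; vmap() maps over the batch,
# yielding one gradient per token instead of a summed batch gradient.
per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(model.bias, x, y)
print(per_sample_grads.shape)   # one D-dim gradient per token
```

The key point is that a plain `.backward()` would sum gradients over the batch; wrapping `grad` in `vmap` recovers the per-token gradients the method needs, while restricting them to the bias keeps the reported overhead low.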
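The same excerpt mentions clustering instruction embeddings with K-means, following MoCLE. As a runnable sketch, the snippet below implements plain Lloyd's-iteration K-means in NumPy over random 384-dimensional vectors; in practice the embeddings would come from `SentenceTransformer("all-MiniLM-L6-v2").encode(instructions)` (all-MiniLM-L6-v2 outputs 384-dimensional embeddings), which is omitted here so the sketch runs without a model download.

```python
# Hedged sketch: K-means over instruction embeddings (random vectors
# stand in for real all-MiniLM-L6-v2 embeddings).
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center.
        labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(1)
        # Recompute centers; keep the old center if a cluster empties.
        centers = np.stack([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))  # stand-in for MiniLM embeddings
labels = kmeans(embeddings, k=4)
print(labels.shape)  # one cluster id per instruction
```

The cluster ids produced this way play the role of the instruction-group labels used downstream; any standard K-means implementation (e.g., scikit-learn's `KMeans`) would serve equally well.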