Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

Authors: Longrong Yang, Dong Shen, Chaoxiang Cai, Fan Yang, Tingting Gao, Di Zhang, Xi Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC. [...] We evaluate the performance of our method on five image question-answering benchmarks, as shown in Table 1 [...] For instruction-following benchmarks, POPE (Li et al., 2023b) evaluates the degree of hallucination in model responses on three sampled subsets of COCO (Lin et al., 2014)...
Researcher Affiliation | Collaboration | Longrong Yang (1), Dong Shen (2), Chaoxiang Cai (3), Fan Yang (2), Tingting Gao (2), Di Zhang (2), Xi Li (1). (1) College of Computer Science and Technology, Zhejiang University; (2) Kuaishou Technology; (3) School of Software Technology, Zhejiang University
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the main text of the paper. The methodology is described in Section 3, 'METHODOLOGY', and illustrated in Figure 2.
Open Source Code | Yes | The code will be publicly available at https://github.com/longrongyang/STGC.
Open Datasets | Yes | Benchmark: Some academic-task-oriented and instruction-following benchmarks are collected for evaluating the LVLM. For academic-task-oriented benchmarks, VQA-v2 (Goyal et al., 2017b) and GQA (Hudson & Manning, 2019) assess the visual perception capabilities of models through open-ended short answers. VizWiz (Gurari et al., 2018) evaluates the zero-shot generalization of models on visual questions asked by visually impaired people. ScienceQA (Lu et al., 2022), a multiple-choice benchmark, evaluates the zero-shot generalization of models on scientific question answering. TextVQA (Singh et al., 2019a) focuses on text-rich visual question answering tasks. ChartQA (Masry et al., 2022) focuses on visual and logical reasoning tasks over charts. DocVQA (Mathew et al., 2021) focuses on reading comprehension tasks over document images. [...] Training Datasets: We use LLaVA-mix-665k (Liu et al., 2023b) as instruction tuning training data to conduct most experiments. [...] To verify the scalability of the STGC model, we conducted experiments using 1021K data from the Open-LLaVA-NeXT dataset (Lin & Long, 2024).
Dataset Splits | Yes | Benchmark: Some academic-task-oriented and instruction-following benchmarks are collected for evaluating the LVLM. For academic-task-oriented benchmarks, VQA-v2 (Goyal et al., 2017b) and GQA (Hudson & Manning, 2019) assess the visual perception capabilities of models through open-ended short answers. [...] Training Datasets: We use LLaVA-mix-665k (Liu et al., 2023b) as instruction tuning training data to conduct most experiments. [...] As shown in Figure 5, we select three validation datasets, i.e., SQA (Lu et al., 2022), TextVQA (Singh et al., 2019a), and MMBench (Liu et al., 2023d), to analyze expert loading and activated pathways.
Hardware Specification | Yes | Instruction Tuning: GPU 8× A800-80G, Precision Bf16 (Table 6). [...] As shown in Table 11, we have reduced the additional time overhead to about 20% through some engineering tricks (e.g., only computing the token-level gradient on the bias and freezing parameters that do not require gradient computation).
Software Dependencies | No | First, during the forward pass of a batch, we freeze the parameters except for the biases and compute the main loss. Then, we perform a backward pass, using the `call_for_per_sample_grads` operator provided by PyTorch to capture the gradients g_n^1 ∈ R^D and g_n^2 ∈ R^D of the token t_n on the biases. [...] Similar to MoCLE (Gou et al., 2023), we encode all the instructions of different datasets using the all-MiniLM-L6-v2 variant of the Sentence Transformer model (Reimers, 2019) and cluster their embeddings via the K-means clustering algorithm.
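The bias-only per-token gradient trick quoted above can be sketched without any autograd machinery: for a linear layer y = xW + b, the gradient of the loss with respect to the bias for a single token is exactly the gradient with respect to that token's pre-activation. The sketch below uses NumPy and a toy squared-error loss, and flags "conflicting" tokens by cosine similarity against the summed gradient; this illustrates the idea of token gradient conflict, not the authors' exact STGC formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "expert": a linear layer y = x @ W + b with a per-token squared-error loss.
D_in, D_out, n_tokens = 4, 3, 5
W = rng.normal(size=(D_in, D_out))
b = np.zeros(D_out)
x = rng.normal(size=(n_tokens, D_in))       # one row per token
target = rng.normal(size=(n_tokens, D_out))

# Forward pass and per-token loss gradient w.r.t. the pre-activation.
y = x @ W + b
dL_dy = 2.0 * (y - target)                  # d/dy of per-token squared error

# For a bias, the per-token gradient equals dL/dy for that token,
# so g_n in R^{D_out} falls out of the backward pass essentially for free.
per_token_bias_grads = dL_dy                # shape (n_tokens, D_out)

# Flag tokens whose gradient points against (negative cosine similarity to)
# the summed gradient of all tokens routed to this expert.
g_sum = per_token_bias_grads.sum(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

conflicts = [n for n in range(n_tokens)
             if cosine(per_token_bias_grads[n], g_sum) < 0.0]
print("conflicting tokens:", conflicts)
```

In the full method, a PyTorch model would obtain the same per-token bias gradients via `torch.nn.utils._per_sample_grad.call_for_per_sample_grads` (or `torch.func.vmap` over `torch.func.grad`), with all non-bias parameters frozen to keep the overhead small, as the quoted passage describes.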
Experiment Setup | Yes | Our training scheme follows MoE-LLaVA (Lin et al., 2024). The details are presented in Table 6. During instruction fine-tuning, we use a batch size of 128 and a learning rate of 2e-5. We directly use the pre-trained models from MoE-LLaVA (Lin et al., 2024) to conduct instruction tuning. [...] α=0.01, following MoE-LLaVA. [...] Table 6: Hyper-parameters in training. Epoch: 1; Learning rate: 2e-5; Learning rate schedule: Cosine; Weight decay: 0.0; Text max length: 2048; Batch size per GPU: 16; GPU: 8× A800-80G; Precision: Bf16.
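For a reproduction attempt, the hyper-parameters quoted above can be collected into a single config; as a sanity check, the per-GPU batch size and GPU count multiply out to the reported global batch size of 128 (assuming no gradient accumulation, which the quotes do not mention):

```python
# Hyper-parameters copied from the quoted Table 6 and hardware row (a sketch,
# not an official config file from the repository).
instruction_tuning = {
    "epochs": 1,
    "learning_rate": 2e-5,
    "lr_schedule": "cosine",
    "weight_decay": 0.0,
    "text_max_length": 2048,
    "batch_size_per_gpu": 16,
    "num_gpus": 8,            # 8x A800-80G, per the hardware specification row
    "precision": "bf16",
}

# The paper reports a global batch size of 128, which should equal
# per-GPU batch size times the number of GPUs.
global_batch = (instruction_tuning["batch_size_per_gpu"]
                * instruction_tuning["num_gpus"])
print("global batch size:", global_batch)  # 128, matching the quoted setup
```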