Be Confident: Uncovering Overfitting in MLLM Multi-Task Tuning
Authors: Wenke Huang, Jian Liang, Guancheng Wan, Didi Zhu, He Li, Jiawei Shao, Mang Ye, Bo Du, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive empirical evaluations across diverse multi-task downstream scenarios via popular MLLM architectures. The comprehensive experiments demonstrate our effectiveness, showcasing its ability to alleviate open-response overfitting while maintaining satisfying multi-task performance. We perform a comprehensive analysis on multi-task scenarios, including both open-response datasets (Flickr30k (Young et al., 2014) and COCO-Cap (Lin et al., 2014)) and fixed-choice datasets (ScienceQA (Lu et al., 2022) and IconQA (Lu et al., 2021)). Experiments are conducted on the VILA (Lin et al., 2023) and LLaVA (Liu et al., 2023b) architectures. Through a series of ablation studies, the promising results empirically validate the effectiveness of NRCA in alleviating open-response overfitting and enhancing overall multi-task fine-tuning performance. |
| Researcher Affiliation | Collaboration | 1National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China 2Department of Computer Science and Technology, Zhejiang University, Hangzhou, China 3Institute of Artificial Intelligence (TeleAI), China Telecom, China 4Nanyang Technological University, Singapore. Correspondence to: Mang Ye <EMAIL>, Bo Du <EMAIL>. |
| Pseudocode | Yes | We provide the algorithm description in Algorithm 1. Algorithm 1 NRCA. Input: fine-tuning epochs E, downstream dataset D, overall MLLM network θ, trainable parameter module w. Output: the optimized selected MLLM module w. for e = 1, 2, ..., E do: for (x_v, x_t, y) ∈ D do: /* Construct Noisy Vision View */ µ, σ ← (x_v) via Eq. (2a); x̃_v ← (x_v, µ, σ) by Eq. (2b); h_v = φ(f(x_v)) and h̃_v = φ(f(x̃_v)); h_t = Tokenize(x_t); z = g(h_v, h_t) and z̃ = g(h̃_v, h_t); /* Token Confidence Alignment */ p_t = σ(z_t), p̃_t = σ(z̃_t) // token prob.; p = [p_t^{y_t}]_{t=1}^{T}, p̃ = [p̃_t^{y_t}]_{t=1}^{T} // token conf.; L_NRCA ← (p, p̃) through Eq. (5b); L_CE ← (p, y) in Eq. (5a); L = L_CE + λ·L_NRCA; w = w − η∇L // update param.; end for; end for |
| Open Source Code | No | We follow the official repositories to conduct the fine-tuning procedure. ... https://github.com/haotian-liu/LLaVA and https://github.com/NVlabs/VILA |
| Open Datasets | Yes | We perform a comprehensive analysis on multi-task scenarios, including both open-response datasets (Flickr30k (Young et al., 2014) and COCO-Cap (Lin et al., 2014)) and fixed-choice datasets (ScienceQA (Lu et al., 2022) and IconQA (Lu et al., 2021)). |
| Dataset Splits | No | For each dataset, we randomly sample 10k instances from the training set. We randomly select datasets from the above mentioned captioning and VQA tasks to construct the multi-task training datasets. |
| Hardware Specification | Yes | Regarding the experimental conditions, all experiments are conducted on 8 NVIDIA 4090 GPUs, each with 24GB of memory. |
| Software Dependencies | No | We follow the official LLaVA and VILA repositories to conduct the fine-tuning procedure. |
| Experiment Setup | Yes | For the visual module, we freeze the vision encoder and tune the visual connector module φ. For the LLM aspect, the LLM includes several transformer blocks, and we select the candidate set of block layers N for optimization, denoted g[N]. Thus, we obtain the learnable modules as w = {φ, g[N]}. The learning rate lr in LLaVA (Liu et al., 2023b) is 2e-4 for the LLM and 2e-5 for the visual projector. For VILA (Lin et al., 2023), we uniformly set the learning rate to 1e-4. The training epochs are set to E = 3 and E = 5. The training batch size B is set to 16 by default. The fine-tuning block for the LLM is the last N = 2 layers. The noise ratio δ is set to 0.5 by default. We set λ = 2 and provide the corresponding ablation analysis in Sec. 4.2. |
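The token-confidence alignment step of Algorithm 1 can be sketched numerically. The snippet below is a minimal NumPy illustration, not the authors' implementation: it computes per-token confidences on the clean and noisy views and combines cross-entropy (Eq. 5a) with an alignment penalty standing in for Eq. 5b, whose exact form (here assumed to be a squared confidence gap) is defined in the paper. All function and variable names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nrca_losses(logits, noisy_logits, targets, lam=2.0):
    """Sketch of the NRCA objective L = L_CE + lambda * L_NRCA.

    logits, noisy_logits: (T, V) token logits from the clean and noisy
    vision views; targets: (T,) ground-truth token ids; lam: the paper's
    default lambda = 2.
    """
    T = targets.shape[0]
    # Confidence = probability assigned to the ground-truth token.
    p = softmax(logits)[np.arange(T), targets]              # clean view
    p_noisy = softmax(noisy_logits)[np.arange(T), targets]  # noisy view
    l_ce = -np.log(p).mean()                 # standard token cross-entropy
    # Assumed alignment term: mean squared gap between the two confidences.
    l_nrca = ((p - p_noisy) ** 2).mean()
    return l_ce + lam * l_nrca, l_ce, l_nrca
```

With identical clean and noisy logits the alignment term vanishes and the objective reduces to plain cross-entropy, matching the intuition that the penalty only activates when the noisy view shifts the model's confidence.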
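The parameter-selection scheme in the setup (freeze the vision encoder, tune the connector φ and the last N = 2 LLM blocks at separate learning rates) can be sketched as a name-based partition. This is a framework-agnostic sketch under assumed module names (`vision_encoder`, `mm_projector`, `llm.layers.{i}`), which need not match the official repositories; the resulting groups would feed per-group learning rates (2e-4 / 2e-5 for LLaVA) into an optimizer.

```python
def build_param_groups(param_names, n_llm_layers=32, n_tuned=2):
    """Partition parameter names as in the reported setup: freeze the
    vision encoder, tune the visual projector and the last n_tuned
    LLM transformer blocks. Module-name prefixes are illustrative.
    """
    # Prefixes of the last N blocks, e.g. "llm.layers.30.", "llm.layers.31.".
    tuned_prefixes = tuple(
        f"llm.layers.{i}." for i in range(n_llm_layers - n_tuned, n_llm_layers)
    )
    groups = {"frozen": [], "projector": [], "llm": []}
    for name in param_names:
        if name.startswith("mm_projector."):
            groups["projector"].append(name)   # e.g. lr 2e-5 (LLaVA)
        elif name.startswith(tuned_prefixes):
            groups["llm"].append(name)         # e.g. lr 2e-4 (LLaVA)
        else:
            groups["frozen"].append(name)      # vision encoder, earlier blocks
    return groups
```

Keeping the selection purely name-based means the same routine covers both architectures; only the prefix strings and learning rates change between LLaVA and VILA.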