Be Confident: Uncovering Overfitting in MLLM Multi-Task Tuning

Authors: Wenke Huang, Jian Liang, Guancheng Wan, Didi Zhu, He Li, Jiawei Shao, Mang Ye, Bo Du, Dacheng Tao

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive empirical evaluations across diverse multi-task downstream settings with popular MLLM architectures. These experiments demonstrate the method's effectiveness, showcasing its ability to alleviate open-response overfitting while maintaining satisfying multi-task performance. We perform a comprehensive analysis on multi-task scenarios, including both open-response datasets (Flickr30k (Young et al., 2014) and COCO-Cap (Lin et al., 2014)) and fixed-choice datasets (ScienceQA (Lu et al., 2022) and IconQA (Lu et al., 2021)). Experiments are conducted on the VILA (Lin et al., 2023) and LLaVA (Liu et al., 2023b) architectures. Through a series of ablation studies, the promising results empirically validate the effectiveness of NRCA in alleviating open-response overfitting and enhancing overall multi-task fine-tuning performance.
Researcher Affiliation Collaboration 1National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China 2Department of Computer Science and Technology, Zhejiang University, Hangzhou, China 3Institute of Artificial Intelligence (TeleAI), China Telecom, China 4Nanyang Technological University, Singapore. Correspondence to: Mang Ye <EMAIL>, Bo Du <EMAIL>.
Pseudocode Yes We provide the algorithm description in Algorithm 1.
Algorithm 1 NRCA
Input: Fine-tuning epochs E, downstream dataset D, overall MLLM network θ, trainable parameter module w
Output: The optimized selected MLLM module w
for e = 1, 2, ..., E do
  for (x_v, x_t, y) ∈ D do
    /* Construct Noisy Vision View */
    µ, σ computed from x_v via Eq. (2a)
    x̃_v constructed from (x_v, µ, σ) by Eq. (2b)
    h_v = φ(f(x_v)) and h̃_v = φ(f(x̃_v))
    h_t = Tokenize(x_t)
    z = g(h_v, h_t) and z̃ = g(h̃_v, h_t)
    /* Token Confidence Alignment */
    p_t = σ(z_t), p̃_t = σ(z̃_t)                        // Token Prob.
    p = [p_t^{y_t}]_{t=1}^{T}, p̃ = [p̃_t^{y_t}]_{t=1}^{T}   // Token Conf.
    L_NRCA computed from (p, p̃) through Eq. (5b)
    L_CE computed from (p, y) in Eq. (5a)
    L = L_CE + λ · L_NRCA
    w ← w − η ∇L                                      // Update Param.
  end
end
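The inner loop of Algorithm 1 can be sketched in PyTorch. This is a minimal sketch, not the authors' implementation: the model interface (`encode_vision`, `connector`, `tokenize`, `llm`), the Gaussian form of the noisy vision view standing in for Eqs. (2a/2b), and the L2 alignment term standing in for Eq. (5b) are all our assumptions.

```python
import torch
import torch.nn.functional as F

def nrca_step(model, x_v, x_t, y, delta=0.5, lam=2.0):
    """One NRCA training step (sketch). `y` holds ground-truth token ids."""
    # Clean visual features from the frozen vision encoder f.
    f_v = model.encode_vision(x_v)
    # --- Construct Noisy Vision View ---
    # Assumed form: per-sample feature statistics (cf. Eq. (2a)) drive a
    # Gaussian perturbation, mixed in with the noisy ratio delta (cf. Eq. (2b)).
    mu = f_v.mean(dim=1, keepdim=True)
    sigma = f_v.std(dim=1, keepdim=True)
    noise = mu + sigma * torch.randn_like(f_v)
    f_v_noisy = (1.0 - delta) * f_v + delta * noise
    # Visual connector phi is shared between the two views.
    h_v, h_v_noisy = model.connector(f_v), model.connector(f_v_noisy)
    h_t = model.tokenize(x_t)
    # LLM g produces per-token logits for the clean and noisy views.
    z = model.llm(h_v, h_t)
    z_noisy = model.llm(h_v_noisy, h_t)
    # --- Token Confidence Alignment ---
    # Per-token confidence: probability assigned to the ground-truth token.
    p = F.softmax(z, dim=-1).gather(-1, y.unsqueeze(-1)).squeeze(-1)
    p_noisy = F.softmax(z_noisy, dim=-1).gather(-1, y.unsqueeze(-1)).squeeze(-1)
    loss_ce = F.cross_entropy(z.flatten(0, 1), y.flatten())   # Eq. (5a)
    # Clean/noisy confidence alignment; an L2 penalty is used here as a
    # stand-in for the paper's exact Eq. (5b).
    loss_nrca = F.mse_loss(p, p_noisy)
    return loss_ce + lam * loss_nrca
```

The combined loss is then backpropagated through the trainable modules w only, as in the final update line of Algorithm 1.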
Open Source Code No We follow the official repositories2,3 to conduct the fine-tuning procedure. ... 2https://github.com/haotian-liu/LLaVA 3https://github.com/NVlabs/VILA
Open Datasets Yes We perform a comprehensive analysis on multi-task scenarios, including both open-response datasets (Flickr30k (Young et al., 2014) and COCO-Cap (Lin et al., 2014)) and fixed-choice datasets (ScienceQA (Lu et al., 2022) and IconQA (Lu et al., 2021)).
Dataset Splits No For each dataset, we randomly sample 10k instances from the training set. We randomly select datasets from the above-mentioned captioning and VQA tasks to construct the multi-task training datasets.
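The per-dataset subsampling described above can be sketched as follows; the function name and the fixed seed are our assumptions (the paper does not state how sampling was seeded), not the authors' code.

```python
import random

def sample_training_subset(dataset, k=10_000, seed=0):
    """Randomly draw k instances from a dataset's training split.

    Mirrors the "randomly sample 10k instances from the training set"
    step; the seed makes the draw reproducible across runs.
    """
    rng = random.Random(seed)
    items = list(dataset)
    # Cap k at the dataset size so smaller splits are used in full.
    return rng.sample(items, min(k, len(items)))
```

Drawing one such subset per selected captioning or VQA dataset, then concatenating them, yields the multi-task training mixture.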
Hardware Specification Yes Regarding the experimental conditions, all experiments are conducted on 8 NVIDIA 4090 GPUs, each with 24GB of memory.
Software Dependencies No We follow the official repositories2,3 to conduct the fine-tuning procedure.
Experiment Setup Yes For the visual module, we freeze the vision encoder and tune the visual connector module φ. For the LLM aspect, the LLM comprises several transformer blocks g[L], and we select the candidate block-layer set N for optimization as g[N]. Thus, we obtain the learnable modules as w = {φ, g[N]}. The learning rate lr in LLaVA (Liu et al., 2023b) is 2e-4 for the LLM and 2e-5 for the visual projector. For VILA (Lin et al., 2023), we uniformly set the learning rate to 1e-4. The training epochs are set to E = 3 and E = 5. The training batch size B is set to 16 by default. The fine-tuned LLM blocks are the last N = 2 layers. δ denotes the noisy ratio and is set to 0.5 by default. We set λ = 2 and provide the corresponding ablation analysis in Sec. 4.2.
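The freezing and per-module learning-rate setup described above can be sketched in PyTorch. This is a hedged sketch of the LLaVA configuration (projector at 2e-5, last N = 2 LLM blocks at 2e-4); the attribute names `vision_encoder`, `connector`, and `llm.blocks`, and the choice of AdamW, are our assumptions rather than the authors' code.

```python
import torch

def build_optimizer(model, n_last_blocks=2, lr_llm=2e-4, lr_proj=2e-5):
    """Freeze the vision encoder and all but the last N LLM blocks,
    then build an optimizer with per-module learning rates."""
    # Vision encoder f stays frozen throughout fine-tuning.
    for p in model.vision_encoder.parameters():
        p.requires_grad_(False)
    blocks = list(model.llm.blocks)
    # Only the last N transformer blocks g[N] are trainable.
    for blk in blocks[:-n_last_blocks]:
        for p in blk.parameters():
            p.requires_grad_(False)
    param_groups = [
        {"params": list(model.connector.parameters()), "lr": lr_proj},
        {"params": [p for blk in blocks[-n_last_blocks:]
                    for p in blk.parameters()], "lr": lr_llm},
    ]
    return torch.optim.AdamW(param_groups)
```

For the VILA setting, the same sketch applies with both group learning rates set uniformly to 1e-4.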