Be Confident: Uncovering Overfitting in MLLM Multi-Task Tuning

Authors: Wenke Huang, Jian Liang, Guancheng Wan, Didi Zhu, He Li, Jiawei Shao, Mang Ye, Bo Du, Dacheng Tao

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive empirical evaluations across diverse multi-task downstream settings with popular MLLM architectures. These experiments demonstrate the method's effectiveness, showcasing its ability to alleviate open-response overfitting while maintaining satisfying multi-task performance. We perform a comprehensive analysis on multi-task scenarios, including both open-response datasets (Flickr30k (Young et al., 2014) and COCO-Cap (Lin et al., 2014)) and fixed-choice datasets (ScienceQA (Lu et al., 2022) and IconQA (Lu et al., 2021)). Experiments are conducted on the VILA (Lin et al., 2023) and LLaVA (Liu et al., 2023b) architectures. Through a series of ablation studies, the promising results empirically validate the effectiveness of NRCA in alleviating open-response overfitting and enhancing overall multi-task fine-tuning performance.
Researcher Affiliation Collaboration 1National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China 2Department of Computer Science and Technology, Zhejiang University, Hangzhou, China 3Institute of Artificial Intelligence (TeleAI), China Telecom, China 4Nanyang Technological University, Singapore. Correspondence to: Mang Ye <EMAIL>, Bo Du <EMAIL>.
Pseudocode Yes We provide the algorithm description in Algorithm 1.
Algorithm 1 NRCA
Input: Fine-tuning epochs E, downstream dataset D, overall MLLM network θ, trainable parameter module w
Output: The optimized selected MLLM module w
for e = 1, 2, ..., E do
  for (x_v, x_t, y) ∈ D do
    /* Construct Noisy Vision View */
    µ, σ computed from x_v via Eq. (2a)
    x̃_v constructed from (x_v, µ, σ) by Eq. (2b)
    h_v = φ(f(x_v)) and h̃_v = φ(f(x̃_v))
    h_t = Tokenize(x_t)
    z = g(h_v, h_t) and z̃ = g(h̃_v, h_t)
    /* Token Confidence Alignment */
    p_t = σ(z_t), p̃_t = σ(z̃_t)                        // Token Prob.
    p = [p_t^{y_t}]_{t=1}^{T}, p̃ = [p̃_t^{y_t}]_{t=1}^{T}   // Token Conf.
    L_NRCA computed from (p, p̃) through Eq. (5b)
    L_CE computed from (p, y) in Eq. (5a)
    L = L_CE + λ · L_NRCA
    w ← w − η ∇L                                      // Update Param.
  end
end
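The inner loop of Algorithm 1 can be sketched in PyTorch. This is a minimal sketch, not the authors' implementation: the model interface (`encode_vision`, `connector`, `tokenize`, `llm`), the Gaussian form of the noisy vision view standing in for Eqs. (2a/2b), and the L2 alignment term standing in for Eq. (5b) are all our assumptions.

```python
import torch
import torch.nn.functional as F

def nrca_step(model, x_v, x_t, y, delta=0.5, lam=2.0):
    """One NRCA training step (sketch). `y` holds ground-truth token ids."""
    # Clean visual features from the frozen vision encoder f.
    f_v = model.encode_vision(x_v)
    # --- Construct Noisy Vision View ---
    # Assumed form: per-sample feature statistics (cf. Eq. (2a)) drive a
    # Gaussian perturbation, mixed in with the noisy ratio delta (cf. Eq. (2b)).
    mu = f_v.mean(dim=1, keepdim=True)
    sigma = f_v.std(dim=1, keepdim=True)
    noise = mu + sigma * torch.randn_like(f_v)
    f_v_noisy = (1.0 - delta) * f_v + delta * noise
    # Visual connector phi is shared between the two views.
    h_v, h_v_noisy = model.connector(f_v), model.connector(f_v_noisy)
    h_t = model.tokenize(x_t)
    # LLM g produces per-token logits for the clean and noisy views.
    z = model.llm(h_v, h_t)
    z_noisy = model.llm(h_v_noisy, h_t)
    # --- Token Confidence Alignment ---
    # Per-token confidence: probability assigned to the ground-truth token.
    p = F.softmax(z, dim=-1).gather(-1, y.unsqueeze(-1)).squeeze(-1)
    p_noisy = F.softmax(z_noisy, dim=-1).gather(-1, y.unsqueeze(-1)).squeeze(-1)
    loss_ce = F.cross_entropy(z.flatten(0, 1), y.flatten())   # Eq. (5a)
    # Clean/noisy confidence alignment; an L2 penalty is used here as a
    # stand-in for the paper's exact Eq. (5b).
    loss_nrca = F.mse_loss(p, p_noisy)
    return loss_ce + lam * loss_nrca
```

The combined loss is then backpropagated through the trainable modules w only, as in the final update line of Algorithm 1.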
Open Source Code No We follow the official repositories2,3 to conduct the fine-tuning procedure. ... 2https://github.com/haotian-liu/LLaVA 3https://github.com/NVlabs/VILA
Open Datasets Yes We perform a comprehensive analysis on multi-task scenarios, including both open-response datasets (Flickr30k (Young et al., 2014) and COCO-Cap (Lin et al., 2014)) and fixed-choice datasets (ScienceQA (Lu et al., 2022) and IconQA (Lu et al., 2021)).
Dataset Splits No For each dataset, we randomly sample 10k instances from the training set. We randomly select datasets from the above-mentioned captioning and VQA tasks to construct the multi-task training datasets.
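The per-dataset subsampling described above can be sketched as follows; the function name and the fixed seed are our assumptions (the paper does not state how sampling was seeded), not the authors' code.

```python
import random

def sample_training_subset(dataset, k=10_000, seed=0):
    """Randomly draw k instances from a dataset's training split.

    Mirrors the "randomly sample 10k instances from the training set"
    step; the seed makes the draw reproducible across runs.
    """
    rng = random.Random(seed)
    items = list(dataset)
    # Cap k at the dataset size so smaller splits are used in full.
    return rng.sample(items, min(k, len(items)))
```

Drawing one such subset per selected captioning or VQA dataset, then concatenating them, yields the multi-task training mixture.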
Hardware Specification Yes Regarding the experimental conditions, all experiments are conducted on 8 NVIDIA 4090 GPUs, each with 24GB of memory.
Software Dependencies No We follow the official repositories2,3 to conduct the fine-tuning procedure.
Experiment Setup Yes For the visual module, we freeze the vision encoder and tune the visual connector module φ. For the LLM aspect, the LLM comprises several transformer blocks g[L], and we select the candidate block-layer set N for optimization as g[N]. Thus, we obtain the learnable modules as w = {φ, g[N]}. The learning rate lr in LLaVA (Liu et al., 2023b) is 2e-4 for the LLM and 2e-5 for the visual projector. For VILA (Lin et al., 2023), we uniformly set the learning rate to 1e-4. The training epochs are set to E = 3 and E = 5. The training batch size B is set to 16 by default. The fine-tuned LLM blocks are the last N = 2 layers. δ denotes the noisy ratio and is set to 0.5 by default. We set λ = 2 and provide the corresponding ablation analysis in Sec. 4.2.
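The freezing and per-module learning-rate setup described above can be sketched in PyTorch. This is a hedged sketch of the LLaVA configuration (projector at 2e-5, last N = 2 LLM blocks at 2e-4); the attribute names `vision_encoder`, `connector`, and `llm.blocks`, and the choice of AdamW, are our assumptions rather than the authors' code.

```python
import torch

def build_optimizer(model, n_last_blocks=2, lr_llm=2e-4, lr_proj=2e-5):
    """Freeze the vision encoder and all but the last N LLM blocks,
    then build an optimizer with per-module learning rates."""
    # Vision encoder f stays frozen throughout fine-tuning.
    for p in model.vision_encoder.parameters():
        p.requires_grad_(False)
    blocks = list(model.llm.blocks)
    # Only the last N transformer blocks g[N] are trainable.
    for blk in blocks[:-n_last_blocks]:
        for p in blk.parameters():
            p.requires_grad_(False)
    param_groups = [
        {"params": list(model.connector.parameters()), "lr": lr_proj},
        {"params": [p for blk in blocks[-n_last_blocks:]
                    for p in blk.parameters()], "lr": lr_llm},
    ]
    return torch.optim.AdamW(param_groups)
```

For the VILA setting, the same sketch applies with both group learning rates set uniformly to 1e-4.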