Module-wise Adaptive Distillation for Multimodality Foundation Models

Authors: Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, Tianyi Zhou

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model [48] as the teacher model.
Researcher Affiliation | Collaboration | Chen Liang (Georgia Tech), Jiahui Yu (Google Research), Ming-Hsuan Yang (UC Merced, Google Research), Matthew Brown (Google Research), Yin Cui (NVIDIA Research), Tuo Zhao (Georgia Tech), Boqing Gong (Google Research), Tianyi Zhou (University of Maryland, College Park)
Pseudocode | Yes | Algorithm 1, "OPTIMA: Module Adaptive Distillation"
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We conduct task-specific distillation on three multimodal understanding tasks: visual question answering (VQA, [14]), visual entailment (SNLI-VE, [47]), and visual reasoning (NLVR2, [37]). We further train and evaluate the model using the Microsoft COCO Caption dataset [6] and the Karpathy test split, respectively.
Dataset Splits | Yes | For the VQA task, we conduct downstream fine-tuning and testing on the VQA 2.0 dataset [14], which consists of 83k images and 444k questions for training, and 41k images and 214k questions for validation. For the image captioning task on COCO, we use [6] for training and testing; it contains 113k images for training, 5k images for validation, and 5k images for testing.
Hardware Specification | Yes | We also extend our thanks to the TPU team for providing abundant computational infrastructure and resources.
Software Dependencies | No | The paper mentions software components such as "Adafactor with decoupled weight decay" (an optimizer) and a "sentence-piece model", but does not specify version numbers for these or for other key software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For all tasks, we train the student for T = 100k steps. We use Adafactor with decoupled weight decay [34] as the optimizer, with β = (0.9, 0.999) and a learning rate of 1×10^-3 under a linear decay schedule. We set α1 = 0, α2 = 1, and α3 = 1×10^-2 for all tasks. For OPTIMA, we set γ = 0.98, T0 = 10, P = 100, and an interval length of T/P = 1k steps. Full details are deferred to Appendix A.4.
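For concreteness, the reported hyperparameters can be collected into a configuration sketch. This is illustrative only: the dict layout and key names are assumptions (the authors' code is unreleased); the numeric values are the ones reported above.

```python
# Hyperparameters as reported in the paper. Key names are illustrative,
# not taken from the authors' (unreleased) code.
optima_config = {
    "total_steps": 100_000,          # T: student training steps
    "optimizer": "Adafactor + decoupled weight decay",
    "betas": (0.9, 0.999),
    "learning_rate": 1e-3,           # with a linear decay schedule
    "alpha1": 0.0,
    "alpha2": 1.0,
    "alpha3": 1e-2,
    "gamma": 0.98,                   # OPTIMA hyperparameter γ
    "T0": 10,
    "num_intervals": 100,            # P
}

# Length of each OPTIMA interval: T / P = 1k steps.
steps_per_interval = optima_config["total_steps"] // optima_config["num_intervals"]
```

This makes the arithmetic explicit: with T = 100k and P = 100, each interval spans 1k training steps, consistent with the T/P = 1k figure reported above.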