Module-wise Adaptive Distillation for Multimodality Foundation Models
Authors: Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, Tianyi Zhou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model [48] as the teacher model. |
| Researcher Affiliation | Collaboration | Chen Liang (Georgia Tech); Jiahui Yu (Google Research); Ming-Hsuan Yang (UC Merced, Google Research); Matthew Brown (Google Research); Yin Cui (NVIDIA Research); Tuo Zhao (Georgia Tech); Boqing Gong (Google Research); Tianyi Zhou (University of Maryland, College Park) |
| Pseudocode | Yes | Algorithm 1 OPTIMA: Module Adaptive Distillation |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We conduct task-specific distillation on three multimodal understanding tasks: visual question answering (VQA, [14]), visual entailment (SNLI-VE, [47]), and visual reasoning (NLVR2, [37]). We further train and evaluate the model using the Microsoft COCO Caption dataset [6] and the Karpathy-test split, respectively. |
| Dataset Splits | Yes | For the VQA task, we conduct downstream fine-tuning and testing on the VQA 2.0 dataset [14], which consists of 83k images and 444k questions for training, and 41k images and 214k questions for validation. For the image captioning task on COCO, we use [6] for training and testing. It contains 113k images for training, 5k images for validation, and 5k images for testing. |
| Hardware Specification | Yes | We also extend our thanks to the TPU team for providing abundant computational infrastructure and resources. |
| Software Dependencies | No | The paper mentions software components like "Adafactor with decoupled weight decay" (an optimizer) and "sentence-piece model" but does not specify version numbers for these or other key software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For all tasks, we train the student for T = 100k steps. We use Adafactor with decoupled weight decay [34] as the optimizer with β = (0.9, 0.999) and a learning rate of 1×10⁻³ with a linear decay schedule. We set α₁ = 0, α₂ = 1, and α₃ = 1×10⁻² for all tasks. For OPTIMA, we set γ = 0.98, T₀ = 10, P = 100, and T′ = T/P = 1k. Full details are deferred to Appendix A.4. |
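The schedule quoted in the experiment-setup row can be made concrete. Below is a minimal Python sketch, not the authors' code: it assumes only what the row states (T = 100k total steps, a linear learning-rate decay from 1×10⁻³, and P = 100 OPTIMA periods of T/P = 1k steps each); the function names `linear_decay_lr` and `period_of` are illustrative.

```python
# Hedged sketch of the reported training schedule (assumptions noted above).
T = 100_000                 # total student training steps
P = 100                     # number of OPTIMA periods
STEPS_PER_PERIOD = T // P   # T/P = 1k steps per period
BASE_LR = 1e-3              # peak learning rate
GAMMA = 0.98                # OPTIMA decay factor (exact role: Appendix A.4)

def linear_decay_lr(step: int) -> float:
    """Linearly decay the learning rate from BASE_LR to 0 over T steps."""
    return BASE_LR * max(0.0, 1.0 - step / T)

def period_of(step: int) -> int:
    """0-based index of the OPTIMA period a given training step falls in."""
    return step // STEPS_PER_PERIOD
```

Under this reading, module assignments would be re-evaluated once per 1k-step period while the learning rate decays linearly across the full 100k-step run.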