ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts
Authors: Yuanchen Wu, Junlong Du, Ke Yan, Shouhong Ding, Xiaoqiang Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude less training data. |
| Researcher Affiliation | Collaboration | (1) School of Computer Engineering and Science, Shanghai University, Shanghai; (2) Tencent Youtu Lab, Shanghai. EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology in natural language and mathematical formulas, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | The pre-training dataset is composed of two in-domain datasets (i.e., COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017)) and one web dataset (i.e., CC3M (Sharma et al., 2018)). |
| Dataset Splits | Yes | Fine-tuned caption performance on COCO (Karpathy split) and NoCaps (validation set). Fine-tuned VQA performance on VQAv2 (test set). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models or CPU specifications. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, but it does not specify version numbers for any software libraries or frameworks such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | All our models are trained using the AdamW optimizer with a weight decay of 0.05. Automated data augmentation (AutoAug) is applied during both the pre-training and fine-tuning stages. For pre-training, the learning rate is set to 3e-4 with a total of 10 epochs. During fine-tuning for VQA, we use a learning rate of 1e-5 and train for 10 epochs. For fine-tuning the captioning model, the learning rate is 1e-5 with a total of 3 epochs. |
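The reported experiment setup can be collected into a small configuration sketch. The hyperparameter values (weight decay 0.05, per-stage learning rates and epoch counts) come from the paper's text; the stage names and the `hyperparams` helper are illustrative, not from the paper.

```python
# Hedged sketch of ToVE's reported training hyperparameters.
# Values are as stated in the paper; stage names are our own labels.
TRAINING_STAGES = {
    "pretrain":         {"lr": 3e-4, "epochs": 10},
    "finetune_vqa":     {"lr": 1e-5, "epochs": 10},
    "finetune_caption": {"lr": 1e-5, "epochs": 3},
}
WEIGHT_DECAY = 0.05  # shared across all stages (AdamW)

def hyperparams(stage: str) -> dict:
    """Return the optimizer settings reported for a given training stage."""
    cfg = dict(TRAINING_STAGES[stage])
    cfg["weight_decay"] = WEIGHT_DECAY
    return cfg
```

With a framework such as PyTorch, each stage's settings would feed directly into `torch.optim.AdamW(model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight_decay"])`; the paper does not state the framework, so this mapping is an assumption.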