Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models
Authors: Shizhan Gong, Yankai Jiang, Qi Dou, Farzan Farnia
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on the CLIP-Benchmark (LAION-AI, 2022) and the probing benchmark of Covert et al. (2025) reveal that CLIP after alignment demonstrates improved accuracy in zero-shot classification and dense prediction tasks, without requiring fine-tuning of the text encoder. Subsequently, we integrate the aligned CLIP vision encoder into two pre-trained MLLMs, LLaVA (Liu et al., 2024) and OpenFlamingo (Awadalla et al., 2023), evaluating their performance across several standard VQA benchmarks. This also yields significant enhancements over the original CLIP, even without fine-tuning the large language model (LLM) component. Our main contributions can be summarized as follows. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China 2Shanghai Artificial Intelligence Laboratory, Shanghai, China. Correspondence to: Shizhan Gong <EMAIL>, Yankai Jiang <EMAIL>, Qi Dou <EMAIL>, Farzan Farnia <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual explanations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are available at https://github.com/peterant330/KUEA. |
| Open Datasets | Yes | We utilize the training set of ImageNet-1K (Deng et al., 2009) as our training data, which contains 1.28M images. We experiment on a diverse benchmark composed of 12 datasets, including (1) common objects: ImageNet (Deng et al., 2009), CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Caltech101 (Fei-Fei et al., 2004); (2) fine-grained objects: Oxford Pets (Parkhi et al., 2012), DTD (Cimpoi et al., 2014), FER2013 (Goodfellow et al., 2013); (3) domain-specific applications: PCAM (Veeling et al., 2018), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2018); and (4) out-of-distribution benchmarks: ImageNet-O (Hendrycks et al., 2021), ImageNet-Sketch (Wang et al., 2019). These experiments were conducted on both the Flickr30K (Young et al., 2014) and MSCOCO (Chen et al., 2015) datasets. We conduct experiments on four related benchmarks: (1) SVHN (Netzer et al., 2011); (2) GTSRB (Stallkamp et al., 2012); (3) CLEVR Distance (Johnson et al., 2017); and (4) CLEVR Counts. We utilize diverse benchmarks comprising multiple tasks, including (1) open-ended visual question answering: VQAv2 (Goyal et al., 2017) and TextVQA (Singh et al., 2019); (2) localization: RefCOCO, RefCOCO+, RefCOCOg (Kazemzadeh et al., 2014; Yu et al., 2016); and (3) closed-set prediction: VSR (Liu et al., 2023), TallyQA (Acharya et al., 2019), POPE (Li et al., 2023b), and AI2D (Kembhavi et al., 2016). We also perform GPT-aided evaluation, LLaVA-Bench (Liu et al., 2024). We follow the evaluation pipeline of the original paper, which tests the in-context-learning ability of the MLLMs on several VQA benchmarks, including COCO (Chen et al., 2015), Flickr-30K (Young et al., 2014), VQAv2 (Goyal et al., 2017), OK-VQA (Marino et al., 2019), TextVQA (Singh et al., 2019), VizWiz (Gurari et al., 2018), and Hateful Memes (Kiela et al., 2020). |
| Dataset Splits | Yes | We utilize the training set of ImageNet-1K (Deng et al., 2009) as our training data, which contains 1.28M images. We adopt the standard CLIP-Benchmark (LAION-AI, 2022) as the pipeline for evaluation. The experiments are conducted on the MSCOCO dataset (Lin et al., 2014). Following the original setup, we report the macro-averaged recall to account for class imbalances. For the training data, we utilize the LLaVA-1.5 data mixture (Liu et al., 2024), which contains 665k examples and is the tuning dataset for the original LLaVA. For each dataset, we sample a few in-context demonstrations from the training split uniformly at random, and prompt the model to give answers to the test samples. In Appendix C.7, we perform ablation studies using 25% and 50% of the ImageNet dataset for fine-tuning. |
| Hardware Specification | Yes | With two 4090 GPUs, the alignment of ViT-L-14 takes around 30 hours, which is efficient and hardware-friendly compared to the pre-training phase of CLIP. All the experiments are conducted on NVIDIA GeForce RTX 4090 GPUs. The training is executed in bf16 format across four NVIDIA GeForce RTX 4090 GPUs, with a batch size of 1 per device. |
| Software Dependencies | No | The paper mentions optimizers like AdamW (Loshchilov, 2017) and refers to the official implementation from LLaVA for LoRA fine-tuning, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The kernel we use is the normalized polynomial kernel of degree 3, which has been commonly adopted by several well-known studies in the literature (Stein et al., 2023; Kang et al., 2023). We also explore different kernel choices in the ablation studies. For DINOv2, we set the kernel hyper-parameters to γ = 1/dim_emb and c = 1, while for CLIP, they are set as trainable. More detailed settings for each experiment can be found in Appendix B.1. Table 7. Detailed hyper-parameter setups (ViT-B-16 / ViT-L-14 / ViT-L-14-336): coefficient w = 0.5 / 0.5 / 1.0; number of GPUs = 2 / 2 / 4; batch size = 128 / 64 / 32; training epochs = 2 / 2 / 4; warm-up steps = 1400 / 2800 / 5600; optimizer AdamW (Loshchilov, 2017) with weight decay 1e-4 and β = (0.9, 0.999); learning rate 1e-5; scheduler Cosine Annealing LR. In this section, we elaborate on the implementation details of the LLM fine-tuning of LLaVA, which we used to further demonstrate the enhancement of the vision encoder with alignment. We employ the official implementation from LLaVA for LoRA fine-tuning. The training is conducted on a mixture of LLaVA-1.5 data for one epoch, using the following LoRA configuration: r = 128 and α = 256. The training is executed in bf16 format across four NVIDIA GeForce RTX 4090 GPUs, with a batch size of 1 per device. To address the small batch size, we apply a gradient accumulation step of 32. The optimizer used is AdamW (Loshchilov, 2017), set with a learning rate of 2e-4 and a weight decay of 0. |
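The reported kernel setup (normalized polynomial kernel of degree 3, with γ = 1/dim_emb and c = 1 for DINOv2) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are ours, and for CLIP the paper treats γ and c as trainable rather than fixed as here.

```python
import numpy as np

def poly_kernel(x, y, degree=3, gamma=None, c=1.0):
    """Polynomial kernel (gamma * <x, y> + c) ** degree on row-vector batches."""
    if gamma is None:
        gamma = 1.0 / x.shape[-1]  # gamma = 1 / dim_emb, as reported for DINOv2
    return (gamma * x @ y.T + c) ** degree

def normalized_poly_kernel(x, y, degree=3, gamma=None, c=1.0):
    """Normalize so that k(v, v) = 1: K_xy / sqrt(k_xx * k_yy)."""
    k_xy = poly_kernel(x, y, degree, gamma, c)
    k_xx = np.diagonal(poly_kernel(x, x, degree, gamma, c))
    k_yy = np.diagonal(poly_kernel(y, y, degree, gamma, c))
    return k_xy / np.sqrt(np.outer(k_xx, k_yy))

# Toy check: the normalized kernel of a batch with itself has unit diagonal.
emb = np.random.default_rng(0).normal(size=(4, 8))
K = normalized_poly_kernel(emb, emb)
print(np.allclose(np.diag(K), 1.0))  # True
```

The normalization is the kernel analogue of cosine similarity, which keeps the kernel values on a comparable scale regardless of embedding norms.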
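The LoRA fine-tuning settings quoted above can be collected into a single configuration sketch. The dictionary keys below are illustrative, not the official LLaVA argument names; only the values come from the paper. Note that the per-device batch size of 1 with gradient accumulation of 32 across four GPUs gives an effective batch size of 128.

```python
# Hedged sketch of the reported LoRA fine-tuning configuration
# (key names are ours; values are as reported in the paper).
lora_config = {
    "lora_r": 128,
    "lora_alpha": 256,
    "precision": "bf16",
    "num_gpus": 4,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 32,
    "optimizer": "AdamW",
    "learning_rate": 2e-4,
    "weight_decay": 0.0,
    "epochs": 1,
}

effective_batch = (lora_config["num_gpus"]
                   * lora_config["per_device_batch_size"]
                   * lora_config["gradient_accumulation_steps"])
print(effective_batch)  # 128
```

Gradient accumulation trades wall-clock time for memory here: each optimizer step aggregates 32 micro-batches per GPU, so the update statistics match a much larger batch than the RTX 4090's memory would allow directly.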