Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models
Authors: Shizhan Gong, Yankai Jiang, Qi Dou, Farzan Farnia
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on the CLIP-Benchmark (LAION-AI, 2022) and the probing benchmark of Covert et al. (2025) reveal that CLIP after alignment demonstrates improved accuracy in zero-shot classification and dense prediction tasks, without requiring fine-tuning of the text encoder. Subsequently, we integrate the aligned CLIP vision encoder into two pre-trained MLLMs, LLaVA (Liu et al., 2024) and OpenFlamingo (Awadalla et al., 2023), evaluating their performance across several standard VQA benchmarks. This also yields significant enhancements over the original CLIP, even without fine-tuning the large language model (LLM) component. Our main contributions can be summarized as follows. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China 2Shanghai Artificial Intelligence Laboratory, Shanghai, China. Correspondence to: Shizhan Gong <EMAIL>, Yankai Jiang <EMAIL>, Qi Dou <EMAIL>, Farzan Farnia <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual explanations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are available at https://github.com/peterant330/KUEA. |
| Open Datasets | Yes | We utilize the training set of ImageNet-1K (Deng et al., 2009) as our training data, which contains 1.28M images. We experiment on a diverse benchmark composed of 12 datasets, including (1) common objects: ImageNet (Deng et al., 2009), CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Caltech101 (Fei-Fei et al., 2004); (2) fine-grained objects: Oxford Pets (Parkhi et al., 2012), DTD (Cimpoi et al., 2014), FER2013 (Goodfellow et al., 2013); (3) domain-specific applications: PCAM (Veeling et al., 2018), RESISC45 (Cheng et al., 2017), EuroSAT (Helber et al., 2018); and (4) out-of-distribution benchmarks: ImageNet-O (Hendrycks et al., 2021), ImageNet-Sketch (Wang et al., 2019). These experiments were conducted on both the Flickr30K (Young et al., 2014) and MSCOCO (Chen et al., 2015) datasets. We conduct experiments on four related benchmarks: (1) SVHN (Netzer et al., 2011); (2) GTSRB (Stallkamp et al., 2012); (3) CLEVR Distance (Johnson et al., 2017); and (4) CLEVR Counts. We utilize diverse benchmarks comprising multiple tasks, including (1) open-ended visual question answering: VQAv2 (Goyal et al., 2017) and TextVQA (Singh et al., 2019); (2) localization: RefCOCO, RefCOCO+, RefCOCOg (Kazemzadeh et al., 2014; Yu et al., 2016); and (3) closed-set prediction: VSR (Liu et al., 2023), TallyQA (Acharya et al., 2019), POPE (Li et al., 2023b), and AI2D (Kembhavi et al., 2016). We also perform GPT-aided evaluation, LLaVA-Bench (Liu et al., 2024). We follow the evaluation pipeline of the original paper, which tests the in-context-learning ability of the MLLMs on several VQA benchmarks, including COCO (Chen et al., 2015), Flickr-30K (Young et al., 2014), VQAv2 (Goyal et al., 2017), OK-VQA (Marino et al., 2019), TextVQA (Singh et al., 2019), VizWiz (Gurari et al., 2018), and Hateful Memes (Kiela et al., 2020). |
| Dataset Splits | Yes | We utilize the training set of ImageNet-1K (Deng et al., 2009) as our training data, which contains 1.28M images. We adopt the standard CLIP-Benchmark (LAION-AI, 2022) as the pipeline for evaluation. The experiments are conducted on the MSCOCO dataset (Lin et al., 2014). Following the original setup, we report the macro-averaged recall to account for class imbalances. For the training data, we utilize the LLaVA-1.5 data mixture (Liu et al., 2024), which contains 665k examples and is the tuning dataset for the original LLaVA. For each dataset, we sample a few in-context demonstrations from the training split uniformly at random, and prompt the model to give answers to the test samples. In Appendix C.7, we perform ablation studies using 25% and 50% of the ImageNet dataset for fine-tuning. |
| Hardware Specification | Yes | With two 4090 GPUs, the alignment of ViT-L-14 takes around 30 hours, which is efficient and hardware-friendly compared to the pre-training phase of CLIP. All the experiments are conducted on NVIDIA GeForce RTX 4090 GPUs. The training is executed in bf16 format across four NVIDIA GeForce RTX 4090 GPUs, with a batch size of 1 per device. |
| Software Dependencies | No | The paper mentions optimizers like AdamW (Loshchilov, 2017) and refers to the official implementation from LLaVA for LoRA fine-tuning, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The kernel we use is the normalized polynomial kernel of degree 3, which has been commonly adopted by several well-known studies in the literature (Stein et al., 2023; Kang et al., 2023). We also explore different kernel choices in the ablation studies. For DINOv2, we set the kernel hyper-parameters to γ = 1/dim_emb and c = 1, while for CLIP, they are set as trainable. More detailed settings for each experiment can be found in Appendix B.1. Table 7. Detailed hyper-parameter setups (ViT-B-16 / ViT-L-14 / ViT-L-14-336): coefficient w = 0.5 / 0.5 / 1.0; number of GPUs = 2 / 2 / 4; batch size = 128 / 64 / 32; training epochs = 2 / 2 / 4; warm-up steps = 1400 / 2800 / 5600; optimizer AdamW (Loshchilov, 2017) with weight decay 1e-4 and β = (0.9, 0.999); learning rate 1e-5; scheduler Cosine Annealing LR. In this section, we elaborate on the implementation details of the LLM fine-tuning of LLaVA, which we used to further demonstrate the enhancement of the vision encoder with alignment. We employ the official implementation from LLaVA for LoRA fine-tuning. The training is conducted on a mixture of LLaVA-1.5 data for one epoch, using the following LoRA configuration: r = 128 and α = 256. The training is executed in bf16 format across four NVIDIA GeForce RTX 4090 GPUs, with a batch size of 1 per device. To address the small batch size, we apply a gradient accumulation step of 32. The optimizer used is AdamW (Loshchilov, 2017), set with a learning rate of 2e-4 and a weight decay of 0. |
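The reported kernel setup (normalized polynomial kernel of degree 3, with γ = 1/dim_emb and c = 1 for DINOv2) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are ours, and for CLIP the paper treats γ and c as trainable rather than fixed as here.

```python
import numpy as np

def poly_kernel(x, y, degree=3, gamma=None, c=1.0):
    """Polynomial kernel (gamma * <x, y> + c) ** degree on row-vector batches."""
    if gamma is None:
        gamma = 1.0 / x.shape[-1]  # gamma = 1 / dim_emb, as reported for DINOv2
    return (gamma * x @ y.T + c) ** degree

def normalized_poly_kernel(x, y, degree=3, gamma=None, c=1.0):
    """Normalize so that k(v, v) = 1: K_xy / sqrt(k_xx * k_yy)."""
    k_xy = poly_kernel(x, y, degree, gamma, c)
    k_xx = np.diagonal(poly_kernel(x, x, degree, gamma, c))
    k_yy = np.diagonal(poly_kernel(y, y, degree, gamma, c))
    return k_xy / np.sqrt(np.outer(k_xx, k_yy))

# Toy check: the normalized kernel of a batch with itself has unit diagonal.
emb = np.random.default_rng(0).normal(size=(4, 8))
K = normalized_poly_kernel(emb, emb)
print(np.allclose(np.diag(K), 1.0))  # True
```

The normalization is the kernel analogue of cosine similarity, which keeps the kernel values on a comparable scale regardless of embedding norms.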
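The LoRA fine-tuning settings quoted above can be collected into a single configuration sketch. The dictionary keys below are illustrative, not the official LLaVA argument names; only the values come from the paper. Note that the per-device batch size of 1 with gradient accumulation of 32 across four GPUs gives an effective batch size of 128.

```python
# Hedged sketch of the reported LoRA fine-tuning configuration
# (key names are ours; values are as reported in the paper).
lora_config = {
    "lora_r": 128,
    "lora_alpha": 256,
    "precision": "bf16",
    "num_gpus": 4,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 32,
    "optimizer": "AdamW",
    "learning_rate": 2e-4,
    "weight_decay": 0.0,
    "epochs": 1,
}

effective_batch = (lora_config["num_gpus"]
                   * lora_config["per_device_batch_size"]
                   * lora_config["gradient_accumulation_steps"])
print(effective_batch)  # 128
```

Gradient accumulation trades wall-clock time for memory here: each optimizer step aggregates 32 micro-batches per GPU, so the update statistics match a much larger batch than the RTX 4090's memory would allow directly.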