Probabilistic Language-Image Pre-Training

Authors: Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip. In the experiments, ProLIP slightly outperforms the deterministic CLIP model in zero-shot classification (ZSC) tasks (e.g., CLIP shows 67.2 ImageNet ZSC accuracy, while ProLIP shows 67.6).
Researcher Affiliation | Industry | Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun (NAVER AI Lab)
Pseudocode | Yes | Due to the page limit, we describe the detailed algorithm in Appendix A.8. Algorithm 1: Bayesian Prompt Re-Weighting (BPRW)
Open Source Code | Yes | The code is available at https://github.com/naver-ai/prolip. Our results can be reproduced with our open-source implementation (https://github.com/naver-ai/prolip) and released pre-trained weights on Hugging Face (https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291).
Open Datasets | Yes | We implement ProLIP based on openclip (Ilharco et al., 2021) and the DataComp-1B dataset (Gadre et al., 2024). We mainly use the DataComp-1B dataset (Gadre et al., 2024), a filtered version of the LAION-5B dataset (Schuhmann et al., 2022), as our training dataset.
Dataset Splits | Yes | We evaluate the models on 38 tasks of the DataComp evaluation suite (Gadre et al., 2024); the full evaluation datasets are listed in Appendix B.3. We use multiple prompts for each task following the DataComp evaluation suite. We construct the hierarchical image dataset by using the validation set of the COCO dataset (Lin et al., 2014).
Hardware Specification | Yes | We train ProLIP models using 32 NVIDIA H100 GPUs with bfloat16 precision, taking about one day to train a ViT-B/16 model with 1.28B seen samples.
Software Dependencies | No | We implement ProLIP based on openclip (Ilharco et al., 2021) and the DataComp-1B dataset (Gadre et al., 2024). We use the AdamW (Kingma & Ba, 2015) optimizer following the official openclip implementation. The paper mentions software tools like openclip and AdamW but does not specify their version numbers.
Experiment Setup | Yes | We use a learning rate of 0.0005, beta1 of 0.9, beta2 of 0.95, weight decay of 0.2, and a batch size of 512 per GPU (i.e., the full batch size is 512 * 32 = 16384). We apply 10000 warmup steps, after which the learning rate is decayed by cosine learning rate scheduling. We use image augmentations of scaling 0.8 to 1.0, color jittering, and grayscale. Within the mini-batch, we select 12.5% of image-text pairs and drop 75% of their tokens to compute L_inc(x, x_masked). In all experiments, we fix α1 and α2 in Equation (6) as 10e-7 and 0.001, respectively.
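The warmup-plus-cosine schedule quoted above can be sketched as a short function. This is a minimal illustration, not the paper's implementation: the base learning rate (0.0005) and warmup length (10000 steps) are from the quote, while `total_steps` is a hypothetical value derived from 1.28B seen samples divided by the full batch size of 16384 (= 78125 steps), and the decay-to-zero endpoint is an assumption.

```python
import math

def lr_at_step(step, base_lr=5e-4, warmup_steps=10_000, total_steps=78_125):
    """Linear warmup followed by cosine decay (a common sketch of this schedule).

    total_steps is assumed here as 1.28e9 seen samples / 16384 batch = 78125.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the rate is 0 at step 0, peaks at base_lr when warmup ends, and decays to 0 at the final step.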