Probabilistic Language-Image Pre-Training
Authors: Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip. In the experiments, ProLIP slightly outperforms the deterministic CLIP model in zero-shot classification (ZSC) tasks (e.g., CLIP shows 67.2 ImageNet ZSC accuracy, while ProLIP shows 67.6). |
| Researcher Affiliation | Industry | Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun (NAVER AI Lab) |
| Pseudocode | Yes | Due to the page limit, we describe the detailed algorithm in Appendix A.8. Algorithm 1: Bayesian Prompt Re-Weighting (BPRW). |
| Open Source Code | Yes | The code is available at https://github.com/naver-ai/prolip. Our results can be reproduced with our open-source implementation (https://github.com/naver-ai/prolip) and released pre-trained weights on Hugging Face (https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291). |
| Open Datasets | Yes | We implement ProLIP based on openclip (Ilharco et al., 2021) and the DataComp-1B dataset (Gadre et al., 2024). We mainly use the DataComp-1B dataset (Gadre et al., 2024), a filtered version of the LAION-5B dataset (Schuhmann et al., 2022), as our training dataset. |
| Dataset Splits | Yes | We evaluate the models on 38 tasks of the DataComp evaluation suite (Gadre et al., 2024); the full evaluation datasets are listed in Appendix B.3. We use multiple prompts for each task following the DataComp evaluation suite. We construct the hierarchical image dataset using the validation set of the COCO dataset (Lin et al., 2014). |
| Hardware Specification | Yes | We train ProLIP models using 32 NVIDIA H100 GPUs with bfloat16 precision, taking about one day to train a ViT-B/16 model with 1.28B seen samples. |
| Software Dependencies | No | We implement ProLIP based on openclip (Ilharco et al., 2021) and the DataComp-1B dataset (Gadre et al., 2024). We use the AdamW (Kingma & Ba, 2015) optimizer following the official openclip implementation. The paper mentions software tools like openclip and AdamW but does not specify their version numbers. |
| Experiment Setup | Yes | We use a learning rate of 0.0005, beta1 of 0.9, beta2 of 0.95, weight decay of 0.2, and a batch size of 512 per GPU (i.e., the full batch size is 512 * 32 = 16384). We apply 10000 warmup steps, after which the learning rate is decayed by cosine learning rate scheduling. We use image augmentations of random scaling from 0.8 to 1.0, color jittering, and grayscale. Within each mini-batch, we select 12.5% of the image-text pairs and drop 75% of their tokens to compute L_inc(x, x_masked). In all experiments, we fix α1 and α2 in Equation (6) as 10^-7 and 0.001, respectively. |
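The reported optimization recipe (peak LR 0.0005, 10000 linear warmup steps, cosine decay, 1.28B seen samples at a global batch size of 16384) can be sketched as a learning-rate schedule. This is a minimal sketch, not the authors' code: the `lr_at` function name is hypothetical, the total step count is derived from the reported numbers (1.28e9 / 16384 = 78125), and the exact openclip scheduling details (e.g., a nonzero floor LR) may differ.

```python
import math

# Hyperparameters quoted in the Experiment Setup row above.
PEAK_LR = 0.0005
WARMUP_STEPS = 10_000
# Derived: 1.28B seen samples / (512 per GPU * 32 GPUs) = 78,125 steps.
TOTAL_STEPS = 1_280_000_000 // (512 * 32)

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        # Linear warmup over the first 10k steps.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

With the AdamW settings above, this would pair with betas=(0.9, 0.95) and weight_decay=0.2; the schedule peaks exactly at step 10000 and reaches zero at the final step.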