Probabilistic Language-Image Pre-Training

Authors: Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip. In the experiments, ProLIP slightly outperforms the deterministic CLIP model in zero-shot classification (ZSC) tasks (e.g., CLIP shows 67.2 ImageNet ZSC accuracy, while ProLIP shows 67.6).
Researcher Affiliation | Industry | Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun (NAVER AI Lab)
Pseudocode | Yes | Due to the page limit, we describe the detailed algorithm in Appendix A.8. Algorithm 1: Bayesian Prompt Re-Weighting (BPRW)
Open Source Code | Yes | The code is available at https://github.com/naver-ai/prolip. Our results can be reproduced with our open-source implementation (https://github.com/naver-ai/prolip) and released pre-trained weights on Hugging Face (https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291).
Open Datasets | Yes | We implement ProLIP based on openclip (Ilharco et al., 2021) and the DataComp-1B dataset (Gadre et al., 2024). We mainly use the DataComp-1B dataset (Gadre et al., 2024), a filtered version of the LAION-5B dataset (Schuhmann et al., 2022), as our training dataset.
Dataset Splits | Yes | We evaluate the models on 38 tasks of the DataComp evaluation suite (Gadre et al., 2024); the full evaluation datasets are listed in Appendix B.3. We use multiple prompts for each task following the DataComp evaluation suite. We construct the hierarchical image dataset by using the validation set of the COCO dataset (Lin et al., 2014).
Hardware Specification | Yes | We train ProLIP models using 32 NVIDIA H100 GPUs with bfloat16 precision, taking about one day to train a ViT-B/16 model with 1.28B seen samples.
Software Dependencies | No | We implement ProLIP based on openclip (Ilharco et al., 2021) and the DataComp-1B dataset (Gadre et al., 2024). We use the AdamW (Kingma & Ba, 2015) optimizer following the official openclip implementation. The paper mentions software tools like openclip and AdamW but does not specify their version numbers.
Experiment Setup | Yes | We use a learning rate of 0.0005, beta1 of 0.9, beta2 of 0.95, weight decay of 0.2, and a batch size of 512 per GPU (i.e., the full batch size is 512 * 32 = 16384). We apply 10000 warmup steps, after which the learning rate is decayed by cosine learning rate scheduling. We use image augmentations of scaling 0.8 to 1.0, color jittering, and grayscale. Within the mini-batch, we select 12.5% of image-text pairs and drop 75% of their tokens to compute L_inc(x, x_masked). In all experiments, we fix α1 and α2 in Equation (6) as 10e-7 and 0.001, respectively.
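The warmup-plus-cosine schedule quoted above can be sketched as a short function. This is a minimal illustration, not the paper's implementation: the base learning rate (0.0005) and warmup length (10000 steps) are from the quote, while `total_steps` is a hypothetical value derived from 1.28B seen samples divided by the full batch size of 16384 (= 78125 steps), and the decay-to-zero endpoint is an assumption.

```python
import math

def lr_at_step(step, base_lr=5e-4, warmup_steps=10_000, total_steps=78_125):
    """Linear warmup followed by cosine decay (a common sketch of this schedule).

    total_steps is assumed here as 1.28e9 seen samples / 16384 batch = 78125.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the rate is 0 at step 0, peaks at base_lr when warmup ends, and decays to 0 at the final step.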